Review

Today’s class was review.

Recording of lecture from 1/28/2022

import numpy as np
import pandas as pd
import altair as alt

Dictionaries, and their relationship to pandas Series and pandas DataFrames.

d = {"a":5,"b":10,"chris":30}
d["b"]
10
e = {"a":[5,6,7],"b":10,"chris":[30,2.4,-20]}

Making a DataFrame from a dictionary.

df = pd.DataFrame(e)
df
a b chris
0 5 10 30.0
1 6 10 2.4
2 7 10 -20.0
df.a
0    5
1    6
2    7
Name: a, dtype: int64
df["a"]
0    5
1    6
2    7
Name: a, dtype: int64

Making a Series from a dictionary.

pd.Series(d)
a         5
b        10
chris    30
dtype: int64
df
a b chris
0 5 10 30.0
1 6 10 2.4
2 7 10 -20.0
df.columns
Index(['a', 'b', 'chris'], dtype='object')

List comprehension

Practice exercise:

  • Make a list of Trues (bool type) using list comprehension, where the length of the list is the number of rows in df.

# These are strings, not bools
['true' for x in range(3)]
['true', 'true', 'true']
["true" for x in range(len(df))]
['true', 'true', 'true']
[true for x in range(len(df))]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5575/3159580370.py in <module>
----> 1 [true for x in range(len(df))]

/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5575/3159580370.py in <listcomp>(.0)
----> 1 [true for x in range(len(df))]

NameError: name 'true' is not defined

Here is the correct answer.

[True for x in range(len(df))]
[True, True, True]
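A shorter alternative, not shown in class, is list multiplication, which repeats a single value and avoids the comprehension entirely. A minimal sketch, using a small stand-in for df:

```python
import pandas as pd

# small stand-in for df (hypothetical values)
df = pd.DataFrame({"a": [5, 6, 7]})

# repeat the single value True once per row of df
bool_list = [True] * len(df)
bool_list  # [True, True, True]
```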

Putting a new column in a DataFrame

One way to create a new column.

df["new column"] = [True for x in range(len(df))]
df
a b chris new column
0 5 10 30.0 True
1 6 10 2.4 True
2 7 10 -20.0 True

If the new column is filled with a single repeated value, then you can create it more concisely:

df["new column 2"] = False
df
a b chris new column new column 2
0 5 10 30.0 True False
1 6 10 2.4 True False
2 7 10 -20.0 True False

Indexing

Indexing using iloc.

df.iloc[1,0] = 20
df
a b chris new column new column 2
0 5 10 30.0 True False
1 20 10 2.4 True False
2 7 10 -20.0 True False

Indexing using loc.

df.loc[1,'a'] = 20
df
a b chris new column new column 2
0 5 10 30.0 True False
1 20 10 2.4 True False
2 7 10 -20.0 True False
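The two cells above give the same result only because the row labels of df happen to equal the integer positions. A sketch with a hypothetical non-default index shows where loc and iloc actually differ: iloc is positional, loc is label-based.

```python
import pandas as pd

# row labels deliberately differ from positions (hypothetical example)
s = pd.DataFrame({"a": [5, 20, 7]}, index=[10, 20, 30])

# iloc is positional: the row at position 1, i.e. the second row
positional = s.iloc[1, 0]   # 20

# loc is label-based: the row whose label is 10, i.e. the first row
labeled = s.loc[10, "a"]    # 5
```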

Second largest value in a column.

Find the second largest value in the “a” column of df.

df["a"]
0     5
1    20
2     7
Name: a, dtype: int64
df["a"].sort_values()
0     5
2     7
1    20
Name: a, dtype: int64
df["a"].sort_values(ascending=False)
1    20
2     7
0     5
Name: a, dtype: int64

In class I kept accidentally writing [1] instead of .iloc[1]. Here is the correct way to get the element at integer position 1 in a pandas Series.

df["a"].sort_values(ascending=False).iloc[1]
7

Without the ascending keyword argument, we have to take the second-to-last entry.

df["a"].sort_values().iloc[-2]
7
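pandas can also do the sorting for us: the Series method nlargest keeps the biggest values in decreasing order, so the last of the top two is the second largest. A sketch, assuming the same values 5, 20, 7 as in the "a" column above:

```python
import pandas as pd

s = pd.Series([5, 20, 7], name="a")

# nlargest(2) keeps the two biggest values, largest first
second_largest = s.nlargest(2).iloc[-1]
second_largest  # 7
```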

Practice with axis

I don’t think we have used median before, but the following should make sense. The most important part is recognizing what axis=0 means: the median is computed down each column, producing one value per column.

df
a b chris new column new column 2
0 5 10 30.0 True False
1 20 10 2.4 True False
2 7 10 -20.0 True False
df.median(axis=0)
a                7.0
b               10.0
chris            2.4
new column       1.0
new column 2     0.0
dtype: float64
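As a sketch of the axis convention on a purely numeric DataFrame (hypothetical values): axis=0 collapses the rows, producing one number per column, while axis=1 collapses the columns, producing one number per row.

```python
import pandas as pd

small = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

col_medians = small.median(axis=0)  # one median per column: x -> 2.0, y -> 20.0
row_medians = small.median(axis=1)  # one median per row: e.g. row 0 -> 5.5
```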

Flattening a NumPy array

A = np.array([[2,5,1],[3,1,10]])
A.reshape((-1))
array([ 2,  5,  1,  3,  1, 10])
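reshape(-1) is one of several equivalent ways to flatten. The ravel and flatten methods give the same values; ravel returns a view when possible, while flatten always returns a copy.

```python
import numpy as np

A = np.array([[2, 5, 1], [3, 1, 10]])

flat1 = A.reshape(-1)  # reshape with the length inferred from -1
flat2 = A.ravel()      # a view of A when possible
flat3 = A.flatten()    # always a copy
```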

Slicing

df
a b chris new column new column 2
0 5 10 30.0 True False
1 20 10 2.4 True False
2 7 10 -20.0 True False

Access every other element in the row with label 1 using slicing.

df.loc[1,::2]
a                  20
chris             2.4
new column 2    False
Name: 1, dtype: object

Change every other element in the row with label 1 using slicing.

df.loc[1,::2] = [i**2 for i in range(3)]

The same change, this time written with an explicit list.

df.loc[1,::2] = [0,1,4]
df
a b chris new column new column 2
0 5 10 30.0 True False
1 0 10 1.0 True 4
2 7 10 -20.0 True False

Square every element in a DataFrame

df2 = df.iloc[:,:3]
df2
a b chris
0 5 10 30.0
1 0 10 1.0
2 7 10 -20.0
df2.loc[1] = df2.loc[1]**2
df2
a b chris
0 5 10 30.0
1 0 100 1.0
2 7 10 -20.0
df2**2
a b chris
0 25 100 900.0
1 0 10000 1.0
2 49 100 400.0

Try doing that same thing (squaring every entry in df2) using map, apply, or applymap.

df2.applymap(lambda x: x**2)
a b chris
0 25 100 900.0
1 0 10000 1.0
2 49 100 400.0

Warning: notice that applymap doesn’t change the original DataFrame.

df2
a b chris
0 5 10 30.0
1 0 100 1.0
2 7 10 -20.0
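Since applymap returns a new DataFrame, the way to keep the squared values is to assign the result back to the variable. A minimal sketch with hypothetical values (note that newer versions of pandas rename applymap to DataFrame.map):

```python
import pandas as pd

df2 = pd.DataFrame({"a": [5, 0, 7]})

# assign the result back so df2 actually changes
df2 = df2.applymap(lambda x: x**2)
```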

Example using apply

df2.apply(lambda c: c.sum(), axis = 0)
a         12.0
b        120.0
chris     11.0
dtype: float64
df2.apply(lambda r: r.sum(), axis = 1)
0     45.0
1    101.0
2     -3.0
dtype: float64
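These particular apply calls duplicate a pandas built-in: sum with the axis keyword argument gives the same answers directly. A sketch with hypothetical numeric values:

```python
import pandas as pd

df2 = pd.DataFrame({"a": [5, 0, 7], "b": [10, 100, 10]})

col_sums = df2.sum(axis=0)  # same as df2.apply(lambda c: c.sum(), axis=0)
row_sums = df2.sum(axis=1)  # same as df2.apply(lambda r: r.sum(), axis=1)
```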

What if we tried to use applymap instead?

df2.applymap(lambda a: a.sum())
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5575/3466583239.py in <module>
----> 1 df2.applymap(lambda a: a.sum())

[... long traceback through pandas internals omitted ...]

AttributeError: 'int' object has no attribute 'sum'

Sample exercise: What causes the above error?

Sample answer: applymap calls the function on each individual entry, so a is a single number from the DataFrame, and a number does not have a sum method. To sum the entries, use apply together with the axis keyword argument instead.

A smaller piece of code that raises the same error:

a = 5.1
a.sum()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5575/223743580.py in <module>
      1 a = 5.1
----> 2 a.sum()

AttributeError: 'float' object has no attribute 'sum'

Of the three methods map, applymap, and apply, apply is definitely the trickiest to understand. The most natural example of apply we have seen was applying pd.to_numeric to every column. Notice that the input to pd.to_numeric is an entire Series, not an individual entry.
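A sketch of that pd.to_numeric pattern, using a hypothetical DataFrame of numeric strings: apply hands each entire column (a Series) to the function.

```python
import pandas as pd

# hypothetical DataFrame whose entries are strings of numbers
str_df = pd.DataFrame({"x": ["1", "2"], "y": ["3.5", "4.5"]})

# each column Series is passed to pd.to_numeric, one column at a time
num_df = str_df.apply(pd.to_numeric, axis=0)
```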

Working with datetime entries

df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
df.columns
Index(['Index', 'Highest Charting Position', 'Number of Times Charted',
       'Week of Highest Charting', 'Song Name', 'Streams', 'Artist',
       'Artist Followers', 'Song ID', 'Genre', 'Release Date', 'Weeks Charted',
       'Popularity', 'Danceability', 'Energy', 'Loudness', 'Speechiness',
       'Acousticness', 'Liveness', 'Tempo', 'Duration (ms)', 'Valence',
       'Chord'],
      dtype='object')

This way of getting a column from a DataFrame is called attribute access:

df.Genre
0                  ['indie rock italiano', 'italian pop']
1                                  ['australian hip hop']
2                                                 ['pop']
3                                       ['pop', 'uk pop']
4                           ['lgbtq+ hip hop', 'pop rap']
                              ...                        
1551                       ['dance pop', 'pop', 'uk pop']
1552             ['sertanejo', 'sertanejo universitario']
1553    ['dance pop', 'electropop', 'pop', 'post-teen ...
1554                       ['brega funk', 'funk carioca']
1555                             ['pop', 'post-teen pop']
Name: Genre, Length: 1556, dtype: object

It doesn’t always work. For example, in the following, we can’t use attribute access because Release Date has a space in it.

df.Release Date
  File "/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5575/3791632194.py", line 1
    df.Release Date
               ^
SyntaxError: invalid syntax
df["Release Date"]
0       2017-12-08
1       2021-07-09
2       2021-05-21
3       2021-06-25
4       2021-07-23
           ...    
1551    2017-06-02
1552    2019-10-11
1553    2018-01-12
1554    2019-09-25
1555    2019-11-13
Name: Release Date, Length: 1556, dtype: object
pd.to_datetime(df["Release Date"]).dt.day
0        8.0
1        9.0
2       21.0
3       25.0
4       23.0
        ... 
1551     2.0
1552    11.0
1553    12.0
1554    25.0
1555    13.0
Name: Release Date, Length: 1556, dtype: float64
year_series = pd.to_datetime(df["Release Date"]).dt.year

Sample exercise: How many 2019s are there in year_series?

(year_series == 2019).sum()
181
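Another option is value_counts, which counts every year at once; the 2019 entry of the result is the same count. A sketch with a small hypothetical stand-in for year_series:

```python
import pandas as pd

# small stand-in for year_series (hypothetical values)
year_series = pd.Series([2019, 2021, 2019, 2017])

# one count per distinct year, most frequent first
counts = year_series.value_counts()
counts[2019]  # 2
```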

Here we’re trying a different method, but it doesn’t work because of null values: the slice s[:4] fails when s is the float np.nan rather than a string. (This cell also compares the string s[:4] to the integer 2019, which would be False even for 2019 dates; the versions at the bottom of the notebook fix that.) I realized after class that we could have used a keyword argument to map called na_action, but during class we removed the null values by hand.

df["Release Date"].map(lambda s: s[:4] == 2019)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5575/337003375.py in <module>
----> 1 df["Release Date"].map(lambda s: s[:4] == 2019)

[... traceback through pandas internals omitted ...]

TypeError: 'float' object is not subscriptable
np.nan[:4]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5575/1812359730.py in <module>
----> 1 np.nan[:4]

TypeError: 'float' object is not subscriptable

During class I made another mistake before getting to the next cell, but I deleted it from this notebook because it’s more confusing than helpful.

Here is an alternate approach.

clean = df["Release Date"][~df["Release Date"].isna()]
type(clean)
pandas.core.series.Series
clean
0       2017-12-08
1       2021-07-09
2       2021-05-21
3       2021-06-25
4       2021-07-23
           ...    
1551    2017-06-02
1552    2019-10-11
1553    2018-01-12
1554    2019-09-25
1555    2019-11-13
Name: Release Date, Length: 1545, dtype: object

Two different ways to count 2019s in this pandas Series of strings. Notice that they give the same answer as the .dt.year method from above.

clean.map(lambda s: s[:4] == "2019").sum()
181
clean.map(lambda s: int(s[:4]) == 2019).sum()
181
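As mentioned above, map accepts an na_action keyword argument. With na_action="ignore", the lambda is skipped on missing values, so the null entries would not have needed to be removed by hand. A sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

dates = pd.Series(["2019-10-11", np.nan, "2019-09-25", "2018-01-12"])

# na_action="ignore" propagates NaN instead of passing it to the lambda
is_2019 = dates.map(lambda s: s[:4] == "2019", na_action="ignore")
```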