Review

Today’s class was review.

Recording of lecture from 1/28/2022

import numpy as np
import pandas as pd
import altair as alt

Dictionaries, and their relationship to pandas Series and pandas DataFrames.

d = {"a":5,"b":10,"chris":30}
d["b"]
10
e = {"a":[5,6,7],"b":10,"chris":[30,2.4,-20]}

Making a DataFrame from a dictionary.

df = pd.DataFrame(e)
df
a b chris
0 5 10 30.0
1 6 10 2.4
2 7 10 -20.0
df.a
0    5
1    6
2    7
Name: a, dtype: int64
df["a"]
0    5
1    6
2    7
Name: a, dtype: int64

Making a Series from a dictionary.

pd.Series(d)
a         5
b        10
chris    30
dtype: int64
df
a b chris
0 5 10 30.0
1 6 10 2.4
2 7 10 -20.0
df.columns
Index(['a', 'b', 'chris'], dtype='object')

List comprehension

Practice exercise:

  • Make a list of Trues (bool type) using list comprehension, where the length of the list is the number of rows in df.

# These are strings, not bools
['true' for x in range(3)]
['true', 'true', 'true']
["true" for x in range(len(df))]
['true', 'true', 'true']
[true for x in range(len(df))]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5575/3159580370.py in <module>
----> 1 [true for x in range(len(df))]

/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5575/3159580370.py in <listcomp>(.0)
----> 1 [true for x in range(len(df))]

NameError: name 'true' is not defined

Here is the correct answer.

[True for x in range(len(df))]
[True, True, True]
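A shorter alternative, not shown in class, is list multiplication, which repeats a single value and avoids the comprehension entirely. A minimal sketch, using a small stand-in for df:

```python
import pandas as pd

# small stand-in for df (hypothetical values)
df = pd.DataFrame({"a": [5, 6, 7]})

# repeat the single value True once per row of df
bool_list = [True] * len(df)
bool_list  # [True, True, True]
```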

Putting a new column in a DataFrame

One way to create a new column.

df["new column"] = [True for x in range(len(df))]
df
a b chris new column
0 5 10 30.0 True
1 6 10 2.4 True
2 7 10 -20.0 True

If the new column is filled with a single repeated value, then you can create it more concisely:

df["new column 2"] = False
df
a b chris new column new column 2
0 5 10 30.0 True False
1 6 10 2.4 True False
2 7 10 -20.0 True False

Indexing

Indexing using iloc.

df.iloc[1,0] = 20
df
a b chris new column new column 2
0 5 10 30.0 True False
1 20 10 2.4 True False
2 7 10 -20.0 True False

Indexing using loc.

df.loc[1,'a'] = 20
df
a b chris new column new column 2
0 5 10 30.0 True False
1 20 10 2.4 True False
2 7 10 -20.0 True False
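The two cells above give the same result only because the row labels of df happen to equal the integer positions. A sketch with a hypothetical non-default index shows where loc and iloc actually differ: iloc is positional, loc is label-based.

```python
import pandas as pd

# row labels deliberately differ from positions (hypothetical example)
s = pd.DataFrame({"a": [5, 20, 7]}, index=[10, 20, 30])

# iloc is positional: the row at position 1, i.e. the second row
positional = s.iloc[1, 0]   # 20

# loc is label-based: the row whose label is 10, i.e. the first row
labeled = s.loc[10, "a"]    # 5
```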

Second largest value in a column.

Find the second largest value in the “a” column of df.

df["a"]
0     5
1    20
2     7
Name: a, dtype: int64
df["a"].sort_values()
0     5
2     7
1    20
Name: a, dtype: int64
df["a"].sort_values(ascending=False)
1    20
2     7
0     5
Name: a, dtype: int64

In class I kept accidentally writing [1] instead of .iloc[1]. Here is the correct way to get the element at integer position 1 in a pandas Series.

df["a"].sort_values(ascending=False).iloc[1]
7

Without the ascending keyword argument, we have to take the second-to-last entry.

df["a"].sort_values().iloc[-2]
7
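pandas can also do the sorting for us: the Series method nlargest keeps the biggest values in decreasing order, so the last of the top two is the second largest. A sketch, assuming the same values 5, 20, 7 as in the "a" column above:

```python
import pandas as pd

s = pd.Series([5, 20, 7], name="a")

# nlargest(2) keeps the two biggest values, largest first
second_largest = s.nlargest(2).iloc[-1]
second_largest  # 7
```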

Practice with axis

I don’t think we have used median before, but the following should make sense. The most important part is recognizing what axis=0 means: the median is computed down each column, producing one value per column.

df
a b chris new column new column 2
0 5 10 30.0 True False
1 20 10 2.4 True False
2 7 10 -20.0 True False
df.median(axis=0)
a                7.0
b               10.0
chris            2.4
new column       1.0
new column 2     0.0
dtype: float64
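As a sketch of the axis convention on a purely numeric DataFrame (hypothetical values): axis=0 collapses the rows, producing one number per column, while axis=1 collapses the columns, producing one number per row.

```python
import pandas as pd

small = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

col_medians = small.median(axis=0)  # one median per column: x -> 2.0, y -> 20.0
row_medians = small.median(axis=1)  # one median per row: e.g. row 0 -> 5.5
```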

Flattening a NumPy array

A = np.array([[2,5,1],[3,1,10]])
A.reshape((-1))
array([ 2,  5,  1,  3,  1, 10])
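reshape(-1) is one of several equivalent ways to flatten. The ravel and flatten methods give the same values; ravel returns a view when possible, while flatten always returns a copy.

```python
import numpy as np

A = np.array([[2, 5, 1], [3, 1, 10]])

flat1 = A.reshape(-1)  # reshape with the length inferred from -1
flat2 = A.ravel()      # a view of A when possible
flat3 = A.flatten()    # always a copy
```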

Slicing

df
a b chris new column new column 2
0 5 10 30.0 True False
1 20 10 2.4 True False
2 7 10 -20.0 True False

Access every other element in the row with label 1 using slicing.

df.loc[1,::2]
a                  20
chris             2.4
new column 2    False
Name: 1, dtype: object

Change every other element in the row with label 1 using slicing.

df.loc[1,::2] = [i**2 for i in range(3)]

The same change, this time written with an explicit list.

df.loc[1,::2] = [0,1,4]
df
a b chris new column new column 2
0 5 10 30.0 True False
1 0 10 1.0 True 4
2 7 10 -20.0 True False

Square every element in a DataFrame

df2 = df.iloc[:,:3]
df2
a b chris
0 5 10 30.0
1 0 10 1.0
2 7 10 -20.0
df2.loc[1] = df2.loc[1]**2
df2
a b chris
0 5 10 30.0
1 0 100 1.0
2 7 10 -20.0
df2**2
a b chris
0 25 100 900.0
1 0 10000 1.0
2 49 100 400.0

Try doing that same thing (squaring every entry in df2) using map, apply, or applymap.

df2.applymap(lambda x: x**2)
a b chris
0 25 100 900.0
1 0 10000 1.0
2 49 100 400.0

Warning: notice that applymap doesn’t change the original DataFrame.

df2
a b chris
0 5 10 30.0
1 0 100 1.0
2 7 10 -20.0
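Since applymap returns a new DataFrame, the way to keep the squared values is to assign the result back to the variable. A minimal sketch with hypothetical values (note that newer versions of pandas rename applymap to DataFrame.map):

```python
import pandas as pd

df2 = pd.DataFrame({"a": [5, 0, 7]})

# assign the result back so df2 actually changes
df2 = df2.applymap(lambda x: x**2)
```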

Example using apply

df2.apply(lambda c: c.sum(), axis = 0)
a         12.0
b        120.0
chris     11.0
dtype: float64
df2.apply(lambda r: r.sum(), axis = 1)
0     45.0
1    101.0
2     -3.0
dtype: float64
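These particular apply calls duplicate a pandas built-in: sum with the axis keyword argument gives the same answers directly. A sketch with hypothetical numeric values:

```python
import pandas as pd

df2 = pd.DataFrame({"a": [5, 0, 7], "b": [10, 100, 10]})

col_sums = df2.sum(axis=0)  # same as df2.apply(lambda c: c.sum(), axis=0)
row_sums = df2.sum(axis=1)  # same as df2.apply(lambda r: r.sum(), axis=1)
```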

What if we tried to use applymap instead?

df2.applymap(lambda a: a.sum())
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5575/3466583239.py in <module>
----> 1 df2.applymap(lambda a: a.sum())

[... long traceback through pandas internals omitted ...]

AttributeError: 'int' object has no attribute 'sum'

Sample exercise: What causes the above error?

Sample answer: applymap calls the function on each individual entry, so a is a single number from the DataFrame, and a number does not have a sum method. To sum the entries, use apply together with the axis keyword argument instead.

A smaller piece of code that raises the same error:

a = 5.1
a.sum()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5575/223743580.py in <module>
      1 a = 5.1
----> 2 a.sum()

AttributeError: 'float' object has no attribute 'sum'

Of the three methods map, applymap, and apply, apply is definitely the trickiest to understand. The most natural example of apply we have seen was applying pd.to_numeric to every column. Notice that the input to pd.to_numeric is an entire Series, not an individual entry.
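A sketch of that pd.to_numeric pattern, using a hypothetical DataFrame of numeric strings: apply hands each entire column (a Series) to the function.

```python
import pandas as pd

# hypothetical DataFrame whose entries are strings of numbers
str_df = pd.DataFrame({"x": ["1", "2"], "y": ["3.5", "4.5"]})

# each column Series is passed to pd.to_numeric, one column at a time
num_df = str_df.apply(pd.to_numeric, axis=0)
```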

Working with datetime entries

df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
df.columns
Index(['Index', 'Highest Charting Position', 'Number of Times Charted',
       'Week of Highest Charting', 'Song Name', 'Streams', 'Artist',
       'Artist Followers', 'Song ID', 'Genre', 'Release Date', 'Weeks Charted',
       'Popularity', 'Danceability', 'Energy', 'Loudness', 'Speechiness',
       'Acousticness', 'Liveness', 'Tempo', 'Duration (ms)', 'Valence',
       'Chord'],
      dtype='object')

This way of getting a column from a DataFrame is called attribute access:

df.Genre
0                  ['indie rock italiano', 'italian pop']
1                                  ['australian hip hop']
2                                                 ['pop']
3                                       ['pop', 'uk pop']
4                           ['lgbtq+ hip hop', 'pop rap']
                              ...                        
1551                       ['dance pop', 'pop', 'uk pop']
1552             ['sertanejo', 'sertanejo universitario']
1553    ['dance pop', 'electropop', 'pop', 'post-teen ...
1554                       ['brega funk', 'funk carioca']
1555                             ['pop', 'post-teen pop']
Name: Genre, Length: 1556, dtype: object

It doesn’t always work. For example, in the following, we can’t use attribute access because Release Date has a space in it.

df.Release Date
  File "/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5575/3791632194.py", line 1
    df.Release Date
               ^
SyntaxError: invalid syntax
df["Release Date"]
0       2017-12-08
1       2021-07-09
2       2021-05-21
3       2021-06-25
4       2021-07-23
           ...    
1551    2017-06-02
1552    2019-10-11
1553    2018-01-12
1554    2019-09-25
1555    2019-11-13
Name: Release Date, Length: 1556, dtype: object
pd.to_datetime(df["Release Date"]).dt.day
0        8.0
1        9.0
2       21.0
3       25.0
4       23.0
        ... 
1551     2.0
1552    11.0
1553    12.0
1554    25.0
1555    13.0
Name: Release Date, Length: 1556, dtype: float64
year_series = pd.to_datetime(df["Release Date"]).dt.year

Sample exercise: How many 2019s are there in year_series?

(year_series == 2019).sum()
181
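Another option is value_counts, which counts every year at once; the 2019 entry of the result is the same count. A sketch with a small hypothetical stand-in for year_series:

```python
import pandas as pd

# small stand-in for year_series (hypothetical values)
year_series = pd.Series([2019, 2021, 2019, 2017])

# one count per distinct year, most frequent first
counts = year_series.value_counts()
counts[2019]  # 2
```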

Here we’re trying a different method, but it doesn’t work because of null values: the slice s[:4] fails when s is the float np.nan rather than a string. (This cell also compares the string s[:4] to the integer 2019, which would be False even for 2019 dates; the versions at the bottom of the notebook fix that.) I realized after class that we could have used a keyword argument to map called na_action, but during class we removed the null values by hand.

df["Release Date"].map(lambda s: s[:4] == 2019)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5575/337003375.py in <module>
----> 1 df["Release Date"].map(lambda s: s[:4] == 2019)

[... traceback through pandas internals omitted ...]

TypeError: 'float' object is not subscriptable
np.nan[:4]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5575/1812359730.py in <module>
----> 1 np.nan[:4]

TypeError: 'float' object is not subscriptable

During class I made another mistake before getting to the next cell, but I deleted it from this notebook because it’s more confusing than helpful.

Here is an alternate approach.

clean = df["Release Date"][~df["Release Date"].isna()]
type(clean)
pandas.core.series.Series
clean
0       2017-12-08
1       2021-07-09
2       2021-05-21
3       2021-06-25
4       2021-07-23
           ...    
1551    2017-06-02
1552    2019-10-11
1553    2018-01-12
1554    2019-09-25
1555    2019-11-13
Name: Release Date, Length: 1545, dtype: object

Two different ways to count 2019s in this pandas Series of strings. Notice that they give the same answer as the .dt.year method from above.

clean.map(lambda s: s[:4] == "2019").sum()
181
clean.map(lambda s: int(s[:4]) == 2019).sum()
181
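As mentioned above, map accepts an na_action keyword argument. With na_action="ignore", the lambda is skipped on missing values, so the null entries would not have needed to be removed by hand. A sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

dates = pd.Series(["2019-10-11", np.nan, "2019-09-25", "2018-01-12"])

# na_action="ignore" propagates NaN instead of passing it to the lambda
is_2019 = dates.map(lambda s: s[:4] == "2019", na_action="ignore")
```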