Review¶
Today’s class was review.
Recording of lecture from 1/28/2022
import numpy as np
import pandas as pd
import altair as alt
Dictionaries, and their relationship to pandas Series and pandas DataFrames.¶
d = {"a":5,"b":10,"chris":30}
d["b"]
10
e = {"a":[5,6,7],"b":10,"chris":[30,2.4,-20]}
Making a DataFrame from a dictionary.
df = pd.DataFrame(e)
df
| | a | b | chris |
|---|---|---|---|
| 0 | 5 | 10 | 30.0 |
| 1 | 6 | 10 | 2.4 |
| 2 | 7 | 10 | -20.0 |
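Notice that the scalar value for `"b"` was broadcast to every row. A minimal sketch of that behavior (a made-up two-column example, not the dictionary from class):

```python
import pandas as pd

# A scalar dictionary value is broadcast down the whole column
# when at least one other value is list-like
df_demo = pd.DataFrame({"a": [5, 6, 7], "b": 10})
# df_demo["b"] is now the constant column 10, 10, 10
```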
df.a
0 5
1 6
2 7
Name: a, dtype: int64
df["a"]
0 5
1 6
2 7
Name: a, dtype: int64
Making a Series from a dictionary.
pd.Series(d)
a 5
b 10
chris 30
dtype: int64
df
| | a | b | chris |
|---|---|---|---|
| 0 | 5 | 10 | 30.0 |
| 1 | 6 | 10 | 2.4 |
| 2 | 7 | 10 | -20.0 |
df.columns
Index(['a', 'b', 'chris'], dtype='object')
List comprehension¶
Practice exercise:
Make a list of `True`s (bool type) using list comprehension, where the length of the list is the number of rows in `df`.
# These are strings, not bools
['true' for x in range(3)]
['true', 'true', 'true']
["true" for x in range(len(df))]
['true', 'true', 'true']
[true for x in range(len(df))]
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5575/3159580370.py in <module>
----> 1 [true for x in range(len(df))]
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5575/3159580370.py in <listcomp>(.0)
----> 1 [true for x in range(len(df))]
NameError: name 'true' is not defined
Here is the correct answer.
[True for x in range(len(df))]
[True, True, True]
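As an aside, list repetition gives the same result without a comprehension. A quick sketch (using a stand-in `n` for `len(df)`):

```python
# Two equivalent ways to build a length-n list of True values
n = 3  # stands in for len(df)
via_comprehension = [True for x in range(n)]
via_repetition = [True] * n  # list repetition
```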
Putting a new column in a DataFrame¶
One way to create a new column.
df["new column"] = [True for x in range(len(df))]
df
| | a | b | chris | new column |
|---|---|---|---|---|
| 0 | 5 | 10 | 30.0 | True |
| 1 | 6 | 10 | 2.4 | True |
| 2 | 7 | 10 | -20.0 | True |
If the new column is filled with a single value, then you can make the new column faster:
df["new column 2"] = False
df
| | a | b | chris | new column | new column 2 |
|---|---|---|---|---|---|
| 0 | 5 | 10 | 30.0 | True | False |
| 1 | 6 | 10 | 2.4 | True | False |
| 2 | 7 | 10 | -20.0 | True | False |
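A related method we did not use in class is `assign`, which adds the column to a *new* DataFrame instead of modifying the original. A sketch with a made-up column name `flag`:

```python
import pandas as pd

df_small = pd.DataFrame({"a": [5, 6, 7]})
# assign returns a new DataFrame; a scalar is broadcast
# just like df["new column 2"] = False above
df_with_flag = df_small.assign(flag=False)
# df_small itself is unchanged
```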
Indexing¶
Indexing using `iloc`.
df.iloc[1,0] = 20
df
| | a | b | chris | new column | new column 2 |
|---|---|---|---|---|---|
| 0 | 5 | 10 | 30.0 | True | False |
| 1 | 20 | 10 | 2.4 | True | False |
| 2 | 7 | 10 | -20.0 | True | False |
Indexing using `loc`.
df.loc[1,'a'] = 20
df
| | a | b | chris | new column | new column 2 |
|---|---|---|---|---|---|
| 0 | 5 | 10 | 30.0 | True | False |
| 1 | 20 | 10 | 2.4 | True | False |
| 2 | 7 | 10 | -20.0 | True | False |
Second largest value in a column.¶
Find the second largest value in the “a” column of df.
df["a"]
0 5
1 20
2 7
Name: a, dtype: int64
df["a"].sort_values()
0 5
2 7
1 20
Name: a, dtype: int64
df["a"].sort_values(ascending=False)
1 20
2 7
0 5
Name: a, dtype: int64
I kept accidentally trying to use `[1]` instead of `iloc[1]`. Here is the correct way to find the element at index 1 in a pandas Series.
df["a"].sort_values(ascending=False).iloc[1]
7
Without the `ascending` keyword argument, we have to take the second-to-last entry.
df["a"].sort_values().iloc[-2]
7
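Another option we didn't use in class is `nlargest`, which keeps only the top values; its last entry is the second largest:

```python
import pandas as pd

s = pd.Series([5, 20, 7])
# nlargest(2) keeps the two biggest values, in decreasing order;
# the last of those is the second largest overall
second_largest = s.nlargest(2).iloc[-1]
```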
Practice with axis¶
I don’t think we have used `median` before, but the following should make sense. The most important part is recognizing what `axis=0` means.
df
| | a | b | chris | new column | new column 2 |
|---|---|---|---|---|---|
| 0 | 5 | 10 | 30.0 | True | False |
| 1 | 20 | 10 | 2.4 | True | False |
| 2 | 7 | 10 | -20.0 | True | False |
df.median(axis=0)
a 7.0
b 10.0
chris 2.4
new column 1.0
new column 2 0.0
dtype: float64
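With `axis=0`, the rows are collapsed, giving one median per column. For contrast, `axis=1` collapses the columns, giving one median per row. A sketch using just the three numeric columns:

```python
import pandas as pd

df_num = pd.DataFrame({"a": [5, 20, 7],
                       "b": [10, 10, 10],
                       "chris": [30.0, 2.4, -20.0]})
col_medians = df_num.median(axis=0)  # one value per column
row_medians = df_num.median(axis=1)  # one value per row
```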
Flattening a NumPy array¶
A = np.array([[2,5,1],[3,1,10]])
A.reshape(-1)
array([ 2, 5, 1, 3, 1, 10])
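NumPy also offers `ravel` and `flatten` for the same job; as a side note, `reshape(-1)` and `ravel` return views of the original data when possible, while `flatten` always returns a copy.

```python
import numpy as np

A = np.array([[2, 5, 1], [3, 1, 10]])
# Three equivalent ways to flatten a 2-dimensional array
flat1 = A.reshape(-1)
flat2 = A.ravel()
flat3 = A.flatten()  # always a copy
```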
Slicing¶
df
| | a | b | chris | new column | new column 2 |
|---|---|---|---|---|---|
| 0 | 5 | 10 | 30.0 | True | False |
| 1 | 20 | 10 | 2.4 | True | False |
| 2 | 7 | 10 | -20.0 | True | False |
Access every other element in the row with label 1 using slicing.
df.loc[1,::2]
a 20
chris 2.4
new column 2 False
Name: 1, dtype: object
Change every other element in the row with label 1 using slicing.
df.loc[1,::2] = [i**2 for i in range(3)]
The same assignment, with the squares written out explicitly.
df.loc[1,::2] = [0,1,4]
df
| | a | b | chris | new column | new column 2 |
|---|---|---|---|---|---|
| 0 | 5 | 10 | 30.0 | True | False |
| 1 | 0 | 10 | 1.0 | True | 4 |
| 2 | 7 | 10 | -20.0 | True | False |
Square every element in a DataFrame¶
df2 = df.iloc[:,:3]
df2
| | a | b | chris |
|---|---|---|---|
| 0 | 5 | 10 | 30.0 |
| 1 | 0 | 10 | 1.0 |
| 2 | 7 | 10 | -20.0 |
df2.loc[1] = df2.loc[1]**2
df2
| | a | b | chris |
|---|---|---|---|
| 0 | 5 | 10 | 30.0 |
| 1 | 0 | 100 | 1.0 |
| 2 | 7 | 10 | -20.0 |
df2**2
| | a | b | chris |
|---|---|---|---|
| 0 | 25 | 100 | 900.0 |
| 1 | 0 | 10000 | 1.0 |
| 2 | 49 | 100 | 400.0 |
Try doing that same thing (squaring every entry in `df2`) using `map`, `apply`, or `applymap`.
df2.applymap(lambda x: x**2)
| | a | b | chris |
|---|---|---|---|
| 0 | 25 | 100 | 900.0 |
| 1 | 0 | 10000 | 1.0 |
| 2 | 49 | 100 | 400.0 |
Warning: notice that `applymap` doesn’t change the original DataFrame.
df2
| | a | b | chris |
|---|---|---|---|
| 0 | 5 | 10 | 30.0 |
| 1 | 0 | 100 | 1.0 |
| 2 | 7 | 10 | -20.0 |
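To keep the squared values, reassign the result. A one-column sketch of that pattern:

```python
import pandas as pd

df_sq = pd.DataFrame({"a": [5, 0, 7]})
# applymap returns a new DataFrame; reassigning keeps the change
df_sq = df_sq.applymap(lambda x: x**2)
```

In more recent pandas versions, the same elementwise operation is also available under the name `DataFrame.map`.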
Example using apply¶
df2.apply(lambda c: c.sum(), axis = 0)
a 12.0
b 120.0
chris 11.0
dtype: float64
df2.apply(lambda r: r.sum(), axis = 1)
0 45.0
1 101.0
2 -3.0
dtype: float64
What if we tried to use `applymap` instead?
df2.applymap(lambda a: a.sum())
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5575/3466583239.py in <module>
----> 1 df2.applymap(lambda a: a.sum())
~/miniconda3/envs/math11/lib/python3.9/site-packages/pandas/core/frame.py in applymap(self, func, na_action, **kwargs)
8823 return lib.map_infer(x.astype(object)._values, func, ignore_na=ignore_na)
8824
-> 8825 return self.apply(infer).__finalize__(self, "applymap")
8826
8827 # ----------------------------------------------------------------------
~/miniconda3/envs/math11/lib/python3.9/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwargs)
8738 kwargs=kwargs,
8739 )
-> 8740 return op.apply()
8741
8742 def applymap(
~/miniconda3/envs/math11/lib/python3.9/site-packages/pandas/core/apply.py in apply(self)
686 return self.apply_raw()
687
--> 688 return self.apply_standard()
689
690 def agg(self):
~/miniconda3/envs/math11/lib/python3.9/site-packages/pandas/core/apply.py in apply_standard(self)
810
811 def apply_standard(self):
--> 812 results, res_index = self.apply_series_generator()
813
814 # wrap results
~/miniconda3/envs/math11/lib/python3.9/site-packages/pandas/core/apply.py in apply_series_generator(self)
826 for i, v in enumerate(series_gen):
827 # ignore SettingWithCopy here in case the user mutates
--> 828 results[i] = self.f(v)
829 if isinstance(results[i], ABCSeries):
830 # If we have a view on v, we need to make a copy because
~/miniconda3/envs/math11/lib/python3.9/site-packages/pandas/core/frame.py in infer(x)
8821 if x.empty:
8822 return lib.map_infer(x, func, ignore_na=ignore_na)
-> 8823 return lib.map_infer(x.astype(object)._values, func, ignore_na=ignore_na)
8824
8825 return self.apply(infer).__finalize__(self, "applymap")
~/miniconda3/envs/math11/lib/python3.9/site-packages/pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5575/3466583239.py in <lambda>(a)
----> 1 df2.applymap(lambda a: a.sum())
AttributeError: 'int' object has no attribute 'sum'
Sample exercise: What causes the above error?

Sample answer: `a` will be an individual number in the DataFrame, and `number.sum()` does not make sense. We should use `apply` with the `axis` keyword argument instead.
A smaller piece of code that raises the same error:
a = 5.1
a.sum()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5575/223743580.py in <module>
1 a = 5.1
----> 2 a.sum()
AttributeError: 'float' object has no attribute 'sum'
Of the three methods `map`, `applymap`, and `apply`, `apply` is definitely the trickiest to understand. The most natural example of `apply` we have seen was using `pd.to_numeric` on every column. Notice that the input to `pd.to_numeric` should be an entire Series, not an individual entry.
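A minimal sketch of that pattern, using a made-up DataFrame of strings (not the dataset from class):

```python
import pandas as pd

df_str = pd.DataFrame({"x": ["1", "2"], "y": ["3.5", "4"]})
# apply with axis=0 hands each whole column (a Series) to pd.to_numeric,
# which is exactly the input pd.to_numeric expects
df_converted = df_str.apply(pd.to_numeric, axis=0)
```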
Working with datetime entries¶
df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
df.columns
Index(['Index', 'Highest Charting Position', 'Number of Times Charted',
'Week of Highest Charting', 'Song Name', 'Streams', 'Artist',
'Artist Followers', 'Song ID', 'Genre', 'Release Date', 'Weeks Charted',
'Popularity', 'Danceability', 'Energy', 'Loudness', 'Speechiness',
'Acousticness', 'Liveness', 'Tempo', 'Duration (ms)', 'Valence',
'Chord'],
dtype='object')
This way of getting a column from a DataFrame is called attribute access:
df.Genre
0 ['indie rock italiano', 'italian pop']
1 ['australian hip hop']
2 ['pop']
3 ['pop', 'uk pop']
4 ['lgbtq+ hip hop', 'pop rap']
...
1551 ['dance pop', 'pop', 'uk pop']
1552 ['sertanejo', 'sertanejo universitario']
1553 ['dance pop', 'electropop', 'pop', 'post-teen ...
1554 ['brega funk', 'funk carioca']
1555 ['pop', 'post-teen pop']
Name: Genre, Length: 1556, dtype: object
It doesn’t always work. For example, we can’t use attribute access for `Release Date`, because the column name contains a space.
df.Release Date
File "/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5575/3791632194.py", line 1
df.Release Date
^
SyntaxError: invalid syntax
df["Release Date"]
0 2017-12-08
1 2021-07-09
2 2021-05-21
3 2021-06-25
4 2021-07-23
...
1551 2017-06-02
1552 2019-10-11
1553 2018-01-12
1554 2019-09-25
1555 2019-11-13
Name: Release Date, Length: 1556, dtype: object
pd.to_datetime(df["Release Date"]).dt.day
0 8.0
1 9.0
2 21.0
3 25.0
4 23.0
...
1551 2.0
1552 11.0
1553 12.0
1554 25.0
1555 13.0
Name: Release Date, Length: 1556, dtype: float64
year_series = pd.to_datetime(df["Release Date"]).dt.year
Sample exercise: How many `2019`s are there in `year_series`?
(year_series == 2019).sum()
181
Here we’re trying a different method, but it doesn’t work at first because of null values. I realized after class that we could have used the `na_action` keyword argument to `map`, but during class we removed the null values by hand.
df["Release Date"].map(lambda s: s[:4] == 2019)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5575/337003375.py in <module>
----> 1 df["Release Date"].map(lambda s: s[:4] == 2019)
~/miniconda3/envs/math11/lib/python3.9/site-packages/pandas/core/series.py in map(self, arg, na_action)
4159 dtype: object
4160 """
-> 4161 new_values = super()._map_values(arg, na_action=na_action)
4162 return self._constructor(new_values, index=self.index).__finalize__(
4163 self, method="map"
~/miniconda3/envs/math11/lib/python3.9/site-packages/pandas/core/base.py in _map_values(self, mapper, na_action)
868
869 # mapper is a function
--> 870 new_values = map_f(values, mapper)
871
872 return new_values
~/miniconda3/envs/math11/lib/python3.9/site-packages/pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5575/337003375.py in <lambda>(s)
----> 1 df["Release Date"].map(lambda s: s[:4] == 2019)
TypeError: 'float' object is not subscriptable
np.nan[:4]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5575/1812359730.py in <module>
----> 1 np.nan[:4]
TypeError: 'float' object is not subscriptable
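The `na_action` keyword mentioned above sidesteps exactly this problem. A sketch with a hypothetical three-entry Series standing in for `df["Release Date"]`:

```python
import numpy as np
import pandas as pd

s = pd.Series(["2019-10-11", np.nan, "2018-01-12"])
# na_action="ignore" leaves missing values as NaN instead of
# passing them to the lambda (which would raise a TypeError)
matches = s.map(lambda x: x[:4] == "2019", na_action="ignore")
count = (matches == True).sum()
```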
During class I made another mistake before getting to the next cell, but I deleted it from this notebook because it’s more confusing than helpful.
Here is an alternate approach.
clean = df["Release Date"][~df["Release Date"].isna()]
type(clean)
pandas.core.series.Series
clean
0 2017-12-08
1 2021-07-09
2 2021-05-21
3 2021-06-25
4 2021-07-23
...
1551 2017-06-02
1552 2019-10-11
1553 2018-01-12
1554 2019-09-25
1555 2019-11-13
Name: Release Date, Length: 1545, dtype: object
Here are two different ways to count `2019`s in this pandas Series of strings. Notice that they give the same answer as the `.dt.year` method from above.
clean.map(lambda s: s[:4] == "2019").sum()
181
clean.map(lambda s: int(s[:4]) == 2019).sum()
181
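A third variation on the same idea: `dropna` removes the missing entries up front, which is a shorter spelling of the `~isna()` boolean indexing above. A sketch with a hypothetical three-entry Series:

```python
import numpy as np
import pandas as pd

s = pd.Series(["2019-10-11", np.nan, "2018-01-12"])
# dropna removes missing entries, so the lambda only ever sees strings
count = s.dropna().map(lambda x: x[:4] == "2019").sum()
```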