# Review

Today's class was review.

[Recording of lecture from 1/28/2022](https://uci.zoom.us/rec/share/OxfhTaj6tVZ4ysRJC_U2Nj3vGo0Hiqn-vm8WgGK-ANcfV5hahZLYPRCQFj60xI2a.MTVnvxxTkIXzkvLS?startTime=1643385481000)

In [1]:
import numpy as np
import pandas as pd
import altair as alt

## Dictionaries, and their relationship to pandas Series and pandas DataFrames.

In [10]:
d = {"a":5,"b":10,"chris":30}

In [None]:
d["b"]

10

In [4]:
e = {"a":[5,6,7],"b":10,"chris":[30,2.4,-20]}

Making a DataFrame from a dictionary.

In [5]:
df = pd.DataFrame(e)

In [6]:
df

Unnamed: 0,a,b,chris
0,5,10,30.0
1,6,10,2.4
2,7,10,-20.0
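
As a small sketch of what happened above (the name `df_demo` is only for illustration): the scalar `10` is repeated to fill the whole `b` column, while the list-valued entries must all have the same length.

```python
import pandas as pd

# Same dictionary as above: two lists of length 3 and one scalar.
e = {"a": [5, 6, 7], "b": 10, "chris": [30, 2.4, -20]}
df_demo = pd.DataFrame(e)

# The scalar is broadcast to every row.
print(df_demo["b"].tolist())

# Lists of different lengths cannot be lined up into rows, so pandas raises an error.
try:
    pd.DataFrame({"a": [5, 6, 7], "chris": [30, 2.4]})
except ValueError as err:
    print("ValueError:", err)
```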


In [7]:
df.a

0    5
1    6
2    7
Name: a, dtype: int64

In [8]:
df["a"]

0    5
1    6
2    7
Name: a, dtype: int64

Making a Series from a dictionary.

In [11]:
pd.Series(d)

a         5
b        10
chris    30
dtype: int64
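
To make the relationship concrete, here is a short sketch (the variable `s` is just an illustrative name): the dictionary keys become the index of the Series, so looking up a label works the same way as looking up a key.

```python
import pandas as pd

d = {"a": 5, "b": 10, "chris": 30}
s = pd.Series(d)

# Dictionary keys become the index, so label lookup mirrors d["b"].
print(s["b"])
print(list(s.index))

# Converting back recovers the original dictionary.
print(s.to_dict() == d)
```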

In [None]:
df

Unnamed: 0,a,b,chris
0,5,10,30.0
1,6,10,2.4
2,7,10,-20.0


In [None]:
df.columns

Index(['a', 'b', 'chris'], dtype='object')

## List comprehension

Practice exercise:
* Make a list of `True`s (bool type) using list comprehension, where the length of the list is the number of rows in `df`.

In [2]:
# These are strings, not bools
['true' for x in range(3)]

['true', 'true', 'true']

In [12]:
["true" for x in range(len(df))]

['true', 'true', 'true']

In [13]:
[true for x in range(len(df))]

NameError: name 'true' is not defined

Here is the correct answer.

In [14]:
[True for x in range(len(df))]

[True, True, True]

## Putting a new column in a DataFrame

One way to create a new column.

In [15]:
df["new column"] = [True for x in range(len(df))]

In [16]:
df

Unnamed: 0,a,b,chris,new column
0,5,10,30.0,True
1,6,10,2.4,True
2,7,10,-20.0,True


If the new column is filled with a single value, then you can create it more simply by assigning that value directly; pandas repeats it for every row:

In [None]:
df["new column 2"] = False

In [None]:
df

Unnamed: 0,a,b,chris,new column,new column 2
0,5,10,30.0,True,False
1,6,10,2.4,True,False
2,7,10,-20.0,True,False
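
A quick check, on a small illustrative DataFrame (`df_demo` and its column names are made up for this sketch), that the list version and the single-value version produce the same column:

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [5, 6, 7]})

# Explicit list with one entry per row.
df_demo["from list"] = [True for _ in range(len(df_demo))]
# A single value is broadcast to every row.
df_demo["from scalar"] = True

same_values = df_demo["from list"].tolist() == df_demo["from scalar"].tolist()
print(same_values)
```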


## Indexing

Indexing using `iloc`.

In [17]:
df.iloc[1,0] = 20
df

Unnamed: 0,a,b,chris,new column
0,5,10,30.0,True
1,20,10,2.4,True
2,7,10,-20.0,True


Indexing using `loc`.

In [18]:
df.loc[1,'a'] = 20
df

Unnamed: 0,a,b,chris,new column
0,5,10,30.0,True
1,20,10,2.4,True
2,7,10,-20.0,True
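
In the examples above `iloc` and `loc` hit the same cell because the row labels happen to equal the row positions. Here is a sketch with a hypothetical DataFrame whose row labels are shuffled, where the two give different answers:

```python
import pandas as pd

# Hypothetical example: row labels 2, 0, 1 do not match positions 0, 1, 2.
df_demo = pd.DataFrame({"a": [5, 6, 7], "b": [10, 10, 10]}, index=[2, 0, 1])

# iloc is positional: the row at position 1, the column at position 0.
by_position = df_demo.iloc[1, 0]
# loc is label-based: the row labeled 1, the column labeled "a".
by_label = df_demo.loc[1, "a"]

print(by_position, by_label)
```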


## Second largest value in a column.

Find the second largest value in the "a" column of df.

In [None]:
df["a"]

0     5
1    20
2     7
Name: a, dtype: int64

In [None]:
df["a"].sort_values()

0     5
2     7
1    20
Name: a, dtype: int64

In [None]:
df["a"].sort_values(ascending=False)

1    20
2     7
0     5
Name: a, dtype: int64

I kept accidentally trying to use `[1]` instead of `iloc[1]`.  On this Series, `[1]` looks up the row with *label* 1, not the entry in position 1.  Here is the correct way to find the element at integer position 1 in a pandas Series.

In [19]:
df["a"].sort_values(ascending=False).iloc[1]

7

Without the `ascending` keyword argument, we have to take the second-to-last entry.

In [21]:
df["a"].sort_values().iloc[-2]

7
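
The sketch below shows the difference between label lookup and positional lookup on the sorted Series. It also uses `nlargest` as one possible alternative, which we did not cover in class.

```python
import pandas as pd

s = pd.Series([5, 20, 7], name="a")
ordered = s.sort_values(ascending=False)  # row labels are now in the order 1, 2, 0

# Label lookup: the row labeled 1, which holds the largest value.
by_label = ordered.loc[1]
# Positional lookup: the second entry in sorted order.
by_position = ordered.iloc[1]
# Alternative: keep the two largest values and take the last of them.
second_largest = s.nlargest(2).iloc[-1]

print(by_label, by_position, second_largest)
```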

## Practice with axis

I don't think we have used `median` before, but the following should make sense.  The most important part is recognizing what `axis=0` means: the median is computed down the rows, so we get one value per column.

In [22]:
df

Unnamed: 0,a,b,chris,new column
0,5,10,30.0,True
1,20,10,2.4,True
2,7,10,-20.0,True


In [23]:
df.median(axis=0)

a              7.0
b             10.0
chris          2.4
new column     1.0
dtype: float64
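
For comparison, here is a sketch on a numeric-only copy of the data (the name `df_demo` is illustrative) showing both choices of `axis`:

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [5, 20, 7], "b": [10, 10, 10], "chris": [30, 2.4, -20]})

# axis=0: collapse the rows, one median per column.
col_medians = df_demo.median(axis=0)
# axis=1: collapse the columns, one median per row.
row_medians = df_demo.median(axis=1)

print(col_medians)
print(row_medians)
```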

## Flattening a NumPy array

In [24]:
A = np.array([[2,5,1],[3,1,10]])

In [25]:
A.reshape(-1)

array([ 2,  5,  1,  3,  1, 10])
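
NumPy has a few other ways to do the same thing. This sketch compares `reshape` with `flatten` and `ravel`; the main difference is that `flatten` always returns a copy, while `ravel` returns a view when it can.

```python
import numpy as np

A = np.array([[2, 5, 1], [3, 1, 10]])

flat_reshape = A.reshape(-1)  # -1 means "whatever length is needed"
flat_flatten = A.flatten()    # always a new copy
flat_ravel = A.ravel()        # a view of A when possible

# Changing the copy made by flatten leaves A alone.
flat_flatten[0] = 99
print(A[0, 0])
print(flat_reshape, flat_ravel)
```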

## Slicing

In [None]:
df

Unnamed: 0,a,b,chris,new column,new column 2
0,5,10,30.0,True,False
1,20,10,2.4,True,False
2,7,10,-20.0,True,False


Access every other element in the row with label 1 using slicing

In [26]:
df.loc[1,::2]

a         20
chris    2.4
Name: 1, dtype: object

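Try to change every other element in the row with label 1 using slicing.  The number of values on the right-hand side has to equal the number of entries selected by the slice; otherwise we get the error below.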

In [27]:
df.loc[1,::2] = [i**2 for i in range(3)]

ValueError: Must have equal len keys and value when setting with an iterable

The assignment works when the lengths match.  Here the slice selects three entries (`a`, `chris`, and `new column 2`), and we provide three values.

In [None]:
df.loc[1,::2] = [0,1,4]

In [None]:
df

Unnamed: 0,a,b,chris,new column,new column 2
0,5,10,30.0,True,False
1,0,10,1.0,True,4
2,7,10,-20.0,True,False
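
One way to avoid the length mismatch is to compute the number of selected columns instead of hard-coding it. This is a sketch on a hypothetical all-integer DataFrame (`df_demo` and the column names `c`, `d`, `e` are made up for the example):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "a": [5, 20, 7],
    "b": [10, 10, 10],
    "c": [30, 2, -20],
    "d": [1, 1, 1],
    "e": [0, 0, 0],
})

# Every other column label: a, c, e.
selected = df_demo.columns[::2]

# Build exactly as many values as there are selected columns.
df_demo.loc[1, ::2] = [i**2 for i in range(len(selected))]
print(df_demo)
```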


## Square every element in a DataFrame

In [28]:
# Take a copy so that later assignments into df2 don't trigger SettingWithCopyWarning
df2 = df.iloc[:,:3].copy()
df2

Unnamed: 0,a,b,chris
0,5,10,30.0
1,20,10,2.4
2,7,10,-20.0


In [None]:
df2.loc[1] = df2.loc[1]**2

In [None]:
df2

Unnamed: 0,a,b,chris
0,5.0,10.0,30.0
1,0.0,100.0,1.0
2,7.0,10.0,-20.0


In [29]:
df2**2

Unnamed: 0,a,b,chris
0,25,100,900.0
1,400,100,5.76
2,49,100,400.0


Try doing that same thing (squaring every entry in df2) using `map`, `apply`, or `applymap`.

In [30]:
df2.applymap(lambda x: x**2)

Unnamed: 0,a,b,chris
0,25,100,900.0
1,400,100,5.76
2,49,100,400.0


**Warning**: notice that `applymap` returns a new DataFrame and doesn't change the original DataFrame.  If you want to keep the result, assign it to a variable.

In [32]:
df2

Unnamed: 0,a,b,chris
0,5,10,30.0
1,20,10,2.4
2,7,10,-20.0
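
Here is a sketch of keeping the squared values by assigning the result. One assumption to flag: in newer versions of pandas (2.1 and later) the elementwise method on a DataFrame is named `map` and `applymap` is deprecated, so the code below picks whichever one is available.

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [5, 20, 7], "b": [10, 10, 10]})

# Assumption: use DataFrame.map if this pandas version has it, otherwise applymap.
elementwise = df_demo.map if hasattr(df_demo, "map") else df_demo.applymap

# The result is a new DataFrame; df_demo itself is untouched.
squared = elementwise(lambda x: x**2)
original_a = df_demo["a"].tolist()

# Reassign if we want df_demo to hold the squared values.
df_demo = squared
print(original_a, df_demo["a"].tolist())
```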


## Example using apply

In [33]:
df2.apply(lambda c: c.sum(), axis = 0)

a        32.0
b        30.0
chris    12.4
dtype: float64

In [34]:
df2.apply(lambda r: r.sum(), axis = 1)

0    45.0
1    32.4
2    -3.0
dtype: float64

What if we tried to use `applymap` instead?

In [None]:
df2.applymap(lambda a: a.sum())

AttributeError: 'float' object has no attribute 'sum'

Sample exercise: What causes the above error?

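Sample answer: `applymap` passes each individual entry of the DataFrame to the function, so `a` will be a single number, and `number.sum()` does not make sense.  We should use `apply` with the `axis` argument instead, so that the function receives an entire column or row.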

A smaller piece of code that raises the same error:

In [36]:
a = 5.1
a.sum()

AttributeError: 'float' object has no attribute 'sum'

Of the three methods `map`, `applymap`, and `apply`, the trickiest to understand is definitely `apply`.  The most natural example with `apply` we have seen was using `pd.to_numeric` on every column.  Notice how the input to `pd.to_numeric` should be an entire Series, not an individual entry.
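
As a reminder of what that looked like, here is a sketch on a small made-up DataFrame of strings (`df_text` and its columns are hypothetical):

```python
import pandas as pd

df_text = pd.DataFrame({"x": ["1", "2", "3"], "y": ["4.5", "5.5", "6.0"]})

# apply with axis=0 hands each column (a whole Series) to pd.to_numeric.
converted = df_text.apply(pd.to_numeric, axis=0)

print(converted.dtypes)
print(converted["y"].sum())
```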

## Working with datetime entries

In [37]:
df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")

In [38]:
df.columns

Index(['Index', 'Highest Charting Position', 'Number of Times Charted',
       'Week of Highest Charting', 'Song Name', 'Streams', 'Artist',
       'Artist Followers', 'Song ID', 'Genre', 'Release Date', 'Weeks Charted',
       'Popularity', 'Danceability', 'Energy', 'Loudness', 'Speechiness',
       'Acousticness', 'Liveness', 'Tempo', 'Duration (ms)', 'Valence',
       'Chord'],
      dtype='object')

This way of getting a column from a DataFrame is called *attribute* access:

In [40]:
df.Genre

0                  ['indie rock italiano', 'italian pop']
1                                  ['australian hip hop']
2                                                 ['pop']
3                                       ['pop', 'uk pop']
4                           ['lgbtq+ hip hop', 'pop rap']
                              ...                        
1551                       ['dance pop', 'pop', 'uk pop']
1552             ['sertanejo', 'sertanejo universitario']
1553    ['dance pop', 'electropop', 'pop', 'post-teen ...
1554                       ['brega funk', 'funk carioca']
1555                             ['pop', 'post-teen pop']
Name: Genre, Length: 1556, dtype: object

Attribute access doesn't always work.  For example, in the following, we can't use attribute access because *Release Date* has a space in it, so we use square brackets instead.

In [41]:
df.Release Date

SyntaxError: invalid syntax (3791632194.py, line 1)

In [44]:
df["Release Date"]

0       2017-12-08
1       2021-07-09
2       2021-05-21
3       2021-06-25
4       2021-07-23
           ...    
1551    2017-06-02
1552    2019-10-11
1553    2018-01-12
1554    2019-09-25
1555    2019-11-13
Name: Release Date, Length: 1556, dtype: object

In [45]:
pd.to_datetime(df["Release Date"]).dt.day

0        8.0
1        9.0
2       21.0
3       25.0
4       23.0
        ... 
1551     2.0
1552    11.0
1553    12.0
1554    25.0
1555    13.0
Name: Release Date, Length: 1556, dtype: float64

In [46]:
year_series = pd.to_datetime(df["Release Date"]).dt.year

Sample exercise: How many `2019`s are there in `year_series`?

In [47]:
(year_series == 2019).sum()

181
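
The years and days above show up as floats because some release dates are missing. Here is a sketch on a tiny made-up Series (the name `dates` and its values are illustrative) showing that missing entries become `NaN` and are simply not counted by the comparison:

```python
import numpy as np
import pandas as pd

dates = pd.Series(["2019-10-11", np.nan, "2017-06-02", "2019-11-13"])
years = pd.to_datetime(dates).dt.year

# The missing date has no year, so it shows up as NaN.
n_missing = years.isna().sum()

# NaN == 2019 is False, so missing entries don't affect the count.
count_2019 = (years == 2019).sum()
print(n_missing, count_2019)
```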

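Here we're trying a different method, `map`, but it doesn't work at first because of null values: a missing entry is stored as the float `nan`, which can't be sliced.  (Separately, `s[:4]` is a string, so it should be compared to `"2019"`, not to the integer `2019`.)  I realized after class that we could have used a keyword argument to `map` called `na_action`, but during class we removed the null values by hand.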

In [48]:
df["Release Date"].map(lambda s: s[:4] == 2019)

TypeError: 'float' object is not subscriptable

In [None]:
np.nan[:4]

TypeError: 'float' object is not subscriptable
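
Here is a sketch of the `na_action` approach on a tiny made-up Series (we did not run this in class, and the name `dates` is illustrative). With `na_action="ignore"`, `map` skips the missing entries instead of passing them to the function.

```python
import numpy as np
import pandas as pd

dates = pd.Series(["2019-10-11", np.nan, "2017-06-02", "2019-11-13"])

# Missing values are left as missing rather than being sliced.
flags = dates.map(lambda s: s[:4] == "2019", na_action="ignore")

# Drop the missing entries before counting the True values.
count_2019 = int(flags.dropna().astype(bool).sum())
print(count_2019)
```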

During class I made another mistake before getting to the next cell, but I deleted it from this notebook because it's more confusing than helpful.

Here is an alternate approach.

In [50]:
clean = df["Release Date"][~df["Release Date"].isna()]

In [51]:
type(clean)

pandas.core.series.Series

In [52]:
clean

0       2017-12-08
1       2021-07-09
2       2021-05-21
3       2021-06-25
4       2021-07-23
           ...    
1551    2017-06-02
1552    2019-10-11
1553    2018-01-12
1554    2019-09-25
1555    2019-11-13
Name: Release Date, Length: 1545, dtype: object

Two different ways to count `2019`s in this pandas Series of strings.  Notice that they give the same answer as the `.dt.year` method from above.

In [55]:
clean.map(lambda s: s[:4] == "2019").sum()

181

In [56]:
clean.map(lambda s: int(s[:4]) == 2019).sum()

181
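
Two more possibilities, sketched on a small made-up Series (the name `dates` is illustrative, and neither was used in class): `dropna` removes the null values in one step, and the `.str` accessor can test the prefix directly, with `na=False` treating missing entries as non-matches.

```python
import numpy as np
import pandas as pd

dates = pd.Series(["2019-10-11", np.nan, "2017-06-02", "2019-11-13"])

# Boolean indexing, as above, and dropna give the same result.
clean_mask = dates[~dates.isna()]
clean_drop = dates.dropna()
same = clean_mask.equals(clean_drop)

# Count the 2019s without removing nulls first.
count_2019 = dates.str.startswith("2019", na=False).sum()
print(same, count_2019)
```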