Week 10, Monday
import seaborn as sns
import numpy as np
import pandas as pd
Lecture
We will practice analyzing and cleaning a dataset. This dataset contains scaled versions of the Midterm 2 scores.
Can you identify which problem needed to be curved differently between the two versions of the midterm?
df = pd.read_csv("../data/Midterm_scaled.csv")
df.head()
Student | ID | SIS User ID | SIS Login ID | Section | 1a | 1b | 1c | 1d | 1e | 1f | 2 | 3 | Column that is randomly blank. | Version | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | NaN | NaN | Out of 3 | Out of 3 | Out of 3 | Out of 3 | Out of 3 | Out of 3 | Out of 4 | Out of 8 | NaN | NaN |
1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Chris Davis | 23232.0 | 23251.0 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -0.717101719 | 0.842970667 | 0.887970408 | 0.972078231 | 0.654139006 | 0.264135272 | 1.072380529 | 0.887667048 | NaN | a |
3 | Chris Davis | 23232.0 | 23251.0 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | 1.223291168 | -1.351934088 | 0.887970408 | 0.972078231 | 0.654139006 | 0.264135272 | 0 | 0.887667048 | NaN | NaN |
4 | Chris Davis | 23232.0 | 23251.0 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -0.717101719 | -1.351934088 | -1.206729017 | -0.470360434 | -2.081351382 | 0.264135272 | 0 | -1.432371827 | NaN | a |
df.dtypes
Student object
ID float64
SIS User ID float64
SIS Login ID object
Section object
1a object
1b object
1c object
1d object
1e object
1f object
2 object
3 object
Column that is randomly blank. float64
Version object
dtype: object
We want to convert most of those columns to numeric values.
pd.to_numeric(df["1a"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~/miniconda3/envs/torch/lib/python3.8/site-packages/pandas/_libs/lib.pyx:2315, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "Out of 3"
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Input In [5], in <module>
----> 1 pd.to_numeric(df["1a"])
File ~/miniconda3/envs/torch/lib/python3.8/site-packages/pandas/core/tools/numeric.py:184, in to_numeric(arg, errors, downcast)
182 coerce_numeric = errors not in ("ignore", "raise")
183 try:
--> 184 values, _ = lib.maybe_convert_numeric(
185 values, set(), coerce_numeric=coerce_numeric
186 )
187 except (ValueError, TypeError):
188 if errors == "raise":
File ~/miniconda3/envs/torch/lib/python3.8/site-packages/pandas/_libs/lib.pyx:2357, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "Out of 3" at position 0
We can fix that error by using the errors="coerce" keyword argument, which tells pd.to_numeric to replace any unparseable value with NaN instead of raising an exception.
pd.to_numeric(df["1a"], errors="coerce")
0 NaN
1 NaN
2 -0.717102
3 1.223291
4 -0.717102
5 -0.717102
6 -0.717102
7 -0.717102
8 1.223291
9 -0.717102
10 -0.717102
11 -0.717102
12 -1.687298
13 -0.717102
14 1.223291
15 1.223291
16 -1.687298
17 1.223291
18 -0.717102
19 -0.717102
20 -0.717102
21 -0.717102
22 -0.717102
23 1.223291
24 1.223291
25 1.223291
26 1.223291
27 -0.717102
28 1.223291
29 1.223291
30 -0.717102
31 -0.717102
32 1.223291
33 -0.717102
34 -0.717102
35 1.223291
36 -0.717102
37 -0.717102
38 1.223291
39 -0.717102
40 1.223291
41 -0.717102
42 -0.717102
43 1.223291
44 1.223291
45 -0.717102
46 1.223291
47 -0.717102
Name: 1a, dtype: float64
If we want to do the same thing to all the columns from “1a” to “3”, we can use apply together with a lambda function.
df.loc[:,"1a":"3"].apply(lambda col: pd.to_numeric(col, errors="coerce"), axis=0)
1a | 1b | 1c | 1d | 1e | 1f | 2 | 3 | |
---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | -0.717102 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.887667 |
3 | 1.223291 | -1.351934 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 0.000000 | 0.887667 |
4 | -0.717102 | -1.351934 | -1.206729 | -0.470360 | -2.081351 | 0.264135 | 0.000000 | -1.432372 |
5 | -0.717102 | -1.351934 | 0.887970 | -0.470360 | 0.654139 | 0.264135 | 0.000000 | -0.968364 |
6 | -0.717102 | -1.351934 | -1.206729 | -0.470360 | 0.654139 | 0.264135 | 0.000000 | -1.432372 |
7 | -0.717102 | -0.620299 | -0.159379 | -0.470360 | 0.654139 | 0.264135 | -1.072381 | -0.968364 |
8 | 1.223291 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.887667 |
9 | -0.717102 | -0.620299 | -1.206729 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.423659 |
10 | -0.717102 | -1.351934 | 0.887970 | -1.912799 | 0.654139 | 0.264135 | 0.000000 | 0.887667 |
11 | -0.717102 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.887667 |
12 | -1.687298 | -1.351934 | -0.159379 | -0.470360 | -2.081351 | 0.264135 | -3.217142 | -0.968364 |
13 | -0.717102 | 0.842971 | -0.159379 | -0.470360 | -1.169521 | 0.264135 | -1.072381 | -1.432372 |
14 | 1.223291 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 0.000000 | 0.887667 |
15 | 1.223291 | 0.842971 | -1.206729 | -0.470360 | 0.654139 | -3.785939 | -1.072381 | -0.504356 |
16 | -1.687298 | -1.351934 | -1.206729 | -0.470360 | -2.081351 | -3.785939 | 0.000000 | -1.896380 |
17 | 1.223291 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.423659 |
18 | -0.717102 | -1.351934 | -0.159379 | -0.470360 | -1.169521 | 0.264135 | 1.072381 | 0.887667 |
19 | -0.717102 | -1.351934 | -0.159379 | -0.470360 | 0.654139 | 0.264135 | 0.000000 | -0.040349 |
20 | -0.717102 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 0.000000 | 0.887667 |
21 | -0.717102 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.887667 |
22 | -0.717102 | 0.842971 | 0.887970 | -0.470360 | 0.654139 | 0.264135 | 1.072381 | 0.887667 |
23 | 1.223291 | -1.351934 | -0.159379 | -0.470360 | -0.257691 | 0.264135 | 0.000000 | -0.504356 |
24 | 1.223291 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.887667 |
25 | 1.223291 | 0.842971 | -1.206729 | -0.470360 | 0.654139 | 0.264135 | 0.000000 | -0.504356 |
26 | 1.223291 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 0.000000 | 0.887667 |
27 | -0.717102 | 0.842971 | -1.206729 | -1.912799 | -1.169521 | 0.264135 | -1.072381 | -1.432372 |
28 | 1.223291 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 0.000000 | 0.887667 |
29 | 1.223291 | 0.842971 | -1.206729 | -1.912799 | 0.654139 | 0.264135 | 1.072381 | -0.040349 |
30 | -0.717102 | 0.842971 | -1.206729 | 0.972078 | -1.169521 | 0.264135 | 0.000000 | 0.887667 |
31 | -0.717102 | 0.842971 | -0.159379 | -0.470360 | 0.654139 | 0.264135 | -1.072381 | -0.968364 |
32 | 1.223291 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.887667 |
33 | -0.717102 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.887667 |
34 | -0.717102 | 0.842971 | -0.159379 | -0.470360 | -0.257691 | 0.264135 | -1.072381 | -0.968364 |
35 | 1.223291 | 0.842971 | -2.254079 | -0.470360 | 0.654139 | 0.264135 | 0.000000 | -0.040349 |
36 | -0.717102 | -1.351934 | -1.206729 | -0.470360 | -1.169521 | 0.264135 | -3.217142 | -0.040349 |
37 | -0.717102 | -1.351934 | -1.206729 | -0.470360 | 0.654139 | 0.264135 | 0.000000 | 0.887667 |
38 | 1.223291 | 0.842971 | -1.206729 | -0.470360 | -1.169521 | 0.264135 | 0.000000 | -0.504356 |
39 | -0.717102 | -1.351934 | -2.254079 | -3.355238 | -2.081351 | 0.264135 | -1.072381 | -1.896380 |
40 | 1.223291 | -0.620299 | 0.887970 | -0.470360 | -1.169521 | 0.264135 | -1.072381 | 0.887667 |
41 | -0.717102 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | -0.040349 |
42 | -0.717102 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 0.000000 | 0.887667 |
43 | 1.223291 | -0.620299 | 0.887970 | 0.972078 | -1.169521 | 0.264135 | 0.000000 | -0.040349 |
44 | 1.223291 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.887667 |
45 | -0.717102 | -0.620299 | 0.887970 | 0.972078 | 0.654139 | -3.785939 | 0.000000 | -0.040349 |
46 | 1.223291 | -1.351934 | -0.159379 | -0.470360 | -2.081351 | 0.264135 | 0.000000 | -2.824395 |
47 | -0.717102 | 0.111336 | 0.887970 | -0.470360 | 0.654139 | 0.264135 | 0.000000 | 0.887667 |
For this particular dataset, a much easier strategy is just to specify during the import that we want to skip the top two rows (after the header row).
df = pd.read_csv("../data/Midterm_scaled.csv", skiprows=[1,2])
df.head()
Student | ID | SIS User ID | SIS Login ID | Section | 1a | 1b | 1c | 1d | 1e | 1f | 2 | 3 | Column that is randomly blank. | Version | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -0.717102 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.887667 | NaN | a |
1 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | 1.223291 | -1.351934 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 0.000000 | 0.887667 | NaN | NaN |
2 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -0.717102 | -1.351934 | -1.206729 | -0.470360 | -2.081351 | 0.264135 | 0.000000 | -1.432372 | NaN | a |
3 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -0.717102 | -1.351934 | 0.887970 | -0.470360 | 0.654139 | 0.264135 | 0.000000 | -0.968364 | NaN | NaN |
4 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -0.717102 | -1.351934 | -1.206729 | -0.470360 | 0.654139 | 0.264135 | 0.000000 | -1.432372 | NaN | NaN |
The original dataset only specifies who had version “a”, not version “b”. Let’s fill in version “b” in place of the NaN values.
df["Version"] = df["Version"].fillna("b")
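To check that the fill worked, we could look at the values before and after. Here is a minimal sketch using a small made-up Series in place of the actual “Version” column:

```python
import pandas as pd

# Hypothetical miniature "Version" column: only version "a" was recorded.
version = pd.Series(["a", None, "a", None, None])

filled = version.fillna("b")  # every missing entry becomes "b"
print(filled.tolist())  # → ['a', 'b', 'a', 'b', 'b']
```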
Here are some examples using df.groupby. There are more examples below in the Worksheet portion.
This shows all possible combinations of (scaled) scores on problems “1a” and “1b”. For example, this shows that 3 students scored -0.717101719 on 1a and -0.62029917 on 1b.
for x,y in df.groupby(["1a","1b"]):
    print(f"The value of x is {x}")
    print(y.shape)
    print("")
The value of x is (-1.687298162, -1.351934088)
(2, 15)
The value of x is (-0.717101719, -1.351934088)
(9, 15)
The value of x is (-0.717101719, -0.62029917)
(3, 15)
The value of x is (-0.717101719, 0.111335748)
(1, 15)
The value of x is (-0.717101719, 0.842970667)
(13, 15)
The value of x is (1.223291168, -1.351934088)
(3, 15)
The value of x is (1.223291168, -0.62029917)
(2, 15)
The value of x is (1.223291168, 0.842970667)
(13, 15)
For our question of whether one version was easier than the other version, we are interested in grouping by “Version”.
for x,y in df.groupby("Version"):
    print(f"The value of x is {x}")
    print(y.shape)
The value of x is a
(21, 15)
The value of x is b
(25, 15)
In this code, y is a DataFrame. Notice, for example, how in the first displayed DataFrame the exams are all Version a.
for x,y in df.groupby("Version"):
    print(f"The value of x is {x}")
    display(y.head())
The value of x is a
Student | ID | SIS User ID | SIS Login ID | Section | 1a | 1b | 1c | 1d | 1e | 1f | 2 | 3 | Column that is randomly blank. | Version | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -0.717102 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.887667 | NaN | a |
2 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -0.717102 | -1.351934 | -1.206729 | -0.470360 | -2.081351 | 0.264135 | 0.000000 | -1.432372 | NaN | a |
5 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -0.717102 | -0.620299 | -0.159379 | -0.470360 | 0.654139 | 0.264135 | -1.072381 | -0.968364 | NaN | a |
11 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -0.717102 | 0.842971 | -0.159379 | -0.470360 | -1.169521 | 0.264135 | -1.072381 | -1.432372 | NaN | a |
14 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -1.687298 | -1.351934 | -1.206729 | -0.470360 | -2.081351 | -3.785939 | 0.000000 | -1.896380 | NaN | a |
The value of x is b
Student | ID | SIS User ID | SIS Login ID | Section | 1a | 1b | 1c | 1d | 1e | 1f | 2 | 3 | Column that is randomly blank. | Version | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | 1.223291 | -1.351934 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 0.000000 | 0.887667 | NaN | b |
3 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -0.717102 | -1.351934 | 0.887970 | -0.470360 | 0.654139 | 0.264135 | 0.000000 | -0.968364 | NaN | b |
4 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -0.717102 | -1.351934 | -1.206729 | -0.470360 | 0.654139 | 0.264135 | 0.000000 | -1.432372 | NaN | b |
6 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | 1.223291 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.887667 | NaN | b |
7 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -0.717102 | -0.620299 | -1.206729 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.423659 | NaN | b |
Instead of iterating over the different possibilities, we can also perform what is called an aggregation operation, such as taking the mean.
df.groupby("Version").mean()
ID | SIS User ID | 1a | 1b | 1c | 1d | 1e | 1f | 2 | 3 | Column that is randomly blank. | |
---|---|---|---|---|---|---|---|---|---|---|---|
Version | |||||||||||
a | 23232.0 | 23251.0 | -0.024104 | 0.355214 | -0.109506 | -0.126923 | -0.084009 | 0.071275 | 0.153197 | -0.018253 | NaN |
b | 23232.0 | 23251.0 | 0.020248 | -0.298380 | 0.091985 | 0.106615 | 0.070568 | -0.059871 | -0.128686 | 0.015332 | NaN |
It’s a little easier to read if we take the transpose.
df.groupby("Version").mean().T
Version | a | b |
---|---|---|
ID | 23232.000000 | 23232.000000 |
SIS User ID | 23251.000000 | 23251.000000 |
1a | -0.024104 | 0.020248 |
1b | 0.355214 | -0.298380 |
1c | -0.109506 | 0.091985 |
1d | -0.126923 | 0.106615 |
1e | -0.084009 | 0.070568 |
1f | 0.071275 | -0.059871 |
2 | 0.153197 | -0.128686 |
3 | -0.018253 | 0.015332 |
Column that is randomly blank. | NaN | NaN |
We can also apply formatting, converting each number into a string showing only three decimal places.
df.groupby("Version").mean().T.applymap(lambda x: f"{x:.3f}")
Version | a | b |
---|---|---|
ID | 23232.000 | 23232.000 |
SIS User ID | 23251.000 | 23251.000 |
1a | -0.024 | 0.020 |
1b | 0.355 | -0.298 |
1c | -0.110 | 0.092 |
1d | -0.127 | 0.107 |
1e | -0.084 | 0.071 |
1f | 0.071 | -0.060 |
2 | 0.153 | -0.129 |
3 | -0.018 | 0.015 |
Column that is randomly blank. | nan | nan |
Notice how the 1b value is significantly higher in the “a” column than in the “b” column. This is the reason that the “b” version of the exam was curved one point more than the “a” version.
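The same conclusion can be reached programmatically by subtracting the “b” row of the means table from the “a” row and looking for the largest gap. Here is a sketch with made-up scaled scores, not the real exam data:

```python
import pandas as pd

# Made-up scaled scores for two exam versions (not the real data).
df = pd.DataFrame({
    "Version": ["a", "a", "b", "b"],
    "1a": [0.1, -0.1, 0.05, -0.05],
    "1b": [0.9, 0.7, -0.8, -0.6],
})

means = df.groupby("Version").mean()
gap = (means.loc["a"] - means.loc["b"]).abs()
print(gap.idxmax())  # → 1b, the problem whose means differ most between versions
```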
Worksheet
(This worksheet contains some repetition from the portion above.)
df = sns.load_dataset("taxis")
df.dropna(inplace=True)
Practice with pandas groupby
We haven’t covered pandas groupby in Math 10 before today. This is a possible “extra topic” for the course project.
Here is an example using groupby. We also use f-strings.
for a,b in df.groupby("pickup_zone"):
    print(f"a is {a}")
    print(f"The type of b is {type(b)}")
    print(f"The shape of b is {b.shape}")
    break
a is Allerton/Pelham Gardens
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (2, 14)
If we instead wanted to see the first 5 groups, we could do something like the following. For example, this indicates that 65 rides began in the pickup zone “Astoria”.
counter = 0
for a,b in df.groupby("pickup_zone"):
    print(f"a is {a}")
    print(f"The type of b is {type(b)}")
    print(f"The shape of b is {b.shape}")
    print("")
    counter += 1
    if counter >= 5:
        break
a is Allerton/Pelham Gardens
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (2, 14)
a is Alphabet City
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (9, 14)
a is Astoria
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (65, 14)
a is Auburndale
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (1, 14)
a is Battery Park
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (1, 14)
You can also group by multiple categories. For example, the following indicates that only 4 rides in the dataset began in Bronx and finished in Brooklyn.
counter = 0
for a,b in df.groupby(["pickup_borough","dropoff_borough"]):
    print(f"a is {a}")
    print(f"The type of b is {type(b)}")
    print(f"The shape of b is {b.shape}")
    print("")
    counter += 1
    if counter >= 5:
        break
a is ('Bronx', 'Bronx')
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (66, 14)
a is ('Bronx', 'Brooklyn')
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (4, 14)
a is ('Bronx', 'Manhattan')
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (25, 14)
a is ('Bronx', 'Queens')
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (4, 14)
a is ('Brooklyn', 'Bronx')
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (5, 14)
Sample exercises:

1. For each pickup borough, using f-strings, print the average tip for rides that begin in that borough.
2. Try producing a sub-DataFrame, df_sub, which contains only the “distance”, “fare”, “tip”, and “pickup_zone” columns, and only the rows where the “tip” amount is greater than zero. Then execute df_sub.groupby("pickup_zone").mean(). What information is this providing?
3. Do the same thing as in the previous exercise, but find the maximum in each category instead of the average.
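As a hint for the first exercise, here is one possible approach, sketched with a small made-up DataFrame standing in for the taxis data:

```python
import pandas as pd

# Made-up stand-in for the taxis data, with just the columns we need.
df = pd.DataFrame({
    "pickup_borough": ["Manhattan", "Manhattan", "Queens", "Bronx"],
    "tip": [2.0, 4.0, 1.0, 0.0],
})

# One (borough, sub-DataFrame) pair per pickup borough.
for borough, sub in df.groupby("pickup_borough"):
    print(f"The average tip for rides beginning in {borough} is {sub['tip'].mean():.2f}")
```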
Practice with pandas styler
We haven’t covered pandas styler in Math 10. This is a possible “extra topic” for the course project.
Based on the pandas documentation.
As an example, we will color the cells blue for which the “pickup_zone” or “dropoff_zone” contains the word “Midtown”.
def make_blue(x):
    if "Midtown" in x:
        return 'color:white;background-color:darkblue'
    else:
        return None
You will have to scroll right to see the blue cells. We only display the first 20 rows.
df[:20].style.applymap(make_blue,subset=["pickup_zone","dropoff_zone"])
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.600000 | 7.000000 | 2.150000 | 0.000000 | 12.950000 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.790000 | 5.000000 | 0.000000 | 0.000000 | 9.300000 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.370000 | 7.500000 | 2.360000 | 0.000000 | 14.160000 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.700000 | 27.000000 | 6.150000 | 0.000000 | 36.950000 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.160000 | 9.000000 | 1.100000 | 0.000000 | 13.400000 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |
5 | 2019-03-11 10:37:23 | 2019-03-11 10:47:31 | 1 | 0.490000 | 7.500000 | 2.160000 | 0.000000 | 12.960000 | yellow | credit card | Times Sq/Theatre District | Midtown East | Manhattan | Manhattan |
6 | 2019-03-26 21:07:31 | 2019-03-26 21:17:29 | 1 | 3.650000 | 13.000000 | 2.000000 | 0.000000 | 18.800000 | yellow | credit card | Battery Park City | Two Bridges/Seward Park | Manhattan | Manhattan |
8 | 2019-03-23 11:48:50 | 2019-03-23 12:06:14 | 1 | 3.630000 | 15.000000 | 1.000000 | 0.000000 | 19.300000 | yellow | credit card | East Harlem South | Midtown Center | Manhattan | Manhattan |
9 | 2019-03-08 16:18:37 | 2019-03-08 16:26:57 | 1 | 1.520000 | 8.000000 | 1.000000 | 0.000000 | 13.300000 | yellow | credit card | Lincoln Square East | Central Park | Manhattan | Manhattan |
10 | 2019-03-16 10:02:25 | 2019-03-16 10:22:29 | 1 | 3.900000 | 17.000000 | 0.000000 | 0.000000 | 17.800000 | yellow | cash | LaGuardia Airport | Astoria | Queens | Queens |
11 | 2019-03-20 19:39:42 | 2019-03-20 19:45:36 | 1 | 1.530000 | 6.500000 | 2.160000 | 0.000000 | 12.960000 | yellow | credit card | Upper West Side South | Manhattan Valley | Manhattan | Manhattan |
12 | 2019-03-18 21:27:14 | 2019-03-18 21:34:16 | 1 | 1.050000 | 6.500000 | 1.000000 | 0.000000 | 11.300000 | yellow | credit card | Murray Hill | Midtown Center | Manhattan | Manhattan |
13 | 2019-03-19 07:55:25 | 2019-03-19 08:09:17 | 1 | 1.750000 | 10.500000 | 0.000000 | 0.000000 | 13.800000 | yellow | cash | Lincoln Square West | Times Sq/Theatre District | Manhattan | Manhattan |
14 | 2019-03-27 12:13:34 | 2019-03-27 12:25:48 | 0 | 2.900000 | 11.500000 | 0.000000 | 0.000000 | 14.800000 | yellow | cash | Financial District North | Two Bridges/Seward Park | Manhattan | Manhattan |
15 | 2019-03-16 17:53:34 | 2019-03-16 18:13:57 | 3 | 2.090000 | 13.500000 | 0.000000 | 0.000000 | 16.800000 | yellow | cash | Upper West Side North | Clinton East | Manhattan | Manhattan |
16 | 2019-03-15 12:36:05 | 2019-03-15 12:54:28 | 1 | 2.120000 | 13.000000 | 0.000000 | 0.000000 | 16.300000 | yellow | cash | East Chelsea | Meatpacking/West Village West | Manhattan | Manhattan |
17 | 2019-03-23 20:50:49 | 2019-03-23 21:02:07 | 1 | 2.600000 | 10.500000 | 2.000000 | 0.000000 | 16.300000 | yellow | credit card | Midtown Center | East Harlem South | Manhattan | Manhattan |
18 | 2019-03-27 06:28:36 | 2019-03-27 06:38:10 | 1 | 2.180000 | 9.500000 | 1.920000 | 0.000000 | 14.720000 | yellow | credit card | Gramercy | Midtown Center | Manhattan | Manhattan |
19 | 2019-03-25 22:04:25 | 2019-03-25 22:11:30 | 6 | 1.080000 | 6.500000 | 1.080000 | 0.000000 | 11.380000 | yellow | credit card | East Chelsea | East Chelsea | Manhattan | Manhattan |
20 | 2019-03-21 03:37:34 | 2019-03-21 03:44:13 | 1 | 1.070000 | 6.500000 | 1.540000 | 0.000000 | 11.840000 | yellow | credit card | Penn Station/Madison Sq West | Kips Bay | Manhattan | Manhattan |
Here is a similar example, but where we color every cell in the row a random color. Notice the use of f-strings.
rng = np.random.default_rng()
color_list = ["red","purple","orange","wheat","black","blue"]
prop_list = [f'color:white;background-color:{c}' for c in color_list]
def find_midtown(row):
    if ("Midtown" in row["dropoff_zone"]) or ("Midtown" in row["pickup_zone"]):
        return rng.choice(prop_list, size=len(row))
    else:
        return [None]*len(row)
df[:20].style.apply(find_midtown,axis=1)
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.600000 | 7.000000 | 2.150000 | 0.000000 | 12.950000 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.790000 | 5.000000 | 0.000000 | 0.000000 | 9.300000 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.370000 | 7.500000 | 2.360000 | 0.000000 | 14.160000 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.700000 | 27.000000 | 6.150000 | 0.000000 | 36.950000 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.160000 | 9.000000 | 1.100000 | 0.000000 | 13.400000 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |
5 | 2019-03-11 10:37:23 | 2019-03-11 10:47:31 | 1 | 0.490000 | 7.500000 | 2.160000 | 0.000000 | 12.960000 | yellow | credit card | Times Sq/Theatre District | Midtown East | Manhattan | Manhattan |
6 | 2019-03-26 21:07:31 | 2019-03-26 21:17:29 | 1 | 3.650000 | 13.000000 | 2.000000 | 0.000000 | 18.800000 | yellow | credit card | Battery Park City | Two Bridges/Seward Park | Manhattan | Manhattan |
8 | 2019-03-23 11:48:50 | 2019-03-23 12:06:14 | 1 | 3.630000 | 15.000000 | 1.000000 | 0.000000 | 19.300000 | yellow | credit card | East Harlem South | Midtown Center | Manhattan | Manhattan |
9 | 2019-03-08 16:18:37 | 2019-03-08 16:26:57 | 1 | 1.520000 | 8.000000 | 1.000000 | 0.000000 | 13.300000 | yellow | credit card | Lincoln Square East | Central Park | Manhattan | Manhattan |
10 | 2019-03-16 10:02:25 | 2019-03-16 10:22:29 | 1 | 3.900000 | 17.000000 | 0.000000 | 0.000000 | 17.800000 | yellow | cash | LaGuardia Airport | Astoria | Queens | Queens |
11 | 2019-03-20 19:39:42 | 2019-03-20 19:45:36 | 1 | 1.530000 | 6.500000 | 2.160000 | 0.000000 | 12.960000 | yellow | credit card | Upper West Side South | Manhattan Valley | Manhattan | Manhattan |
12 | 2019-03-18 21:27:14 | 2019-03-18 21:34:16 | 1 | 1.050000 | 6.500000 | 1.000000 | 0.000000 | 11.300000 | yellow | credit card | Murray Hill | Midtown Center | Manhattan | Manhattan |
13 | 2019-03-19 07:55:25 | 2019-03-19 08:09:17 | 1 | 1.750000 | 10.500000 | 0.000000 | 0.000000 | 13.800000 | yellow | cash | Lincoln Square West | Times Sq/Theatre District | Manhattan | Manhattan |
14 | 2019-03-27 12:13:34 | 2019-03-27 12:25:48 | 0 | 2.900000 | 11.500000 | 0.000000 | 0.000000 | 14.800000 | yellow | cash | Financial District North | Two Bridges/Seward Park | Manhattan | Manhattan |
15 | 2019-03-16 17:53:34 | 2019-03-16 18:13:57 | 3 | 2.090000 | 13.500000 | 0.000000 | 0.000000 | 16.800000 | yellow | cash | Upper West Side North | Clinton East | Manhattan | Manhattan |
16 | 2019-03-15 12:36:05 | 2019-03-15 12:54:28 | 1 | 2.120000 | 13.000000 | 0.000000 | 0.000000 | 16.300000 | yellow | cash | East Chelsea | Meatpacking/West Village West | Manhattan | Manhattan |
17 | 2019-03-23 20:50:49 | 2019-03-23 21:02:07 | 1 | 2.600000 | 10.500000 | 2.000000 | 0.000000 | 16.300000 | yellow | credit card | Midtown Center | East Harlem South | Manhattan | Manhattan |
18 | 2019-03-27 06:28:36 | 2019-03-27 06:38:10 | 1 | 2.180000 | 9.500000 | 1.920000 | 0.000000 | 14.720000 | yellow | credit card | Gramercy | Midtown Center | Manhattan | Manhattan |
19 | 2019-03-25 22:04:25 | 2019-03-25 22:11:30 | 6 | 1.080000 | 6.500000 | 1.080000 | 0.000000 | 11.380000 | yellow | credit card | East Chelsea | East Chelsea | Manhattan | Manhattan |
20 | 2019-03-21 03:37:34 | 2019-03-21 03:44:13 | 1 | 1.070000 | 6.500000 | 1.540000 | 0.000000 | 11.840000 | yellow | credit card | Penn Station/Madison Sq West | Kips Bay | Manhattan | Manhattan |
pandas styler sample exercises:
1. Try changing the text color to red for all rides where the fare was at least 10 dollars.
2. For all cells where the pickup time is between 11pm and midnight, try giving those cells a black background with white text.
3. For how many rides was the tip amount greater than 40% of the fare? Try coloring the entire row red for those rides.