Week 10, Monday

YuJa recording

import seaborn as sns
import numpy as np
import pandas as pd

Lecture

We will practice analyzing and cleaning a dataset. This dataset contains scaled versions of the Midterm 2 scores.

  • Can you identify which problem needed to be curved differently between the two versions of the midterm?

df = pd.read_csv("../data/Midterm_scaled.csv")
df.head()
Student ID SIS User ID SIS Login ID Section 1a 1b 1c 1d 1e 1f 2 3 Column that is randomly blank. Version
0 NaN NaN NaN NaN NaN Out of 3 Out of 3 Out of 3 Out of 3 Out of 3 Out of 3 Out of 4 Out of 8 NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Chris Davis 23232.0 23251.0 daviscj MATH 10 LEC A: INTRO PROG DATA SCI (44870) -0.717101719 0.842970667 0.887970408 0.972078231 0.654139006 0.264135272 1.072380529 0.887667048 NaN a
3 Chris Davis 23232.0 23251.0 daviscj MATH 10 LEC A: INTRO PROG DATA SCI (44870) 1.223291168 -1.351934088 0.887970408 0.972078231 0.654139006 0.264135272 0 0.887667048 NaN NaN
4 Chris Davis 23232.0 23251.0 daviscj MATH 10 LEC A: INTRO PROG DATA SCI (44870) -0.717101719 -1.351934088 -1.206729017 -0.470360434 -2.081351382 0.264135272 0 -1.432371827 NaN a
df.dtypes
Student                             object
ID                                 float64
SIS User ID                        float64
SIS Login ID                        object
Section                             object
1a                                  object
1b                                  object
1c                                  object
1d                                  object
1e                                  object
1f                                  object
2                                   object
3                                   object
Column that is randomly blank.     float64
Version                             object
dtype: object

We want to convert most of those columns to numeric values.

pd.to_numeric(df["1a"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/miniconda3/envs/torch/lib/python3.8/site-packages/pandas/_libs/lib.pyx:2315, in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "Out of 3"

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Input In [5], in <module>
----> 1 pd.to_numeric(df["1a"])

File ~/miniconda3/envs/torch/lib/python3.8/site-packages/pandas/core/tools/numeric.py:184, in to_numeric(arg, errors, downcast)
    182 coerce_numeric = errors not in ("ignore", "raise")
    183 try:
--> 184     values, _ = lib.maybe_convert_numeric(
    185         values, set(), coerce_numeric=coerce_numeric
    186     )
    187 except (ValueError, TypeError):
    188     if errors == "raise":

File ~/miniconda3/envs/torch/lib/python3.8/site-packages/pandas/_libs/lib.pyx:2357, in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "Out of 3" at position 0

We can fix that error with the errors keyword argument: passing errors="coerce" replaces any value that cannot be parsed, such as the string "Out of 3", with NaN.

pd.to_numeric(df["1a"], errors="coerce")
0          NaN
1          NaN
2    -0.717102
3     1.223291
4    -0.717102
5    -0.717102
6    -0.717102
7    -0.717102
8     1.223291
9    -0.717102
10   -0.717102
11   -0.717102
12   -1.687298
13   -0.717102
14    1.223291
15    1.223291
16   -1.687298
17    1.223291
18   -0.717102
19   -0.717102
20   -0.717102
21   -0.717102
22   -0.717102
23    1.223291
24    1.223291
25    1.223291
26    1.223291
27   -0.717102
28    1.223291
29    1.223291
30   -0.717102
31   -0.717102
32    1.223291
33   -0.717102
34   -0.717102
35    1.223291
36   -0.717102
37   -0.717102
38    1.223291
39   -0.717102
40    1.223291
41   -0.717102
42   -0.717102
43    1.223291
44    1.223291
45   -0.717102
46    1.223291
47   -0.717102
Name: 1a, dtype: float64
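To see the difference the keyword makes, here is a small sketch on a toy Series (not the actual midterm data) contrasting the default behavior with errors="coerce":

```python
import pandas as pd

s = pd.Series(["Out of 3", "1.5", "-0.7"])  # toy data imitating the header-row issue

# errors="coerce" turns unparseable entries into NaN instead of raising
converted = pd.to_numeric(s, errors="coerce")
print(converted)

# the default, errors="raise", fails on the first bad string
try:
    pd.to_numeric(s)
except ValueError as e:
    print("raised:", e)
```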

If we want to do the same thing to all the columns from “1a” to “3”, we can use apply and a lambda function.

df.loc[:,"1a":"3"].apply(lambda col: pd.to_numeric(col, errors="coerce"), axis=0)
1a 1b 1c 1d 1e 1f 2 3
0 NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN
2 -0.717102 0.842971 0.887970 0.972078 0.654139 0.264135 1.072381 0.887667
3 1.223291 -1.351934 0.887970 0.972078 0.654139 0.264135 0.000000 0.887667
4 -0.717102 -1.351934 -1.206729 -0.470360 -2.081351 0.264135 0.000000 -1.432372
5 -0.717102 -1.351934 0.887970 -0.470360 0.654139 0.264135 0.000000 -0.968364
6 -0.717102 -1.351934 -1.206729 -0.470360 0.654139 0.264135 0.000000 -1.432372
7 -0.717102 -0.620299 -0.159379 -0.470360 0.654139 0.264135 -1.072381 -0.968364
8 1.223291 0.842971 0.887970 0.972078 0.654139 0.264135 1.072381 0.887667
9 -0.717102 -0.620299 -1.206729 0.972078 0.654139 0.264135 1.072381 0.423659
10 -0.717102 -1.351934 0.887970 -1.912799 0.654139 0.264135 0.000000 0.887667
11 -0.717102 0.842971 0.887970 0.972078 0.654139 0.264135 1.072381 0.887667
12 -1.687298 -1.351934 -0.159379 -0.470360 -2.081351 0.264135 -3.217142 -0.968364
13 -0.717102 0.842971 -0.159379 -0.470360 -1.169521 0.264135 -1.072381 -1.432372
14 1.223291 0.842971 0.887970 0.972078 0.654139 0.264135 0.000000 0.887667
15 1.223291 0.842971 -1.206729 -0.470360 0.654139 -3.785939 -1.072381 -0.504356
16 -1.687298 -1.351934 -1.206729 -0.470360 -2.081351 -3.785939 0.000000 -1.896380
17 1.223291 0.842971 0.887970 0.972078 0.654139 0.264135 1.072381 0.423659
18 -0.717102 -1.351934 -0.159379 -0.470360 -1.169521 0.264135 1.072381 0.887667
19 -0.717102 -1.351934 -0.159379 -0.470360 0.654139 0.264135 0.000000 -0.040349
20 -0.717102 0.842971 0.887970 0.972078 0.654139 0.264135 0.000000 0.887667
21 -0.717102 0.842971 0.887970 0.972078 0.654139 0.264135 1.072381 0.887667
22 -0.717102 0.842971 0.887970 -0.470360 0.654139 0.264135 1.072381 0.887667
23 1.223291 -1.351934 -0.159379 -0.470360 -0.257691 0.264135 0.000000 -0.504356
24 1.223291 0.842971 0.887970 0.972078 0.654139 0.264135 1.072381 0.887667
25 1.223291 0.842971 -1.206729 -0.470360 0.654139 0.264135 0.000000 -0.504356
26 1.223291 0.842971 0.887970 0.972078 0.654139 0.264135 0.000000 0.887667
27 -0.717102 0.842971 -1.206729 -1.912799 -1.169521 0.264135 -1.072381 -1.432372
28 1.223291 0.842971 0.887970 0.972078 0.654139 0.264135 0.000000 0.887667
29 1.223291 0.842971 -1.206729 -1.912799 0.654139 0.264135 1.072381 -0.040349
30 -0.717102 0.842971 -1.206729 0.972078 -1.169521 0.264135 0.000000 0.887667
31 -0.717102 0.842971 -0.159379 -0.470360 0.654139 0.264135 -1.072381 -0.968364
32 1.223291 0.842971 0.887970 0.972078 0.654139 0.264135 1.072381 0.887667
33 -0.717102 0.842971 0.887970 0.972078 0.654139 0.264135 1.072381 0.887667
34 -0.717102 0.842971 -0.159379 -0.470360 -0.257691 0.264135 -1.072381 -0.968364
35 1.223291 0.842971 -2.254079 -0.470360 0.654139 0.264135 0.000000 -0.040349
36 -0.717102 -1.351934 -1.206729 -0.470360 -1.169521 0.264135 -3.217142 -0.040349
37 -0.717102 -1.351934 -1.206729 -0.470360 0.654139 0.264135 0.000000 0.887667
38 1.223291 0.842971 -1.206729 -0.470360 -1.169521 0.264135 0.000000 -0.504356
39 -0.717102 -1.351934 -2.254079 -3.355238 -2.081351 0.264135 -1.072381 -1.896380
40 1.223291 -0.620299 0.887970 -0.470360 -1.169521 0.264135 -1.072381 0.887667
41 -0.717102 0.842971 0.887970 0.972078 0.654139 0.264135 1.072381 -0.040349
42 -0.717102 0.842971 0.887970 0.972078 0.654139 0.264135 0.000000 0.887667
43 1.223291 -0.620299 0.887970 0.972078 -1.169521 0.264135 0.000000 -0.040349
44 1.223291 0.842971 0.887970 0.972078 0.654139 0.264135 1.072381 0.887667
45 -0.717102 -0.620299 0.887970 0.972078 0.654139 -3.785939 0.000000 -0.040349
46 1.223291 -1.351934 -0.159379 -0.470360 -2.081351 0.264135 0.000000 -2.824395
47 -0.717102 0.111336 0.887970 -0.470360 0.654139 0.264135 0.000000 0.887667
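Since apply forwards extra keyword arguments to the function it is given, the lambda is not strictly necessary. A sketch on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"1a": ["Out of 3", "0.5"], "1b": ["Out of 3", "-1.2"]})

# apply passes errors="coerce" along to pd.to_numeric for each column
out = df.apply(pd.to_numeric, errors="coerce")
print(out)
```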

For this particular dataset, a much easier strategy is to tell read_csv during the import to skip the two rows after the header row, using the skiprows keyword argument.

df = pd.read_csv("../data/Midterm_scaled.csv", skiprows=[1,2])
df.head()
Student ID SIS User ID SIS Login ID Section 1a 1b 1c 1d 1e 1f 2 3 Column that is randomly blank. Version
0 Chris Davis 23232 23251 daviscj MATH 10 LEC A: INTRO PROG DATA SCI (44870) -0.717102 0.842971 0.887970 0.972078 0.654139 0.264135 1.072381 0.887667 NaN a
1 Chris Davis 23232 23251 daviscj MATH 10 LEC A: INTRO PROG DATA SCI (44870) 1.223291 -1.351934 0.887970 0.972078 0.654139 0.264135 0.000000 0.887667 NaN NaN
2 Chris Davis 23232 23251 daviscj MATH 10 LEC A: INTRO PROG DATA SCI (44870) -0.717102 -1.351934 -1.206729 -0.470360 -2.081351 0.264135 0.000000 -1.432372 NaN a
3 Chris Davis 23232 23251 daviscj MATH 10 LEC A: INTRO PROG DATA SCI (44870) -0.717102 -1.351934 0.887970 -0.470360 0.654139 0.264135 0.000000 -0.968364 NaN NaN
4 Chris Davis 23232 23251 daviscj MATH 10 LEC A: INTRO PROG DATA SCI (44870) -0.717102 -1.351934 -1.206729 -0.470360 0.654139 0.264135 0.000000 -1.432372 NaN NaN
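To see how skiprows works without the actual file, here is a sketch using an in-memory CSV (via io.StringIO) that imitates the Midterm file's layout:

```python
import io
import pandas as pd

# header row, then two junk rows, then the data
csv_text = (
    "Student,1a\n"
    ",Out of 3\n"
    ",\n"
    "Chris Davis,-0.717\n"
)

# skiprows=[1,2] skips the two rows after the header (row 0),
# so "1a" is parsed as float64 directly
df_demo = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 2])
print(df_demo.dtypes)
```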

The original dataset only specifies who had version “a”, not version “b”. Let’s fill in version “b” in place of the NaN values.

df["Version"] = df["Version"].fillna("b")
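One quick way to check that the fillna worked is value_counts. A sketch on a toy Series standing in for the "Version" column:

```python
import pandas as pd

version = pd.Series(["a", None, "a", None])  # toy stand-in for the Version column

filled = version.fillna("b")
# value_counts confirms that every missing value became "b"
print(filled.value_counts())
```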

Here are some examples using df.groupby. There are more examples below in the Worksheet portion.

This shows all possible combinations of (scaled) scores on problems “1a” and “1b”. For example, this shows that 3 students scored -0.717101719 on 1a and -0.62029917 on 1b.

for x,y in df.groupby(["1a","1b"]):
    print(f"The value of x is {x}")
    print(y.shape)
    print("")
The value of x is (-1.687298162, -1.351934088)
(2, 15)

The value of x is (-0.717101719, -1.351934088)
(9, 15)

The value of x is (-0.717101719, -0.62029917)
(3, 15)

The value of x is (-0.717101719, 0.111335748)
(1, 15)

The value of x is (-0.717101719, 0.842970667)
(13, 15)

The value of x is (1.223291168, -1.351934088)
(3, 15)

The value of x is (1.223291168, -0.62029917)
(2, 15)

The value of x is (1.223291168, 0.842970667)
(13, 15)
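If we only want the group sizes, not the sub-DataFrames, groupby followed by size() returns the same counts without an explicit loop. A sketch on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"1a": [0.1, 0.1, 0.5], "1b": [1.0, 1.0, 2.0]})

# size() returns the number of rows in each (1a, 1b) group
counts = df.groupby(["1a", "1b"]).size()
print(counts)
```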

For our question of whether one version was easier than the other version, we are interested in grouping by “Version”.

for x,y in df.groupby("Version"):
    print(f"The value of x is {x}")
    print(y.shape)
The value of x is a
(21, 15)
The value of x is b
(25, 15)

In this code, y is a DataFrame. Notice, for example, how in the first displayed DataFrame the exams are all Version “a”.

for x,y in df.groupby("Version"):
    print(f"The value of x is {x}")
    display(y.head())
The value of x is a
Student ID SIS User ID SIS Login ID Section 1a 1b 1c 1d 1e 1f 2 3 Column that is randomly blank. Version
0 Chris Davis 23232 23251 daviscj MATH 10 LEC A: INTRO PROG DATA SCI (44870) -0.717102 0.842971 0.887970 0.972078 0.654139 0.264135 1.072381 0.887667 NaN a
2 Chris Davis 23232 23251 daviscj MATH 10 LEC A: INTRO PROG DATA SCI (44870) -0.717102 -1.351934 -1.206729 -0.470360 -2.081351 0.264135 0.000000 -1.432372 NaN a
5 Chris Davis 23232 23251 daviscj MATH 10 LEC A: INTRO PROG DATA SCI (44870) -0.717102 -0.620299 -0.159379 -0.470360 0.654139 0.264135 -1.072381 -0.968364 NaN a
11 Chris Davis 23232 23251 daviscj MATH 10 LEC A: INTRO PROG DATA SCI (44870) -0.717102 0.842971 -0.159379 -0.470360 -1.169521 0.264135 -1.072381 -1.432372 NaN a
14 Chris Davis 23232 23251 daviscj MATH 10 LEC A: INTRO PROG DATA SCI (44870) -1.687298 -1.351934 -1.206729 -0.470360 -2.081351 -3.785939 0.000000 -1.896380 NaN a
The value of x is b
Student ID SIS User ID SIS Login ID Section 1a 1b 1c 1d 1e 1f 2 3 Column that is randomly blank. Version
1 Chris Davis 23232 23251 daviscj MATH 10 LEC A: INTRO PROG DATA SCI (44870) 1.223291 -1.351934 0.887970 0.972078 0.654139 0.264135 0.000000 0.887667 NaN b
3 Chris Davis 23232 23251 daviscj MATH 10 LEC A: INTRO PROG DATA SCI (44870) -0.717102 -1.351934 0.887970 -0.470360 0.654139 0.264135 0.000000 -0.968364 NaN b
4 Chris Davis 23232 23251 daviscj MATH 10 LEC A: INTRO PROG DATA SCI (44870) -0.717102 -1.351934 -1.206729 -0.470360 0.654139 0.264135 0.000000 -1.432372 NaN b
6 Chris Davis 23232 23251 daviscj MATH 10 LEC A: INTRO PROG DATA SCI (44870) 1.223291 0.842971 0.887970 0.972078 0.654139 0.264135 1.072381 0.887667 NaN b
7 Chris Davis 23232 23251 daviscj MATH 10 LEC A: INTRO PROG DATA SCI (44870) -0.717102 -0.620299 -1.206729 0.972078 0.654139 0.264135 1.072381 0.423659 NaN b

Instead of iterating over the different possibilities, we can also perform what is called an aggregation operation, such as taking the mean.

df.groupby("Version").mean()
ID SIS User ID 1a 1b 1c 1d 1e 1f 2 3 Column that is randomly blank.
Version
a 23232.0 23251.0 -0.024104 0.355214 -0.109506 -0.126923 -0.084009 0.071275 0.153197 -0.018253 NaN
b 23232.0 23251.0 0.020248 -0.298380 0.091985 0.106615 0.070568 -0.059871 -0.128686 0.015332 NaN
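If we want several aggregations at once (for example the mean together with the group size), we can use agg instead of mean. A sketch on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Version": ["a", "a", "b"], "1a": [1.0, 3.0, 2.0]})

# agg computes several aggregations at once; here the mean and count per version
summary = df.groupby("Version")["1a"].agg(["mean", "count"])
print(summary)
```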

It’s a little easier to read if we take the transpose.

df.groupby("Version").mean().T
Version a b
ID 23232.000000 23232.000000
SIS User ID 23251.000000 23251.000000
1a -0.024104 0.020248
1b 0.355214 -0.298380
1c -0.109506 0.091985
1d -0.126923 0.106615
1e -0.084009 0.070568
1f 0.071275 -0.059871
2 0.153197 -0.128686
3 -0.018253 0.015332
Column that is randomly blank. NaN NaN

We can also format these values, displaying only three decimal places, by applying an f-string to every entry with applymap. Note that this converts each entry into a string.

df.groupby("Version").mean().T.applymap(lambda x: f"{x:.3f}")
Version a b
ID 23232.000 23232.000
SIS User ID 23251.000 23251.000
1a -0.024 0.020
1b 0.355 -0.298
1c -0.110 0.092
1d -0.127 0.107
1e -0.084 0.071
1f 0.071 -0.060
2 0.153 -0.129
3 -0.018 0.015
Column that is randomly blank. nan nan

Notice how the 1b value is noticeably higher in the “a” column than in the “b” column. This is why the “b” version of the exam was curved one point more than the “a” version.

Worksheet

(This worksheet contains some repetition from the portion above.)

df = sns.load_dataset("taxis")
df.dropna(inplace=True)

Practice with pandas groupby

We haven’t covered pandas groupby in Math 10 before today. This is a possible “extra topic” for the course project.

Here is an example using groupby. We also use f-strings.

for a,b in df.groupby("pickup_zone"):
    print(f"a is {a}")
    print(f"The type of b is {type(b)}")
    print(f"The shape of b is {b.shape}")
    break
a is Allerton/Pelham Gardens
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (2, 14)
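If we only want the sub-DataFrame for one particular group, get_group retrieves it directly, without looping. A sketch on a toy DataFrame standing in for the taxis data:

```python
import pandas as pd

df = pd.DataFrame({"pickup_zone": ["Astoria", "Astoria", "Auburndale"],
                   "fare": [10.0, 12.0, 30.0]})

# get_group returns just the rows for the named group
astoria = df.groupby("pickup_zone").get_group("Astoria")
print(astoria.shape)
```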

If we instead wanted to get the first 5 groups, we could do something like the following. For example, this indicates that 65 rides began in the pickup zone “Astoria”.

counter = 0

for a,b in df.groupby("pickup_zone"):
    print(f"a is {a}")
    print(f"The type of b is {type(b)}")
    print(f"The shape of b is {b.shape}")
    print("")
    counter += 1

    if counter >= 5:
        break
a is Allerton/Pelham Gardens
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (2, 14)

a is Alphabet City
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (9, 14)

a is Astoria
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (65, 14)

a is Auburndale
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (1, 14)

a is Battery Park
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (1, 14)
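An alternative to the manual counter is itertools.islice, which cuts off any iterator after a fixed number of items. A sketch on a toy DataFrame:

```python
from itertools import islice

import pandas as pd

df = pd.DataFrame({"zone": list("aabbccdd"), "x": range(8)})

# islice takes only the first 2 groups, with no counter variable
for name, sub in islice(df.groupby("zone"), 2):
    print(name, sub.shape)
```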

You can also group by multiple columns. For example, the following indicates that only 4 rides in the dataset began in the Bronx and ended in Brooklyn.

counter = 0

for a,b in df.groupby(["pickup_borough","dropoff_borough"]):
    print(f"a is {a}")
    print(f"The type of b is {type(b)}")
    print(f"The shape of b is {b.shape}")
    print("")
    counter += 1

    if counter >= 5:
        break
a is ('Bronx', 'Bronx')
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (66, 14)

a is ('Bronx', 'Brooklyn')
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (4, 14)

a is ('Bronx', 'Manhattan')
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (25, 14)

a is ('Bronx', 'Queens')
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (4, 14)

a is ('Brooklyn', 'Bronx')
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (5, 14)
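To see the counts for every (pickup, dropoff) pair at once, we can follow size() with unstack, which pivots the second grouping level into columns. A sketch on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"pickup_borough": ["Bronx", "Bronx", "Queens"],
                   "dropoff_borough": ["Bronx", "Brooklyn", "Bronx"]})

# rows are pickup boroughs, columns are dropoff boroughs,
# entries are the number of rides for that pair
table = df.groupby(["pickup_borough", "dropoff_borough"]).size().unstack(fill_value=0)
print(table)
```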

Sample exercises:

  1. For each pickup borough, using f-strings, print the average tip for rides that begin in that borough.

  2. Try producing a sub-DataFrame, df_sub, which contains only the “distance”, “fare”, “tip”, and “pickup_zone” columns, and which contains only rows where the “tip” amount is greater than zero. Then execute df_sub.groupby("pickup_zone").mean(). What information is this providing?

  3. Do the same thing as in the previous exercise, but instead find what the maximum was in each category, instead of the average.
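As a hint for the first exercise, one possible approach (sketched on a toy DataFrame, not the real taxis data) combines groupby with an f-string:

```python
import pandas as pd

# toy stand-in for the taxis DataFrame
df = pd.DataFrame({"pickup_borough": ["Bronx", "Bronx", "Queens"],
                   "tip": [1.0, 3.0, 2.0]})

# one borough per iteration; sub holds only that borough's rides
for borough, sub in df.groupby("pickup_borough"):
    print(f"The average tip for rides beginning in {borough} is {sub['tip'].mean():.2f}")
```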

Practice with pandas styler

We haven’t covered pandas styler in Math 10. This is a possible “extra topic” for the course project.

Based on the pandas documentation.

As an example, we will give a blue background to the cells in the “pickup_zone” and “dropoff_zone” columns that contain the word “Midtown”.

def make_blue(x):
    # x is a single cell value; return a CSS string (or None to leave it unstyled)
    if "Midtown" in x:
        return 'color:white;background-color:darkblue'
    else:
        return None

You will have to scroll right to see the blue cells. We only display the first 20 rows.

df[:20].style.applymap(make_blue,subset=["pickup_zone","dropoff_zone"])
  pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
0 2019-03-23 20:21:09 2019-03-23 20:27:24 1 1.600000 7.000000 2.150000 0.000000 12.950000 yellow credit card Lenox Hill West UN/Turtle Bay South Manhattan Manhattan
1 2019-03-04 16:11:55 2019-03-04 16:19:00 1 0.790000 5.000000 0.000000 0.000000 9.300000 yellow cash Upper West Side South Upper West Side South Manhattan Manhattan
2 2019-03-27 17:53:01 2019-03-27 18:00:25 1 1.370000 7.500000 2.360000 0.000000 14.160000 yellow credit card Alphabet City West Village Manhattan Manhattan
3 2019-03-10 01:23:59 2019-03-10 01:49:51 1 7.700000 27.000000 6.150000 0.000000 36.950000 yellow credit card Hudson Sq Yorkville West Manhattan Manhattan
4 2019-03-30 13:27:42 2019-03-30 13:37:14 3 2.160000 9.000000 1.100000 0.000000 13.400000 yellow credit card Midtown East Yorkville West Manhattan Manhattan
5 2019-03-11 10:37:23 2019-03-11 10:47:31 1 0.490000 7.500000 2.160000 0.000000 12.960000 yellow credit card Times Sq/Theatre District Midtown East Manhattan Manhattan
6 2019-03-26 21:07:31 2019-03-26 21:17:29 1 3.650000 13.000000 2.000000 0.000000 18.800000 yellow credit card Battery Park City Two Bridges/Seward Park Manhattan Manhattan
8 2019-03-23 11:48:50 2019-03-23 12:06:14 1 3.630000 15.000000 1.000000 0.000000 19.300000 yellow credit card East Harlem South Midtown Center Manhattan Manhattan
9 2019-03-08 16:18:37 2019-03-08 16:26:57 1 1.520000 8.000000 1.000000 0.000000 13.300000 yellow credit card Lincoln Square East Central Park Manhattan Manhattan
10 2019-03-16 10:02:25 2019-03-16 10:22:29 1 3.900000 17.000000 0.000000 0.000000 17.800000 yellow cash LaGuardia Airport Astoria Queens Queens
11 2019-03-20 19:39:42 2019-03-20 19:45:36 1 1.530000 6.500000 2.160000 0.000000 12.960000 yellow credit card Upper West Side South Manhattan Valley Manhattan Manhattan
12 2019-03-18 21:27:14 2019-03-18 21:34:16 1 1.050000 6.500000 1.000000 0.000000 11.300000 yellow credit card Murray Hill Midtown Center Manhattan Manhattan
13 2019-03-19 07:55:25 2019-03-19 08:09:17 1 1.750000 10.500000 0.000000 0.000000 13.800000 yellow cash Lincoln Square West Times Sq/Theatre District Manhattan Manhattan
14 2019-03-27 12:13:34 2019-03-27 12:25:48 0 2.900000 11.500000 0.000000 0.000000 14.800000 yellow cash Financial District North Two Bridges/Seward Park Manhattan Manhattan
15 2019-03-16 17:53:34 2019-03-16 18:13:57 3 2.090000 13.500000 0.000000 0.000000 16.800000 yellow cash Upper West Side North Clinton East Manhattan Manhattan
16 2019-03-15 12:36:05 2019-03-15 12:54:28 1 2.120000 13.000000 0.000000 0.000000 16.300000 yellow cash East Chelsea Meatpacking/West Village West Manhattan Manhattan
17 2019-03-23 20:50:49 2019-03-23 21:02:07 1 2.600000 10.500000 2.000000 0.000000 16.300000 yellow credit card Midtown Center East Harlem South Manhattan Manhattan
18 2019-03-27 06:28:36 2019-03-27 06:38:10 1 2.180000 9.500000 1.920000 0.000000 14.720000 yellow credit card Gramercy Midtown Center Manhattan Manhattan
19 2019-03-25 22:04:25 2019-03-25 22:11:30 6 1.080000 6.500000 1.080000 0.000000 11.380000 yellow credit card East Chelsea East Chelsea Manhattan Manhattan
20 2019-03-21 03:37:34 2019-03-21 03:44:13 1 1.070000 6.500000 1.540000 0.000000 11.840000 yellow credit card Penn Station/Madison Sq West Kips Bay Manhattan Manhattan

Here is a similar example, but where every cell in a matching row receives a randomly chosen color. Notice the use of f-strings when building the list of style strings.

rng = np.random.default_rng()
color_list = ["red","purple","orange","wheat","black","blue"]
prop_list = [f'color:white;background-color:{c}' for c in color_list]

def find_midtown(row):
    # row is a full row of the DataFrame; style it only if it involves Midtown
    if ("Midtown" in row["dropoff_zone"]) or ("Midtown" in row["pickup_zone"]):
        return rng.choice(prop_list, size=len(row))
    else:
        return [None]*len(row)

df[:20].style.apply(find_midtown, axis=1)
  pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
0 2019-03-23 20:21:09 2019-03-23 20:27:24 1 1.600000 7.000000 2.150000 0.000000 12.950000 yellow credit card Lenox Hill West UN/Turtle Bay South Manhattan Manhattan
1 2019-03-04 16:11:55 2019-03-04 16:19:00 1 0.790000 5.000000 0.000000 0.000000 9.300000 yellow cash Upper West Side South Upper West Side South Manhattan Manhattan
2 2019-03-27 17:53:01 2019-03-27 18:00:25 1 1.370000 7.500000 2.360000 0.000000 14.160000 yellow credit card Alphabet City West Village Manhattan Manhattan
3 2019-03-10 01:23:59 2019-03-10 01:49:51 1 7.700000 27.000000 6.150000 0.000000 36.950000 yellow credit card Hudson Sq Yorkville West Manhattan Manhattan
4 2019-03-30 13:27:42 2019-03-30 13:37:14 3 2.160000 9.000000 1.100000 0.000000 13.400000 yellow credit card Midtown East Yorkville West Manhattan Manhattan
5 2019-03-11 10:37:23 2019-03-11 10:47:31 1 0.490000 7.500000 2.160000 0.000000 12.960000 yellow credit card Times Sq/Theatre District Midtown East Manhattan Manhattan
6 2019-03-26 21:07:31 2019-03-26 21:17:29 1 3.650000 13.000000 2.000000 0.000000 18.800000 yellow credit card Battery Park City Two Bridges/Seward Park Manhattan Manhattan
8 2019-03-23 11:48:50 2019-03-23 12:06:14 1 3.630000 15.000000 1.000000 0.000000 19.300000 yellow credit card East Harlem South Midtown Center Manhattan Manhattan
9 2019-03-08 16:18:37 2019-03-08 16:26:57 1 1.520000 8.000000 1.000000 0.000000 13.300000 yellow credit card Lincoln Square East Central Park Manhattan Manhattan
10 2019-03-16 10:02:25 2019-03-16 10:22:29 1 3.900000 17.000000 0.000000 0.000000 17.800000 yellow cash LaGuardia Airport Astoria Queens Queens
11 2019-03-20 19:39:42 2019-03-20 19:45:36 1 1.530000 6.500000 2.160000 0.000000 12.960000 yellow credit card Upper West Side South Manhattan Valley Manhattan Manhattan
12 2019-03-18 21:27:14 2019-03-18 21:34:16 1 1.050000 6.500000 1.000000 0.000000 11.300000 yellow credit card Murray Hill Midtown Center Manhattan Manhattan
13 2019-03-19 07:55:25 2019-03-19 08:09:17 1 1.750000 10.500000 0.000000 0.000000 13.800000 yellow cash Lincoln Square West Times Sq/Theatre District Manhattan Manhattan
14 2019-03-27 12:13:34 2019-03-27 12:25:48 0 2.900000 11.500000 0.000000 0.000000 14.800000 yellow cash Financial District North Two Bridges/Seward Park Manhattan Manhattan
15 2019-03-16 17:53:34 2019-03-16 18:13:57 3 2.090000 13.500000 0.000000 0.000000 16.800000 yellow cash Upper West Side North Clinton East Manhattan Manhattan
16 2019-03-15 12:36:05 2019-03-15 12:54:28 1 2.120000 13.000000 0.000000 0.000000 16.300000 yellow cash East Chelsea Meatpacking/West Village West Manhattan Manhattan
17 2019-03-23 20:50:49 2019-03-23 21:02:07 1 2.600000 10.500000 2.000000 0.000000 16.300000 yellow credit card Midtown Center East Harlem South Manhattan Manhattan
18 2019-03-27 06:28:36 2019-03-27 06:38:10 1 2.180000 9.500000 1.920000 0.000000 14.720000 yellow credit card Gramercy Midtown Center Manhattan Manhattan
19 2019-03-25 22:04:25 2019-03-25 22:11:30 6 1.080000 6.500000 1.080000 0.000000 11.380000 yellow credit card East Chelsea East Chelsea Manhattan Manhattan
20 2019-03-21 03:37:34 2019-03-21 03:44:13 1 1.070000 6.500000 1.540000 0.000000 11.840000 yellow credit card Penn Station/Madison Sq West Kips Bay Manhattan Manhattan

pandas styler sample exercises:

  1. Try changing the text color to red on all rides where the fare was at least 10 dollars.

  2. For all cells where the pickup time is between 11pm and midnight, try giving those cells a black background with white text.

  3. For how many rides was the tip amount greater than 40% of the fare? Try coloring the entire row for those rides in red.