Week 10, Monday
import seaborn as sns
import numpy as np
import pandas as pd
Lecture
We will practice analyzing and cleaning a dataset. This dataset contains scaled versions of the Midterm 2 scores.
Can you identify which problem needed to be curved differently between the two versions of the midterm?
df = pd.read_csv("../data/Midterm_scaled.csv")
df.head()
Student | ID | SIS User ID | SIS Login ID | Section | 1a | 1b | 1c | 1d | 1e | 1f | 2 | 3 | Column that is randomly blank. | Version | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | NaN | NaN | Out of 3 | Out of 3 | Out of 3 | Out of 3 | Out of 3 | Out of 3 | Out of 4 | Out of 8 | NaN | NaN |
1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Chris Davis | 23232.0 | 23251.0 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -0.717101719 | 0.842970667 | 0.887970408 | 0.972078231 | 0.654139006 | 0.264135272 | 1.072380529 | 0.887667048 | NaN | a |
3 | Chris Davis | 23232.0 | 23251.0 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | 1.223291168 | -1.351934088 | 0.887970408 | 0.972078231 | 0.654139006 | 0.264135272 | 0 | 0.887667048 | NaN | NaN |
4 | Chris Davis | 23232.0 | 23251.0 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -0.717101719 | -1.351934088 | -1.206729017 | -0.470360434 | -2.081351382 | 0.264135272 | 0 | -1.432371827 | NaN | a |
df.dtypes
Student object
ID float64
SIS User ID float64
SIS Login ID object
Section object
1a object
1b object
1c object
1d object
1e object
1f object
2 object
3 object
Column that is randomly blank. float64
Version object
dtype: object
We want to convert most of those columns to numeric values.
pd.to_numeric(df["1a"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~/miniconda3/envs/torch/lib/python3.8/site-packages/pandas/_libs/lib.pyx:2315, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "Out of 3"
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Input In [5], in <module>
----> 1 pd.to_numeric(df["1a"])
File ~/miniconda3/envs/torch/lib/python3.8/site-packages/pandas/core/tools/numeric.py:184, in to_numeric(arg, errors, downcast)
182 coerce_numeric = errors not in ("ignore", "raise")
183 try:
--> 184 values, _ = lib.maybe_convert_numeric(
185 values, set(), coerce_numeric=coerce_numeric
186 )
187 except (ValueError, TypeError):
188 if errors == "raise":
File ~/miniconda3/envs/torch/lib/python3.8/site-packages/pandas/_libs/lib.pyx:2357, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "Out of 3" at position 0
We can fix that error by using the errors="coerce" keyword argument, which tells pd.to_numeric to replace any unparseable value with NaN instead of raising an exception.
pd.to_numeric(df["1a"], errors="coerce")
0 NaN
1 NaN
2 -0.717102
3 1.223291
4 -0.717102
5 -0.717102
6 -0.717102
7 -0.717102
8 1.223291
9 -0.717102
10 -0.717102
11 -0.717102
12 -1.687298
13 -0.717102
14 1.223291
15 1.223291
16 -1.687298
17 1.223291
18 -0.717102
19 -0.717102
20 -0.717102
21 -0.717102
22 -0.717102
23 1.223291
24 1.223291
25 1.223291
26 1.223291
27 -0.717102
28 1.223291
29 1.223291
30 -0.717102
31 -0.717102
32 1.223291
33 -0.717102
34 -0.717102
35 1.223291
36 -0.717102
37 -0.717102
38 1.223291
39 -0.717102
40 1.223291
41 -0.717102
42 -0.717102
43 1.223291
44 1.223291
45 -0.717102
46 1.223291
47 -0.717102
Name: 1a, dtype: float64
If we want to do the same thing to all the columns from “1a” to “3”, we can use apply together with a lambda function.
df.loc[:,"1a":"3"].apply(lambda col: pd.to_numeric(col, errors="coerce"), axis=0)
1a | 1b | 1c | 1d | 1e | 1f | 2 | 3 | |
---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | -0.717102 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.887667 |
3 | 1.223291 | -1.351934 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 0.000000 | 0.887667 |
4 | -0.717102 | -1.351934 | -1.206729 | -0.470360 | -2.081351 | 0.264135 | 0.000000 | -1.432372 |
5 | -0.717102 | -1.351934 | 0.887970 | -0.470360 | 0.654139 | 0.264135 | 0.000000 | -0.968364 |
6 | -0.717102 | -1.351934 | -1.206729 | -0.470360 | 0.654139 | 0.264135 | 0.000000 | -1.432372 |
7 | -0.717102 | -0.620299 | -0.159379 | -0.470360 | 0.654139 | 0.264135 | -1.072381 | -0.968364 |
8 | 1.223291 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.887667 |
9 | -0.717102 | -0.620299 | -1.206729 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.423659 |
10 | -0.717102 | -1.351934 | 0.887970 | -1.912799 | 0.654139 | 0.264135 | 0.000000 | 0.887667 |
11 | -0.717102 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.887667 |
12 | -1.687298 | -1.351934 | -0.159379 | -0.470360 | -2.081351 | 0.264135 | -3.217142 | -0.968364 |
13 | -0.717102 | 0.842971 | -0.159379 | -0.470360 | -1.169521 | 0.264135 | -1.072381 | -1.432372 |
14 | 1.223291 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 0.000000 | 0.887667 |
15 | 1.223291 | 0.842971 | -1.206729 | -0.470360 | 0.654139 | -3.785939 | -1.072381 | -0.504356 |
16 | -1.687298 | -1.351934 | -1.206729 | -0.470360 | -2.081351 | -3.785939 | 0.000000 | -1.896380 |
17 | 1.223291 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.423659 |
18 | -0.717102 | -1.351934 | -0.159379 | -0.470360 | -1.169521 | 0.264135 | 1.072381 | 0.887667 |
19 | -0.717102 | -1.351934 | -0.159379 | -0.470360 | 0.654139 | 0.264135 | 0.000000 | -0.040349 |
20 | -0.717102 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 0.000000 | 0.887667 |
21 | -0.717102 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.887667 |
22 | -0.717102 | 0.842971 | 0.887970 | -0.470360 | 0.654139 | 0.264135 | 1.072381 | 0.887667 |
23 | 1.223291 | -1.351934 | -0.159379 | -0.470360 | -0.257691 | 0.264135 | 0.000000 | -0.504356 |
24 | 1.223291 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.887667 |
25 | 1.223291 | 0.842971 | -1.206729 | -0.470360 | 0.654139 | 0.264135 | 0.000000 | -0.504356 |
26 | 1.223291 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 0.000000 | 0.887667 |
27 | -0.717102 | 0.842971 | -1.206729 | -1.912799 | -1.169521 | 0.264135 | -1.072381 | -1.432372 |
28 | 1.223291 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 0.000000 | 0.887667 |
29 | 1.223291 | 0.842971 | -1.206729 | -1.912799 | 0.654139 | 0.264135 | 1.072381 | -0.040349 |
30 | -0.717102 | 0.842971 | -1.206729 | 0.972078 | -1.169521 | 0.264135 | 0.000000 | 0.887667 |
31 | -0.717102 | 0.842971 | -0.159379 | -0.470360 | 0.654139 | 0.264135 | -1.072381 | -0.968364 |
32 | 1.223291 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.887667 |
33 | -0.717102 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.887667 |
34 | -0.717102 | 0.842971 | -0.159379 | -0.470360 | -0.257691 | 0.264135 | -1.072381 | -0.968364 |
35 | 1.223291 | 0.842971 | -2.254079 | -0.470360 | 0.654139 | 0.264135 | 0.000000 | -0.040349 |
36 | -0.717102 | -1.351934 | -1.206729 | -0.470360 | -1.169521 | 0.264135 | -3.217142 | -0.040349 |
37 | -0.717102 | -1.351934 | -1.206729 | -0.470360 | 0.654139 | 0.264135 | 0.000000 | 0.887667 |
38 | 1.223291 | 0.842971 | -1.206729 | -0.470360 | -1.169521 | 0.264135 | 0.000000 | -0.504356 |
39 | -0.717102 | -1.351934 | -2.254079 | -3.355238 | -2.081351 | 0.264135 | -1.072381 | -1.896380 |
40 | 1.223291 | -0.620299 | 0.887970 | -0.470360 | -1.169521 | 0.264135 | -1.072381 | 0.887667 |
41 | -0.717102 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | -0.040349 |
42 | -0.717102 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 0.000000 | 0.887667 |
43 | 1.223291 | -0.620299 | 0.887970 | 0.972078 | -1.169521 | 0.264135 | 0.000000 | -0.040349 |
44 | 1.223291 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.887667 |
45 | -0.717102 | -0.620299 | 0.887970 | 0.972078 | 0.654139 | -3.785939 | 0.000000 | -0.040349 |
46 | 1.223291 | -1.351934 | -0.159379 | -0.470360 | -2.081351 | 0.264135 | 0.000000 | -2.824395 |
47 | -0.717102 | 0.111336 | 0.887970 | -0.470360 | 0.654139 | 0.264135 | 0.000000 | 0.887667 |
For this particular dataset, a much easier strategy is just to specify during the import that we want to skip the top two rows (after the header row).
df = pd.read_csv("../data/Midterm_scaled.csv", skiprows=[1,2])
df.head()
Student | ID | SIS User ID | SIS Login ID | Section | 1a | 1b | 1c | 1d | 1e | 1f | 2 | 3 | Column that is randomly blank. | Version | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -0.717102 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.887667 | NaN | a |
1 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | 1.223291 | -1.351934 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 0.000000 | 0.887667 | NaN | NaN |
2 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -0.717102 | -1.351934 | -1.206729 | -0.470360 | -2.081351 | 0.264135 | 0.000000 | -1.432372 | NaN | a |
3 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -0.717102 | -1.351934 | 0.887970 | -0.470360 | 0.654139 | 0.264135 | 0.000000 | -0.968364 | NaN | NaN |
4 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -0.717102 | -1.351934 | -1.206729 | -0.470360 | 0.654139 | 0.264135 | 0.000000 | -1.432372 | NaN | NaN |
The original dataset only specifies who had version “a”, not version “b”. Let’s fill in version “b” in place of the NaN values.
df["Version"] = df["Version"].fillna("b")
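To check that the fill worked, we could look at the values before and after. Here is a minimal sketch using a small made-up Series in place of the actual “Version” column:

```python
import pandas as pd

# Hypothetical miniature "Version" column: only version "a" was recorded.
version = pd.Series(["a", None, "a", None, None])

filled = version.fillna("b")  # every missing entry becomes "b"
print(filled.tolist())  # → ['a', 'b', 'a', 'b', 'b']
```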
Here are some examples using df.groupby. There are more examples below in the Worksheet portion.
This shows all possible combinations of (scaled) scores on problems “1a” and “1b”. For example, this shows that 3 students scored -0.717101719 on 1a and -0.62029917 on 1b.
for x,y in df.groupby(["1a","1b"]):
    print(f"The value of x is {x}")
    print(y.shape)
    print("")
The value of x is (-1.687298162, -1.351934088)
(2, 15)
The value of x is (-0.717101719, -1.351934088)
(9, 15)
The value of x is (-0.717101719, -0.62029917)
(3, 15)
The value of x is (-0.717101719, 0.111335748)
(1, 15)
The value of x is (-0.717101719, 0.842970667)
(13, 15)
The value of x is (1.223291168, -1.351934088)
(3, 15)
The value of x is (1.223291168, -0.62029917)
(2, 15)
The value of x is (1.223291168, 0.842970667)
(13, 15)
For our question of whether one version was easier than the other version, we are interested in grouping by “Version”.
for x,y in df.groupby("Version"):
    print(f"The value of x is {x}")
    print(y.shape)
The value of x is a
(21, 15)
The value of x is b
(25, 15)
In this code, y is a DataFrame. Notice, for example, how in the first displayed DataFrame the exams are all Version a.
for x,y in df.groupby("Version"):
    print(f"The value of x is {x}")
    display(y.head())
The value of x is a
Student | ID | SIS User ID | SIS Login ID | Section | 1a | 1b | 1c | 1d | 1e | 1f | 2 | 3 | Column that is randomly blank. | Version | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -0.717102 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.887667 | NaN | a |
2 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -0.717102 | -1.351934 | -1.206729 | -0.470360 | -2.081351 | 0.264135 | 0.000000 | -1.432372 | NaN | a |
5 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -0.717102 | -0.620299 | -0.159379 | -0.470360 | 0.654139 | 0.264135 | -1.072381 | -0.968364 | NaN | a |
11 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -0.717102 | 0.842971 | -0.159379 | -0.470360 | -1.169521 | 0.264135 | -1.072381 | -1.432372 | NaN | a |
14 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -1.687298 | -1.351934 | -1.206729 | -0.470360 | -2.081351 | -3.785939 | 0.000000 | -1.896380 | NaN | a |
The value of x is b
Student | ID | SIS User ID | SIS Login ID | Section | 1a | 1b | 1c | 1d | 1e | 1f | 2 | 3 | Column that is randomly blank. | Version | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | 1.223291 | -1.351934 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 0.000000 | 0.887667 | NaN | b |
3 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -0.717102 | -1.351934 | 0.887970 | -0.470360 | 0.654139 | 0.264135 | 0.000000 | -0.968364 | NaN | b |
4 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -0.717102 | -1.351934 | -1.206729 | -0.470360 | 0.654139 | 0.264135 | 0.000000 | -1.432372 | NaN | b |
6 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | 1.223291 | 0.842971 | 0.887970 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.887667 | NaN | b |
7 | Chris Davis | 23232 | 23251 | daviscj | MATH 10 LEC A: INTRO PROG DATA SCI (44870) | -0.717102 | -0.620299 | -1.206729 | 0.972078 | 0.654139 | 0.264135 | 1.072381 | 0.423659 | NaN | b |
Instead of iterating over the different possibilities, we can also perform what is called an aggregation operation, such as taking the mean.
df.groupby("Version").mean()
ID | SIS User ID | 1a | 1b | 1c | 1d | 1e | 1f | 2 | 3 | Column that is randomly blank. | |
---|---|---|---|---|---|---|---|---|---|---|---|
Version | |||||||||||
a | 23232.0 | 23251.0 | -0.024104 | 0.355214 | -0.109506 | -0.126923 | -0.084009 | 0.071275 | 0.153197 | -0.018253 | NaN |
b | 23232.0 | 23251.0 | 0.020248 | -0.298380 | 0.091985 | 0.106615 | 0.070568 | -0.059871 | -0.128686 | 0.015332 | NaN |
It’s a little easier to read if we take the transpose.
df.groupby("Version").mean().T
Version | a | b |
---|---|---|
ID | 23232.000000 | 23232.000000 |
SIS User ID | 23251.000000 | 23251.000000 |
1a | -0.024104 | 0.020248 |
1b | 0.355214 | -0.298380 |
1c | -0.109506 | 0.091985 |
1d | -0.126923 | 0.106615 |
1e | -0.084009 | 0.070568 |
1f | 0.071275 | -0.059871 |
2 | 0.153197 | -0.128686 |
3 | -0.018253 | 0.015332 |
Column that is randomly blank. | NaN | NaN |
We can also apply formatting, converting each number into a string showing only three decimal places.
df.groupby("Version").mean().T.applymap(lambda x: f"{x:.3f}")
Version | a | b |
---|---|---|
ID | 23232.000 | 23232.000 |
SIS User ID | 23251.000 | 23251.000 |
1a | -0.024 | 0.020 |
1b | 0.355 | -0.298 |
1c | -0.110 | 0.092 |
1d | -0.127 | 0.107 |
1e | -0.084 | 0.071 |
1f | 0.071 | -0.060 |
2 | 0.153 | -0.129 |
3 | -0.018 | 0.015 |
Column that is randomly blank. | nan | nan |
Notice how the 1b value is significantly higher in the “a” column than in the “b” column. This is the reason that the “b” version of the exam was curved one point more than the “a” version.
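The same conclusion can be reached programmatically by subtracting the “b” row of the means table from the “a” row and looking for the largest gap. Here is a sketch with made-up scaled scores, not the real exam data:

```python
import pandas as pd

# Made-up scaled scores for two exam versions (not the real data).
df = pd.DataFrame({
    "Version": ["a", "a", "b", "b"],
    "1a": [0.1, -0.1, 0.05, -0.05],
    "1b": [0.9, 0.7, -0.8, -0.6],
})

means = df.groupby("Version").mean()
gap = (means.loc["a"] - means.loc["b"]).abs()
print(gap.idxmax())  # → 1b, the problem whose means differ most between versions
```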
Worksheet
(This worksheet contains some repetition from the portion above.)
df = sns.load_dataset("taxis")
df.dropna(inplace=True)
Practice with pandas groupby
We haven’t covered pandas groupby in Math 10 before today. This is a possible “extra topic” for the course project.
Here is an example using groupby. We also use f-strings.
for a,b in df.groupby("pickup_zone"):
    print(f"a is {a}")
    print(f"The type of b is {type(b)}")
    print(f"The shape of b is {b.shape}")
    break
a is Allerton/Pelham Gardens
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (2, 14)
If we instead wanted to see the first 5 groups, we could do something like the following. For example, this indicates that 65 rides began in the pickup zone “Astoria”.
counter = 0
for a,b in df.groupby("pickup_zone"):
    print(f"a is {a}")
    print(f"The type of b is {type(b)}")
    print(f"The shape of b is {b.shape}")
    print("")
    counter += 1
    if counter >= 5:
        break
a is Allerton/Pelham Gardens
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (2, 14)
a is Alphabet City
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (9, 14)
a is Astoria
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (65, 14)
a is Auburndale
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (1, 14)
a is Battery Park
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (1, 14)
You can also group by multiple categories. For example, the following indicates that only 4 rides in the dataset began in Bronx and finished in Brooklyn.
counter = 0
for a,b in df.groupby(["pickup_borough","dropoff_borough"]):
    print(f"a is {a}")
    print(f"The type of b is {type(b)}")
    print(f"The shape of b is {b.shape}")
    print("")
    counter += 1
    if counter >= 5:
        break
a is ('Bronx', 'Bronx')
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (66, 14)
a is ('Bronx', 'Brooklyn')
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (4, 14)
a is ('Bronx', 'Manhattan')
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (25, 14)
a is ('Bronx', 'Queens')
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (4, 14)
a is ('Brooklyn', 'Bronx')
The type of b is <class 'pandas.core.frame.DataFrame'>
The shape of b is (5, 14)
Sample exercises:

1. For each pickup borough, using f-strings, print the average tip for rides that begin in that borough.
2. Try producing a sub-DataFrame, df_sub, which contains only the “distance”, “fare”, “tip”, and “pickup_zone” columns, and only the rows where the “tip” amount is greater than zero. Then execute df_sub.groupby("pickup_zone").mean(). What information is this providing?
3. Do the same thing as in the previous exercise, but find the maximum in each category instead of the average.
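As a hint for the first exercise, here is one possible approach, sketched with a small made-up DataFrame standing in for the taxis data:

```python
import pandas as pd

# Made-up stand-in for the taxis data, with just the columns we need.
df = pd.DataFrame({
    "pickup_borough": ["Manhattan", "Manhattan", "Queens", "Bronx"],
    "tip": [2.0, 4.0, 1.0, 0.0],
})

# One (borough, sub-DataFrame) pair per pickup borough.
for borough, sub in df.groupby("pickup_borough"):
    print(f"The average tip for rides beginning in {borough} is {sub['tip'].mean():.2f}")
```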
Practice with pandas styler
We haven’t covered pandas styler in Math 10. This is a possible “extra topic” for the course project.
Based on the pandas documentation.
As an example, we will color the cells blue for which the “pickup_zone” or “dropoff_zone” contains the word “Midtown”.
def make_blue(x):
    if "Midtown" in x:
        return 'color:white;background-color:darkblue'
    else:
        return None
You will have to scroll right to see the blue cells. We only display the first 20 rows.
df[:20].style.applymap(make_blue,subset=["pickup_zone","dropoff_zone"])
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.600000 | 7.000000 | 2.150000 | 0.000000 | 12.950000 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.790000 | 5.000000 | 0.000000 | 0.000000 | 9.300000 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.370000 | 7.500000 | 2.360000 | 0.000000 | 14.160000 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.700000 | 27.000000 | 6.150000 | 0.000000 | 36.950000 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.160000 | 9.000000 | 1.100000 | 0.000000 | 13.400000 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |
5 | 2019-03-11 10:37:23 | 2019-03-11 10:47:31 | 1 | 0.490000 | 7.500000 | 2.160000 | 0.000000 | 12.960000 | yellow | credit card | Times Sq/Theatre District | Midtown East | Manhattan | Manhattan |
6 | 2019-03-26 21:07:31 | 2019-03-26 21:17:29 | 1 | 3.650000 | 13.000000 | 2.000000 | 0.000000 | 18.800000 | yellow | credit card | Battery Park City | Two Bridges/Seward Park | Manhattan | Manhattan |
8 | 2019-03-23 11:48:50 | 2019-03-23 12:06:14 | 1 | 3.630000 | 15.000000 | 1.000000 | 0.000000 | 19.300000 | yellow | credit card | East Harlem South | Midtown Center | Manhattan | Manhattan |
9 | 2019-03-08 16:18:37 | 2019-03-08 16:26:57 | 1 | 1.520000 | 8.000000 | 1.000000 | 0.000000 | 13.300000 | yellow | credit card | Lincoln Square East | Central Park | Manhattan | Manhattan |
10 | 2019-03-16 10:02:25 | 2019-03-16 10:22:29 | 1 | 3.900000 | 17.000000 | 0.000000 | 0.000000 | 17.800000 | yellow | cash | LaGuardia Airport | Astoria | Queens | Queens |
11 | 2019-03-20 19:39:42 | 2019-03-20 19:45:36 | 1 | 1.530000 | 6.500000 | 2.160000 | 0.000000 | 12.960000 | yellow | credit card | Upper West Side South | Manhattan Valley | Manhattan | Manhattan |
12 | 2019-03-18 21:27:14 | 2019-03-18 21:34:16 | 1 | 1.050000 | 6.500000 | 1.000000 | 0.000000 | 11.300000 | yellow | credit card | Murray Hill | Midtown Center | Manhattan | Manhattan |
13 | 2019-03-19 07:55:25 | 2019-03-19 08:09:17 | 1 | 1.750000 | 10.500000 | 0.000000 | 0.000000 | 13.800000 | yellow | cash | Lincoln Square West | Times Sq/Theatre District | Manhattan | Manhattan |
14 | 2019-03-27 12:13:34 | 2019-03-27 12:25:48 | 0 | 2.900000 | 11.500000 | 0.000000 | 0.000000 | 14.800000 | yellow | cash | Financial District North | Two Bridges/Seward Park | Manhattan | Manhattan |
15 | 2019-03-16 17:53:34 | 2019-03-16 18:13:57 | 3 | 2.090000 | 13.500000 | 0.000000 | 0.000000 | 16.800000 | yellow | cash | Upper West Side North | Clinton East | Manhattan | Manhattan |
16 | 2019-03-15 12:36:05 | 2019-03-15 12:54:28 | 1 | 2.120000 | 13.000000 | 0.000000 | 0.000000 | 16.300000 | yellow | cash | East Chelsea | Meatpacking/West Village West | Manhattan | Manhattan |
17 | 2019-03-23 20:50:49 | 2019-03-23 21:02:07 | 1 | 2.600000 | 10.500000 | 2.000000 | 0.000000 | 16.300000 | yellow | credit card | Midtown Center | East Harlem South | Manhattan | Manhattan |
18 | 2019-03-27 06:28:36 | 2019-03-27 06:38:10 | 1 | 2.180000 | 9.500000 | 1.920000 | 0.000000 | 14.720000 | yellow | credit card | Gramercy | Midtown Center | Manhattan | Manhattan |
19 | 2019-03-25 22:04:25 | 2019-03-25 22:11:30 | 6 | 1.080000 | 6.500000 | 1.080000 | 0.000000 | 11.380000 | yellow | credit card | East Chelsea | East Chelsea | Manhattan | Manhattan |
20 | 2019-03-21 03:37:34 | 2019-03-21 03:44:13 | 1 | 1.070000 | 6.500000 | 1.540000 | 0.000000 | 11.840000 | yellow | credit card | Penn Station/Madison Sq West | Kips Bay | Manhattan | Manhattan |
Here is a similar example, but where we color every cell in the row a random color. Notice the use of f-strings.
rng = np.random.default_rng()
color_list = ["red","purple","orange","wheat","black","blue"]
prop_list = [f'color:white;background-color:{c}' for c in color_list]
def find_midtown(row):
    if ("Midtown" in row["dropoff_zone"]) or ("Midtown" in row["pickup_zone"]):
        return rng.choice(prop_list, size=len(row))
    else:
        return [None]*len(row)
df[:20].style.apply(find_midtown,axis=1)
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.600000 | 7.000000 | 2.150000 | 0.000000 | 12.950000 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.790000 | 5.000000 | 0.000000 | 0.000000 | 9.300000 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.370000 | 7.500000 | 2.360000 | 0.000000 | 14.160000 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.700000 | 27.000000 | 6.150000 | 0.000000 | 36.950000 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.160000 | 9.000000 | 1.100000 | 0.000000 | 13.400000 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |
5 | 2019-03-11 10:37:23 | 2019-03-11 10:47:31 | 1 | 0.490000 | 7.500000 | 2.160000 | 0.000000 | 12.960000 | yellow | credit card | Times Sq/Theatre District | Midtown East | Manhattan | Manhattan |
6 | 2019-03-26 21:07:31 | 2019-03-26 21:17:29 | 1 | 3.650000 | 13.000000 | 2.000000 | 0.000000 | 18.800000 | yellow | credit card | Battery Park City | Two Bridges/Seward Park | Manhattan | Manhattan |
8 | 2019-03-23 11:48:50 | 2019-03-23 12:06:14 | 1 | 3.630000 | 15.000000 | 1.000000 | 0.000000 | 19.300000 | yellow | credit card | East Harlem South | Midtown Center | Manhattan | Manhattan |
9 | 2019-03-08 16:18:37 | 2019-03-08 16:26:57 | 1 | 1.520000 | 8.000000 | 1.000000 | 0.000000 | 13.300000 | yellow | credit card | Lincoln Square East | Central Park | Manhattan | Manhattan |
10 | 2019-03-16 10:02:25 | 2019-03-16 10:22:29 | 1 | 3.900000 | 17.000000 | 0.000000 | 0.000000 | 17.800000 | yellow | cash | LaGuardia Airport | Astoria | Queens | Queens |
11 | 2019-03-20 19:39:42 | 2019-03-20 19:45:36 | 1 | 1.530000 | 6.500000 | 2.160000 | 0.000000 | 12.960000 | yellow | credit card | Upper West Side South | Manhattan Valley | Manhattan | Manhattan |
12 | 2019-03-18 21:27:14 | 2019-03-18 21:34:16 | 1 | 1.050000 | 6.500000 | 1.000000 | 0.000000 | 11.300000 | yellow | credit card | Murray Hill | Midtown Center | Manhattan | Manhattan |
13 | 2019-03-19 07:55:25 | 2019-03-19 08:09:17 | 1 | 1.750000 | 10.500000 | 0.000000 | 0.000000 | 13.800000 | yellow | cash | Lincoln Square West | Times Sq/Theatre District | Manhattan | Manhattan |
14 | 2019-03-27 12:13:34 | 2019-03-27 12:25:48 | 0 | 2.900000 | 11.500000 | 0.000000 | 0.000000 | 14.800000 | yellow | cash | Financial District North | Two Bridges/Seward Park | Manhattan | Manhattan |
15 | 2019-03-16 17:53:34 | 2019-03-16 18:13:57 | 3 | 2.090000 | 13.500000 | 0.000000 | 0.000000 | 16.800000 | yellow | cash | Upper West Side North | Clinton East | Manhattan | Manhattan |
16 | 2019-03-15 12:36:05 | 2019-03-15 12:54:28 | 1 | 2.120000 | 13.000000 | 0.000000 | 0.000000 | 16.300000 | yellow | cash | East Chelsea | Meatpacking/West Village West | Manhattan | Manhattan |
17 | 2019-03-23 20:50:49 | 2019-03-23 21:02:07 | 1 | 2.600000 | 10.500000 | 2.000000 | 0.000000 | 16.300000 | yellow | credit card | Midtown Center | East Harlem South | Manhattan | Manhattan |
18 | 2019-03-27 06:28:36 | 2019-03-27 06:38:10 | 1 | 2.180000 | 9.500000 | 1.920000 | 0.000000 | 14.720000 | yellow | credit card | Gramercy | Midtown Center | Manhattan | Manhattan |
19 | 2019-03-25 22:04:25 | 2019-03-25 22:11:30 | 6 | 1.080000 | 6.500000 | 1.080000 | 0.000000 | 11.380000 | yellow | credit card | East Chelsea | East Chelsea | Manhattan | Manhattan |
20 | 2019-03-21 03:37:34 | 2019-03-21 03:44:13 | 1 | 1.070000 | 6.500000 | 1.540000 | 0.000000 | 11.840000 | yellow | credit card | Penn Station/Madison Sq West | Kips Bay | Manhattan | Manhattan |
pandas styler sample exercises:
1. Try changing the text color to red for all rides where the fare was at least 10 dollars.
2. For all cells where the pickup time is between 11pm and midnight, try giving those cells a black background with white text.
3. For how many rides was the tip amount greater than 40% of the fare? Try coloring the entire row red for those rides.