Week 4 Friday#
Announcements#
Plan for today: about 15 minutes to work on the worksheets, fill out your notecard, and study for the midterm. Then lecture. Let me know what you want to review!
The midterm is Monday. A sample midterm and solutions are posted on the Week 5 page on Canvas.
Let me know if you still need a notecard.
The best way to study is to go over the worksheets, quizzes, and the sample midterm. Next priority would be the lecture notes.
I briefly mention lots of topics. If you’re not sure if something could appear on the midterm, ask on Ed Discussion and I’ll answer there.
Worksheets 7 and 8 are both due Monday night (the usual schedule). Working on them will help you prepare for the midterm.
Leftovers from Wednesday#
Here is material I had prepared for Wednesday. I am equally happy to discuss whatever you think is most helpful to review.
Load the “taxis” dataset from Seaborn and drop rows with missing values.
import seaborn as sns
import altair as alt
import pandas as pd
df = sns.load_dataset("taxis")
df = df.dropna(axis=0)
Here is a reminder of how the dataset looks.
df.head()
| | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |
Multiply the values in the columns from “distance” to “tolls” by 10.
The following is a common mistake. With this syntax, pandas is looking for rows with the names “distance” to “tolls”, and since there aren’t any, the result is an “empty” DataFrame.
df.loc["distance":"tolls"]
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Here is the “right” way to do this. Notice the `:` inside the square brackets, which says we want “every row”.
df.loc[:, "distance":"tolls"]*10
| | distance | fare | tip | tolls |
---|---|---|---|---|
0 | 16.0 | 70.0 | 21.5 | 0.0 |
1 | 7.9 | 50.0 | 0.0 | 0.0 |
2 | 13.7 | 75.0 | 23.6 | 0.0 |
3 | 77.0 | 270.0 | 61.5 | 0.0 |
4 | 21.6 | 90.0 | 11.0 | 0.0 |
... | ... | ... | ... | ... |
6428 | 7.5 | 45.0 | 10.6 | 0.0 |
6429 | 187.4 | 580.0 | 0.0 | 0.0 |
6430 | 41.4 | 160.0 | 0.0 | 0.0 |
6431 | 11.2 | 60.0 | 0.0 | 0.0 |
6432 | 38.5 | 150.0 | 33.6 | 0.0 |
6341 rows × 4 columns
Here is another way, using a lambda function and `applymap`. This is definitely more confusing than simply doing `*10` like we did above; I am just showing it as an example of using `applymap` to perform this basic task.
df.loc[:, "distance":"tolls"].applymap(lambda x: 10*x)
| | distance | fare | tip | tolls |
---|---|---|---|---|
0 | 16.0 | 70.0 | 21.5 | 0.0 |
1 | 7.9 | 50.0 | 0.0 | 0.0 |
2 | 13.7 | 75.0 | 23.6 | 0.0 |
3 | 77.0 | 270.0 | 61.5 | 0.0 |
4 | 21.6 | 90.0 | 11.0 | 0.0 |
... | ... | ... | ... | ... |
6428 | 7.5 | 45.0 | 10.6 | 0.0 |
6429 | 187.4 | 580.0 | 0.0 | 0.0 |
6430 | 41.4 | 160.0 | 0.0 | 0.0 |
6431 | 11.2 | 60.0 | 0.0 | 0.0 |
6432 | 38.5 | 150.0 | 33.6 | 0.0 |
6341 rows × 4 columns
The following doesn’t work because `iloc` slicing should be done using integers, not strings, so we need to get rid of the quotation marks.
df.iloc[:, "3":"7"]*10
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [9], line 1
----> 1 df.iloc[:, "3":"7"]*10
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:889, in _LocationIndexer.__getitem__(self, key)
886 with suppress(KeyError, IndexError, AttributeError):
887 # AttributeError for IntervalTree get_value
888 return self.obj._get_value(*key, takeable=self._takeable)
--> 889 return self._getitem_tuple(key)
890 else:
891 # we by definition only have the 0th axis
892 axis = self.axis or 0
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:1454, in _iLocIndexer._getitem_tuple(self, tup)
1451 with suppress(IndexingError):
1452 return self._getitem_lowerdim(tup)
-> 1454 return self._getitem_tuple_same_dim(tup)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:775, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
772 if com.is_null_slice(key):
773 continue
--> 775 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
776 # We should never have retval.ndim < self.ndim, as that should
777 # be handled by the _getitem_lowerdim call above.
778 assert retval.ndim == self.ndim
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:1481, in _iLocIndexer._getitem_axis(self, key, axis)
1479 def _getitem_axis(self, key, axis: int):
1480 if isinstance(key, slice):
-> 1481 return self._get_slice_axis(key, axis=axis)
1483 if isinstance(key, list):
1484 key = np.asarray(key)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:1513, in _iLocIndexer._get_slice_axis(self, slice_obj, axis)
1510 return obj.copy(deep=False)
1512 labels = obj._get_axis(axis)
-> 1513 labels._validate_positional_slice(slice_obj)
1514 return self.obj._slice(slice_obj, axis=axis)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexes/base.py:3321, in Index._validate_positional_slice(self, key)
3315 @final
3316 def _validate_positional_slice(self, key: slice):
3317 """
3318 For positional indexing, a slice must have either int or None
3319 for each of start, stop, and step.
3320 """
-> 3321 self._validate_indexer("positional", key.start, "iloc")
3322 self._validate_indexer("positional", key.stop, "iloc")
3323 self._validate_indexer("positional", key.step, "iloc")
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexes/base.py:5310, in Index._validate_indexer(self, form, key, kind)
5308 pass
5309 else:
-> 5310 raise self._invalid_indexer(form, key)
TypeError: cannot do positional indexing on Index with these indexers [3] of type str
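Once we remove the quotation marks, the slice uses integers and works as intended; it produces the same result as the `loc` version above.

# Columns at integer locations 3, 4, 5, 6 ("distance" through "tolls"); location 7 is excluded
df.iloc[:, 3:7]*10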
If you want columns that are not in a regular pattern (here we even repeat the same column twice), you need to write them all out explicitly in a list, like this.
df[["distance", "pickup", "pickup"]]
| | distance | pickup | pickup |
---|---|---|---|
0 | 1.60 | 2019-03-23 20:21:09 | 2019-03-23 20:21:09 |
1 | 0.79 | 2019-03-04 16:11:55 | 2019-03-04 16:11:55 |
2 | 1.37 | 2019-03-27 17:53:01 | 2019-03-27 17:53:01 |
3 | 7.70 | 2019-03-10 01:23:59 | 2019-03-10 01:23:59 |
4 | 2.16 | 2019-03-30 13:27:42 | 2019-03-30 13:27:42 |
... | ... | ... | ... |
6428 | 0.75 | 2019-03-31 09:51:53 | 2019-03-31 09:51:53 |
6429 | 18.74 | 2019-03-31 17:38:00 | 2019-03-31 17:38:00 |
6430 | 4.14 | 2019-03-23 22:55:18 | 2019-03-23 22:55:18 |
6431 | 1.12 | 2019-03-04 10:09:25 | 2019-03-04 10:09:25 |
6432 | 3.85 | 2019-03-13 19:31:22 | 2019-03-13 19:31:22 |
6341 rows × 3 columns
Remember that `map` is used for a Series and `applymap` is used for a DataFrame. (Otherwise they are very similar.) Here is an example of the error we get if we try to use `map` on a DataFrame.
df.iloc[:, 3:7].map(lambda x: 10*x)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In [13], line 1
----> 1 df.iloc[:, 3:7].map(lambda x: 10*x)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/generic.py:5465, in NDFrame.__getattr__(self, name)
5463 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5464 return self[name]
-> 5465 return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'map'
Here is an example of a pandas Series. The prototypical example of a pandas Series is a column (or row) from a pandas DataFrame.
df["distance"]
0 1.60
1 0.79
2 1.37
3 7.70
4 2.16
...
6428 0.75
6429 18.74
6430 4.14
6431 1.12
6432 3.85
Name: distance, Length: 6341, dtype: float64
If we try to use `applymap` on a pandas Series, we get a similar error to what we got above.
df["distance"].applymap(lambda zzz: 10*zzz)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In [16], line 1
----> 1 df["distance"].applymap(lambda zzz: 10*zzz)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/generic.py:5465, in NDFrame.__getattr__(self, name)
5463 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5464 return self[name]
-> 5465 return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'applymap'
If we use `map` on a pandas Series, on the other hand, it works. Notice that we can use any variable name we want in our lambda function; here we use `zzz` for our variable name.
df["distance"].map(lambda zzz: 10*zzz)
0 16.0
1 7.9
2 13.7
3 77.0
4 21.6
...
6428 7.5
6429 187.4
6430 41.4
6431 11.2
6432 38.5
Name: distance, Length: 6341, dtype: float64
Here is an example of slicing using `iloc`. We get the columns from integer location `3` (inclusive) through integer location `7` (exclusive).
df.iloc[:, 3:7]
distance | fare | tip | tolls | |
---|---|---|---|---|
0 | 1.60 | 7.0 | 2.15 | 0.0 |
1 | 0.79 | 5.0 | 0.00 | 0.0 |
2 | 1.37 | 7.5 | 2.36 | 0.0 |
3 | 7.70 | 27.0 | 6.15 | 0.0 |
4 | 2.16 | 9.0 | 1.10 | 0.0 |
... | ... | ... | ... | ... |
6428 | 0.75 | 4.5 | 1.06 | 0.0 |
6429 | 18.74 | 58.0 | 0.00 | 0.0 |
6430 | 4.14 | 16.0 | 0.00 | 0.0 |
6431 | 1.12 | 6.0 | 0.00 | 0.0 |
6432 | 3.85 | 15.0 | 3.36 | 0.0 |
6341 rows × 4 columns
Here is some leftover code from Wednesday. I don’t plan to require list comprehension on the midterm, but I may ask you to make a list, and list comprehension is often the most convenient way to make a list.
day_list = [pd.to_datetime(f"4-{i}-2023").day_name() for i in range(24, 31)]
day_list
['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
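If the list comprehension syntax is hard to parse, here is the same computation written as an ordinary for loop (a sketch; we did not run this version in lecture).

# Equivalent for-loop version of the list comprehension above
day_list = []
for i in range(24, 31):
    day_list.append(pd.to_datetime(f"4-{i}-2023").day_name())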
Practice with `groupby` and f-strings#
Make a new column named “Day” in the DataFrame, containing the day of the pickup (as a string, like `"Monday"`).
Here is a reminder of what `df` looks like.
df
| | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6428 | 2019-03-31 09:51:53 | 2019-03-31 09:55:27 | 1 | 0.75 | 4.5 | 1.06 | 0.0 | 6.36 | green | credit card | East Harlem North | Central Harlem North | Manhattan | Manhattan |
6429 | 2019-03-31 17:38:00 | 2019-03-31 18:34:23 | 1 | 18.74 | 58.0 | 0.00 | 0.0 | 58.80 | green | credit card | Jamaica | East Concourse/Concourse Village | Queens | Bronx |
6430 | 2019-03-23 22:55:18 | 2019-03-23 23:14:25 | 1 | 4.14 | 16.0 | 0.00 | 0.0 | 17.30 | green | cash | Crown Heights North | Bushwick North | Brooklyn | Brooklyn |
6431 | 2019-03-04 10:09:25 | 2019-03-04 10:14:29 | 1 | 1.12 | 6.0 | 0.00 | 0.0 | 6.80 | green | credit card | East New York | East Flatbush/Remsen Village | Brooklyn | Brooklyn |
6432 | 2019-03-13 19:31:22 | 2019-03-13 19:48:02 | 1 | 3.85 | 15.0 | 3.36 | 0.0 | 20.16 | green | credit card | Boerum Hill | Windsor Terrace | Brooklyn | Brooklyn |
6341 rows × 14 columns
Here we add a new column to the far right side of the DataFrame. Notice that, unlike usual, we did not need to use `pd.to_datetime`.
df["Day"] = df["pickup"].dt.day_name()
df
| | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | Day |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan | Saturday |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan | Monday |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan | Wednesday |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan | Sunday |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan | Saturday |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6428 | 2019-03-31 09:51:53 | 2019-03-31 09:55:27 | 1 | 0.75 | 4.5 | 1.06 | 0.0 | 6.36 | green | credit card | East Harlem North | Central Harlem North | Manhattan | Manhattan | Sunday |
6429 | 2019-03-31 17:38:00 | 2019-03-31 18:34:23 | 1 | 18.74 | 58.0 | 0.00 | 0.0 | 58.80 | green | credit card | Jamaica | East Concourse/Concourse Village | Queens | Bronx | Sunday |
6430 | 2019-03-23 22:55:18 | 2019-03-23 23:14:25 | 1 | 4.14 | 16.0 | 0.00 | 0.0 | 17.30 | green | cash | Crown Heights North | Bushwick North | Brooklyn | Brooklyn | Saturday |
6431 | 2019-03-04 10:09:25 | 2019-03-04 10:14:29 | 1 | 1.12 | 6.0 | 0.00 | 0.0 | 6.80 | green | credit card | East New York | East Flatbush/Remsen Village | Brooklyn | Brooklyn | Monday |
6432 | 2019-03-13 19:31:22 | 2019-03-13 19:48:02 | 1 | 3.85 | 15.0 | 3.36 | 0.0 | 20.16 | green | credit card | Boerum Hill | Windsor Terrace | Brooklyn | Brooklyn | Wednesday |
6341 rows × 15 columns
Why didn’t we have to use `pd.to_datetime`? Evaluate `df.dtypes` and `df.dtypes["pickup"]` to see the reason.
The reason is that when we imported the DataFrame from Seaborn (as opposed to from an attached csv file), Seaborn automatically specified the appropriate data types for these columns.
df.dtypes
pickup datetime64[ns]
dropoff datetime64[ns]
passengers int64
distance float64
fare float64
tip float64
tolls float64
total float64
color object
payment object
pickup_zone object
dropoff_zone object
pickup_borough object
dropoff_borough object
Day object
dtype: object
The above result, `df.dtypes`, is another example of a pandas Series (like a column). Here is an example of getting the value corresponding to the index entry “pickup”. (The value is displayed a little strangely, but it’s indicating that this column has a datetime data type.)
df.dtypes["pickup"]
dtype('<M8[ns]')
If we look at the value corresponding to “distance”, it’s easier to read, because we recognize the term `float` for this numeric column.
df.dtypes["distance"]
dtype('float64')
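For comparison, if the same data had come from an attached csv file, the “pickup” column would typically be loaded with the generic object (string) data type, and we would convert it ourselves. Here is a sketch; the file name taxis.csv is hypothetical.

# Hypothetical example: if the data came from a csv file (the file name is made up),
# the "pickup" column would be loaded as strings, so we convert it explicitly.
df2 = pd.read_csv("taxis.csv")
df2["pickup"] = pd.to_datetime(df2["pickup"])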
What day occurs most often in this dataset?
Here is a nice way to find that. The `value_counts` method automatically sorts the results from most common to least common. The output of `value_counts()` is another example of a pandas Series. The index in this case contains the unique entries that occur, and the values record how often each occurs.
vc = df["Day"].value_counts()
vc
Friday 1091
Saturday 1028
Wednesday 958
Thursday 896
Sunday 857
Tuesday 811
Monday 700
Name: Day, dtype: int64
Here we check the index explicitly, using the `index` attribute.
vc.index
Index(['Friday', 'Saturday', 'Wednesday', 'Thursday', 'Sunday', 'Tuesday',
'Monday'],
dtype='object')
If instead we want the values, we can use the `values` attribute.
vc.values
array([1091, 1028, 958, 896, 857, 811, 700])
If you don’t care about the frequency of the entries, you can instead use the `unique` method.
df["Day"].unique()
array(['Saturday', 'Monday', 'Wednesday', 'Sunday', 'Tuesday', 'Friday',
'Thursday'], dtype=object)
If we want to know the most frequent entry, we can get the initial element of the `value_counts` index.
vc.index[0]
'Friday'
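As an aside (don’t worry about this for the midterm), a pandas Series also has an `idxmax` method that reports the index label of the largest value; the following sketch should give the same answer.

# idxmax returns the index label where the Series takes its maximum value
vc.idxmax()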
Can you answer the same thing using Altair?
By default, Altair does not allow us to use DataFrames with more than 5000 rows. This is an example of the error message you’ll see if you try.
alt.Chart(df).mark_bar().encode(
x="Day",
y="count()"
)
---------------------------------------------------------------------------
MaxRowsError Traceback (most recent call last)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:2020, in Chart.to_dict(self, *args, **kwargs)
2018 copy.data = core.InlineData(values=[{}])
2019 return super(Chart, copy).to_dict(*args, **kwargs)
-> 2020 return super().to_dict(*args, **kwargs)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:374, in TopLevelMixin.to_dict(self, *args, **kwargs)
372 copy = self.copy(deep=False)
373 original_data = getattr(copy, "data", Undefined)
--> 374 copy.data = _prepare_data(original_data, context)
376 if original_data is not Undefined:
377 context["data"] = original_data
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:89, in _prepare_data(data, context)
87 # convert dataframes or objects with __geo_interface__ to dict
88 if isinstance(data, pd.DataFrame) or hasattr(data, "__geo_interface__"):
---> 89 data = _pipe(data, data_transformers.get())
91 # convert string input to a URLData
92 if isinstance(data, str):
File /shared-libs/python3.9/py/lib/python3.9/site-packages/toolz/functoolz.py:628, in pipe(data, *funcs)
608 """ Pipe a value through a sequence of functions
609
610 I.e. ``pipe(data, f, g, h)`` is equivalent to ``h(g(f(data)))``
(...)
625 thread_last
626 """
627 for func in funcs:
--> 628 data = func(data)
629 return data
File /shared-libs/python3.9/py/lib/python3.9/site-packages/toolz/functoolz.py:304, in curry.__call__(self, *args, **kwargs)
302 def __call__(self, *args, **kwargs):
303 try:
--> 304 return self._partial(*args, **kwargs)
305 except TypeError as exc:
306 if self._should_curry(args, kwargs, exc):
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/data.py:19, in default_data_transformer(data, max_rows)
17 @curried.curry
18 def default_data_transformer(data, max_rows=5000):
---> 19 return curried.pipe(data, limit_rows(max_rows=max_rows), to_values)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/toolz/functoolz.py:628, in pipe(data, *funcs)
608 """ Pipe a value through a sequence of functions
609
610 I.e. ``pipe(data, f, g, h)`` is equivalent to ``h(g(f(data)))``
(...)
625 thread_last
626 """
627 for func in funcs:
--> 628 data = func(data)
629 return data
File /shared-libs/python3.9/py/lib/python3.9/site-packages/toolz/functoolz.py:304, in curry.__call__(self, *args, **kwargs)
302 def __call__(self, *args, **kwargs):
303 try:
--> 304 return self._partial(*args, **kwargs)
305 except TypeError as exc:
306 if self._should_curry(args, kwargs, exc):
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/utils/data.py:80, in limit_rows(data, max_rows)
78 return data
79 if max_rows is not None and len(values) > max_rows:
---> 80 raise MaxRowsError(
81 "The number of rows in your dataset is greater "
82 "than the maximum allowed ({}). "
83 "For information on how to plot larger datasets "
84 "in Altair, see the documentation".format(max_rows)
85 )
86 return data
MaxRowsError: The number of rows in your dataset is greater than the maximum allowed (5000). For information on how to plot larger datasets in Altair, see the documentation
alt.Chart(...)
Here we tell Altair to allow up to 7,000 rows. Warning: Don’t try to do this with huge datasets. I wouldn’t try it with more than about 20,000 rows at most.
Don’t worry about learning this `data_transformers` syntax for the midterms or quizzes.
# Our dataset has about 6500 rows. Be careful allowing too many rows.
alt.data_transformers.enable('default', max_rows=7000)
DataTransformerRegistry.enable('default')
Here we put the days of the week along the x-axis, and we use the special Altair syntax `"count()"` to make the height of each bar equal to the number of entries for that day. Notice how, like we saw above, Friday seems to be the most common.
It’s a little annoying that the days are alphabetized, rather than given in consecutive order. Let’s not worry about fixing that for now, but it’s definitely possible; see the sketch after the chart.
alt.Chart(df).mark_bar().encode(
x="Day",
y="count()"
)
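As an aside, here is a sketch of one way to fix the day ordering, passing an explicit `sort` list to `alt.X`; this syntax is not something you need for the midterm.

# Sketch: specify the x-axis ordering explicitly with alt.X's sort argument
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
alt.Chart(df).mark_bar().encode(
    x=alt.X("Day", sort=days),
    y="count()"
)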
What happens if we don’t specify the y-channel? Then the bars are all made with a default height.
alt.Chart(df).mark_bar().encode(
x="Day",
)
Instead of `"count()"`, we could use another function like `"median"`, but in that case, we need to specify which value we are taking the median of. Here we take the median of the “tip” column.
Using the tooltip, we can see that the median value of the “tip” column on Tuesdays is `1.76`.
alt.Chart(df).mark_bar().encode(
x="Day",
y="median(tip)",
tooltip=["median(tip)"]
)
What is the median tip amount for Tuesday?
Here we get the sub-DataFrame.
df_sub = df[df["Day"] == "Tuesday"]
Notice how the rows all seem to be from Tuesday.
df_sub.head()
| | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | Day |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 2019-03-26 21:07:31 | 2019-03-26 21:17:29 | 1 | 3.65 | 13.0 | 2.00 | 0.00 | 18.80 | yellow | credit card | Battery Park City | Two Bridges/Seward Park | Manhattan | Manhattan | Tuesday |
13 | 2019-03-19 07:55:25 | 2019-03-19 08:09:17 | 1 | 1.75 | 10.5 | 0.00 | 0.00 | 13.80 | yellow | cash | Lincoln Square West | Times Sq/Theatre District | Manhattan | Manhattan | Tuesday |
56 | 2019-03-05 17:57:00 | 2019-03-05 18:45:25 | 2 | 19.59 | 52.0 | 0.00 | 5.76 | 65.56 | yellow | credit card | JFK Airport | Upper East Side North | Queens | Manhattan | Tuesday |
64 | 2019-03-26 10:14:55 | 2019-03-26 10:26:58 | 6 | 1.38 | 9.0 | 3.08 | 0.00 | 15.38 | yellow | credit card | Midtown South | Midtown Center | Manhattan | Manhattan | Tuesday |
72 | 2019-03-12 15:44:53 | 2019-03-12 16:02:05 | 1 | 1.98 | 11.5 | 2.96 | 0.00 | 17.76 | yellow | credit card | Yorkville West | Upper West Side North | Manhattan | Manhattan | Tuesday |
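To check every row, not just the first five, we could verify it directly; here is a quick sketch.

# True exactly when every entry in the "Day" column of df_sub equals "Tuesday"
(df_sub["Day"] == "Tuesday").all()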
Here we compute the `median` of the tip column. (Notice how above we only saw 5 rows, but that was because we called the `head()` method. In the following cell, we aren’t calling the `head()` method, so our DataFrame has many more rows.)
We get the same value here, but it’s also possible for pandas and Altair to follow different conventions. For example (just hypothetically), maybe Altair would count missing values as 0 while pandas would ignore them. In this case, though, we get the same output.
df_sub["tip"].median()
1.76
For each day of the week, what was the median tip for that day? Put the results into a pandas Series using `groupby`.
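We didn’t get to this in lecture, but here is a sketch of one possible approach.

# Group the rows by "Day", then take the median of the "tip" column within each group.
# The result is a pandas Series whose index is the day names.
df.groupby("Day")["tip"].median()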
Can you answer the same thing using Altair?
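One possible answer (a sketch): the `median(tip)` bar chart we made above already shows this; hovering over each bar with the tooltip reports that day’s median tip.

# Same idea as the chart above; the tooltip shows the per-day median of "tip"
alt.Chart(df).mark_bar().encode(
    x="Day",
    y="median(tip)",
    tooltip=["Day", "median(tip)"]
)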
For each day of the week, print out a string stating the median tip amount for that day. Use `:.2f` in the string formatting so that only two decimal places get printed.
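Here is a sketch of one possible approach, combining the `groupby` computation with an f-string; we did not work this out in lecture.

# Loop over the (day, median) pairs from the groupby result and print a formatted string.
# The :.2f format specifier displays exactly two decimal places.
for day, med in df.groupby("Day")["tip"].median().items():
    print(f"The median tip on {day} was {med:.2f} dollars.")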