Worksheet
You may submit this worksheet in a group of 1-3 total students.
These worksheets are graded for effort, not for correctness.
Due date. Due at 5:00pm on Tuesday of Week 3.
Question 0:
Name(s):
UCI ID(s):
# If this doesn't work, be sure Altair is installed.
# Ask for help if you can't get it to work.
import pandas as pd
import altair as alt
Setup
Question 1:
The following file path will not work for you. Download the udemy.csv file from the Files->Datasets folder on Canvas, and replace the file path below with a path that works on your computer.
df = pd.read_csv("../data/udemy.csv")
## Your code here
Question 2:
By default, Altair only allows datasets with 5000 or fewer rows. Replace df with a smaller dataframe that has at least 1000 and at most 5000 rows. Use .copy() after your code to make sure you have an actual dataframe, and not a “view” of a dataframe.
The original source of the dataset is Kaggle. You can go there to see a description of the columns.
For now, you could just take the first 5000 rows, but eventually, you might want to select rows using a more meaningful condition.
## Your code here
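As a rough sketch of the two approaches mentioned above (first rows versus a condition), here is the idea on a made-up DataFrame rather than the Udemy data. The name `big`, the 12,000-row size, and the `price` values are illustrative assumptions, not properties of the real file.

```python
import pandas as pd

# Hypothetical stand-in for a dataset that is too large for Altair's default limit.
big = pd.DataFrame({
    "id": range(12000),
    "price": [i % 200 for i in range(12000)],  # made-up values between 0 and 199
})

# Option 1: keep the first 5000 rows; .copy() returns an independent DataFrame.
first_rows = big.iloc[:5000].copy()

# Option 2: keep only the rows satisfying a condition (Boolean indexing).
cheap = big[big["price"] < 50].copy()

print(len(first_rows), len(cheap))
```

Because of the `.copy()`, later changes to `first_rows` or `cheap` do not touch `big`, and pandas will not warn about assigning to a slice.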
Plots
Here is an example of making a (boring) chart in Altair. We use the “id” column in the dataframe for both the x and y axes. We specify that the points should be joined by a line, and we specify a width and a height of the chart (in pixels).
alt.Chart(df).mark_line().encode(
    x = 'id',
    y = 'id'
).properties(
    width = 800,
    height = 400
)
---------------------------------------------------------------------------
MaxRowsError Traceback (most recent call last)
~/opt/anaconda3/envs/book/lib/python3.8/site-packages/altair/vegalite/v4/api.py in to_dict(self, *args, **kwargs)
361 copy = self.copy(deep=False)
362 original_data = getattr(copy, "data", Undefined)
--> 363 copy.data = _prepare_data(original_data, context)
364
365 if original_data is not Undefined:
~/opt/anaconda3/envs/book/lib/python3.8/site-packages/altair/vegalite/v4/api.py in _prepare_data(data, context)
82 # convert dataframes or objects with __geo_interface__ to dict
83 if isinstance(data, pd.DataFrame) or hasattr(data, "__geo_interface__"):
---> 84 data = _pipe(data, data_transformers.get())
85
86 # convert string input to a URLData
~/opt/anaconda3/envs/book/lib/python3.8/site-packages/toolz/functoolz.py in pipe(data, *funcs)
625 """
626 for func in funcs:
--> 627 data = func(data)
628 return data
629
~/opt/anaconda3/envs/book/lib/python3.8/site-packages/toolz/functoolz.py in __call__(self, *args, **kwargs)
301 def __call__(self, *args, **kwargs):
302 try:
--> 303 return self._partial(*args, **kwargs)
304 except TypeError as exc:
305 if self._should_curry(args, kwargs, exc):
~/opt/anaconda3/envs/book/lib/python3.8/site-packages/altair/vegalite/data.py in default_data_transformer(data, max_rows)
17 @curried.curry
18 def default_data_transformer(data, max_rows=5000):
---> 19 return curried.pipe(data, limit_rows(max_rows=max_rows), to_values)
20
21
~/opt/anaconda3/envs/book/lib/python3.8/site-packages/toolz/functoolz.py in pipe(data, *funcs)
625 """
626 for func in funcs:
--> 627 data = func(data)
628 return data
629
~/opt/anaconda3/envs/book/lib/python3.8/site-packages/toolz/functoolz.py in __call__(self, *args, **kwargs)
301 def __call__(self, *args, **kwargs):
302 try:
--> 303 return self._partial(*args, **kwargs)
304 except TypeError as exc:
305 if self._should_curry(args, kwargs, exc):
~/opt/anaconda3/envs/book/lib/python3.8/site-packages/altair/utils/data.py in limit_rows(data, max_rows)
78 return data
79 if max_rows is not None and len(values) > max_rows:
---> 80 raise MaxRowsError(
81 "The number of rows in your dataset is greater "
82 "than the maximum allowed ({}). "
MaxRowsError: The number of rows in your dataset is greater than the maximum allowed (5000). For information on how to plot larger datasets in Altair, see the documentation
alt.Chart(...)
Question 3:
Make the following changes, to produce a more interesting chart.
Choose new DataFrame columns, not the id column.
Change from mark_line to something else. Options: Marks
Add another encoding channel. The most obvious choice is color, but other options include opacity and size. You can see a list of channels here: Documentation.
Here is a list of color schemes: Vega color schemes and an example of how to use a color scheme in Altair.
If you find browsing examples easier than reading documentation, check out the Altair example gallery.
## Your code here
Question 4:
Add a tooltip to your chart, indicating some relevant information about the course, like its title.
## Your code here
Next steps
If you finish the above before class is over, here are some things you can work on. None of these are required for submitting the worksheet, but they will definitely help you with this week’s homework.
Read the homework assignment for this week. Notice it involves this same Udemy dataset. You can start working on it.
Find an interesting way to choose rows from the dataset, rather than just choosing the first 5000.
Practice with list comprehension. For example, if my_series is a column in a pandas DataFrame, you can use syntax like [x for x in my_series if ...] to get all the entries satisfying some condition. If you do this, you should also ask yourself if there was a way to do the same thing in pandas directly. For example, [x for x in my_series if x > 10] is probably much less efficient than my_series[my_series > 10]. The only real difference between these is that one produces a list and the other produces a pandas Series.
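The comparison above can be tried on a small made-up Series (the values here are arbitrary):

```python
import pandas as pd

# Arbitrary example values.
my_series = pd.Series([3, 12, 7, 25, 10])

as_list = [x for x in my_series if x > 10]  # a plain Python list
as_series = my_series[my_series > 10]       # a pandas Series

print(as_list)
print(as_series)
```

Both keep the same values; the Series version also keeps the original index labels of the selected entries.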
Practice with lambda functions. For example, if my_series is a column in a pandas DataFrame, you can apply some simple function to each entry in my_series by using my_series.map(lambda x: ...). If you do this, you should also ask yourself if there was a way to do the same thing more simply. For example, my_series.map(lambda x: x+10) is equivalent to my_series + 10.
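A quick check of that equivalence on a small made-up Series:

```python
import pandas as pd

# Arbitrary example values.
my_series = pd.Series([1, 2, 3])

mapped = my_series.map(lambda x: x + 10)  # applies the function entry by entry
vectorized = my_series + 10               # same result using pandas arithmetic

print(mapped.tolist() == vectorized.tolist())
```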
Solve one of the following questions to get more practice with Altair.
Question 5:
Optional
Use pd.to_datetime to convert the published_time column into a datetime format that pandas (and Altair) recognize. Make a chart in Altair using the new time column for the x-axis. You might (or might not) need to specify :T after the column name so that Altair knows this is a temporal (date/time) field, like this: x = 'colname:T'.
## Your code here
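Here is a minimal sketch of the conversion step only, using invented timestamp strings. The format of the real published_time column may differ, and the names `df_demo` and `published_dt` are hypothetical.

```python
import pandas as pd

# Invented timestamps; the real column's format is an assumption here.
df_demo = pd.DataFrame({
    "published_time": ["2019-03-01T12:00:00Z", "2020-07-15T08:30:00Z"],
    "price": [20, 35],
})

# Convert the strings to a datetime column that pandas understands.
df_demo["published_dt"] = pd.to_datetime(df_demo["published_time"])

print(df_demo["published_dt"].dtype)
print(df_demo["published_dt"].dt.year.tolist())
```

Once the column has a datetime dtype, it can be used for the x-axis in Altair, for example as `x = 'published_dt:T'`.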
Question 6:
Optional
Add a brush = alt.selection_interval() to your chart, taking inspiration from this interactive average example.
## Your code here