Worksheet

You may submit this worksheet in a group of 1-3 total students.

These worksheets are graded for effort, not for correctness.

Due date: 5:00pm on Tuesday of Week 3.

Question 0:

Name(s):

UCI ID(s):

# If this doesn't work, be sure Altair is installed.
# Ask for help if you can't get it to work.

import pandas as pd
import altair as alt

Setup

Question 1:

The following file path will not work for you. Download the udemy.csv file from the Files->Datasets folder on Canvas, and replace the file path below with a path that works on your computer.

df = pd.read_csv("../data/udemy.csv")
## Your code here

Question 2:

By default, Altair only allows datasets with 5000 or fewer rows. Replace df with a smaller DataFrame that has at least 1000 and at most 5000 rows. Add .copy() at the end of your expression so that you get an actual DataFrame, and not a “view” of the original DataFrame.

The original source of the dataset is Kaggle. You can go there to see a description of the columns.

For now, you could just take the first 5000 rows, but eventually, you might want to select rows using a more meaningful condition.

## Your code here
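
As a rough illustration of the two approaches mentioned above (taking the first 5000 rows, or filtering by a condition), here is a minimal sketch on made-up data. The DataFrame toy and its rating column are hypothetical stand-ins, not the real Udemy columns.

```python
import pandas as pd

# Toy stand-in for the Udemy data: "id" appears in the worksheet,
# "rating" is a hypothetical column invented for this sketch.
toy = pd.DataFrame({
    "id": range(12000),
    "rating": [(i % 50) / 10 for i in range(12000)],
})

# Option 1: keep the first 5000 rows.
first_rows = toy.iloc[:5000].copy()

# Option 2: keep the rows satisfying a condition (boolean indexing).
high_rated = toy[toy["rating"] >= 4.0].copy()

print(len(first_rows), len(high_rated))
```

Because of .copy(), later changes to first_rows or high_rated do not affect toy. Whichever condition you choose, check with len that the result has between 1000 and 5000 rows.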

Plots

Here is an example of making a (boring) chart in Altair. We use the “id” column in the dataframe for both the x and y axes. We specify that the points should be joined by a line, and we specify a width and a height of the chart (in pixels).

alt.Chart(df).mark_line().encode(
    x = 'id',
    y = 'id'
).properties(
    width = 800,
    height = 400
)

Until df has been cut down to at most 5000 rows (Question 2), running this cell raises the following error (traceback shortened to its final line):

MaxRowsError: The number of rows in your dataset is greater than the maximum allowed (5000). For information on how to plot larger datasets in Altair, see the documentation

Question 3:

Make the following changes, to produce a more interesting chart.

  • Choose new DataFrame columns, not the id column.

  • Change from mark_line to something else. The options are listed under Marks in the Altair documentation.

  • Add another encoding channel. The most obvious choice is color, but other options include opacity and size. You can see the full list of channels in the Altair documentation.

  • Here is a list of color schemes: Vega color schemes and an example of how to use a color scheme in Altair.

  • If you find browsing examples easier than reading documentation, check out the Altair example gallery.

## Your code here

Question 4:

Add a tooltip to your chart, indicating some relevant information about the course, like its title.
## Your code here

Next steps

If you finish the above before class is over, here are some things you can work on. None of these are required for submitting the worksheet, but they will definitely help you with this week’s homework.

  • Read the homework assignment for this week. Notice it involves this same Udemy dataset. You can start working on it.

  • Find an interesting way to choose rows from the dataset, rather than just choosing the first 5000.

  • Practice with list comprehensions. For example, if my_series is a column in a pandas DataFrame, you can use syntax like [x for x in my_series if ...] to get all the entries satisfying some condition. If you do this, you should also ask yourself whether there is a way to do the same thing in pandas directly. For example,

[x for x in my_series if x > 10]

is probably much less efficient than

my_series[my_series > 10]

Apart from speed, the only real difference between these is that the first produces a list and the second produces a pandas Series, which keeps the original index labels.

  • Practice with lambda functions. For example, if my_series is a column in a pandas DataFrame, you can apply some simple function to each entry in my_series by using my_series.map(lambda x: ...). If you do this, you should also ask yourself whether there is a way to do the same thing more simply. For example,

my_series.map(lambda x: x+10)

is equivalent to

my_series + 10

  • Solve one of the following questions to get more practice with Altair.
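
The two comparisons in the list above (a list comprehension versus boolean indexing, and map with a lambda versus plain arithmetic) can be checked on a small Series. This is a minimal sketch; the values in my_series are made up.

```python
import pandas as pd

my_series = pd.Series([4, 12, 7, 25, 10])

# Filtering: the list comprehension returns a list, boolean indexing returns a Series.
as_list = [x for x in my_series if x > 10]
as_series = my_series[my_series > 10]

# Elementwise arithmetic: map with a lambda versus the vectorized form.
mapped = my_series.map(lambda x: x + 10)
added = my_series + 10

print(type(as_list), type(as_series))
print(mapped.equals(added))
```

The filtered Series keeps the index labels of the entries that survived (here 1 and 3), which the list does not.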

Question 5:

Optional

Use pd.to_datetime to convert the published_time column into a datetime format that pandas (and Altair) recognize. Make a chart in Altair using the new time column for the x-axis. You might (or might not) need to add :T after the column name so that Altair treats the column as temporal data, like this: x = 'colname:T'.

## Your code here
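
The conversion step could look roughly like the sketch below. The timestamp strings are made up and their format is an assumption; the real published_time column may be formatted differently. The new column name time is also just an illustrative choice.

```python
import pandas as pd

# Toy data; the string format of these timestamps is assumed, not taken from udemy.csv.
toy = pd.DataFrame({
    "published_time": [
        "2019-03-01T08:00:00Z",
        "2020-07-15T12:30:00Z",
        "2021-11-30T23:59:59Z",
    ]
})

# Store the parsed values in a new column so the original strings stay available.
toy["time"] = pd.to_datetime(toy["published_time"])

# Once the column has a datetime dtype, the .dt accessor becomes available.
years = toy["time"].dt.year
print(toy.dtypes)
```

In the Altair chart, the new column would then be used as x = 'time:T'.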

Question 6:

Optional

Define brush = alt.selection_interval() and add it to your chart, taking inspiration from this interactive average example.

## Your code here