Worksheet 7#
You are encouraged to work in groups of up to 3 total students, but each student should make their own submission on Canvas. (It’s fine for everyone in the group to have the same upload.)
Preparing the data#
Import the attached Spotify dataset as
df
using the pandas functionread_csv
.
Look at the first few rows in the DataFrame. Which columns look like they should be numeric?
Evaluate
df.dtypes
, and notice that only a few of these columns are actually numeric.
The problem is that missing values are specified in an unusual way in this dataset, and pandas does not recognize them as missing.
Evaluate the function
pd.to_numeric
on the “Energy” column indf
. Look at the error message. It is telling you what the problematic values are.
(There is a better way to do this, using df.replace
, but the approach suggested here gives some practice with applymap
, with lambda functions, and with the Python operator ??? if ??? else ???
.)
Using the following template, replace these problematic values with the Not a Number value defined in NumPy. If the input
x
is not missing, it should be left unchanged.
df2 = df.applymap(lambda x: ??? if x == ??? else ???)
Define
first_col = "Popularity"
andlast_col = "Valence"
.
Using
apply
and a suitableaxis
argument, apply thepd.to_numeric
function to each column fromfirst_col
tolast_col
.
(Comment 0. Use df2.loc[???] = df2.loc[???].apply(???, axis=???)
so that the values in df2
get changed. Comment 1. Slicing with loc
is one of the few examples in Python where the right endpoint is included. Comment 2. Neither the column labels nor the row labels are getting changed in this case, so my explanation for whether to use axis=0
or axis=1
does not quite apply. But imagine we were applying sum
instead of pd.to_numeric
, that would collapse the whole column down to a single number, and so the row labels would be changing.)
Import the
is_numeric_dtype
andis_string_dtype
functions frompandas.api.types
. (You can do these both on the same line, separating the function names with commas.)
Using list comprehension and the function
is_numeric_dtype
, make a list of all the column names indf2
which have a numeric dtype now. Name this listnum_list
. (This listnum_list
should be a list of strings, not a list of pandas Series.)
Check your answer:
num_list
should have length13
and thetype
ofnum_list[3]
should bestr
.
Drop the rows which contain missing values, using the code
df3 = df2.dropna(axis=???)
.
Check your answer. The DataFrame
df3
should have1545
rows and23
columns.
Plotting the data#
Write a function
make_chart
which takes as input the name of a columnc
indf3
, and as output returns an Altair scatter plot chart using"Energy"
for the x-axis, using columnc
for the y-axis, and using"Danceability"
for color.
Adjust the color scale to something more colorful in your definition of
make_chart
. You can either use the old syntaxscale=alt.Scale(scheme=???)
or the new syntax.scale(scheme=???)
, but if you want to use the new syntax, make sure you pip install the most recent version of Altair before importing Altair (you can always restart the notebook if you need to). Options for color scheme
Using list comprehension and the list
num_list
you made above, make a new listchart_list
which containsmake_chart(c)
for everyc
innum_list
.
Use the following code to make a new Altair chart, which has all of the previous charts vertically concatenated together.
(Comment. This *
before chart_list
is performing what is called “list unpacking”. Rather than giving alt.vconcat
a single input value, a list, we are giving it 13 separate input values. In other words, alt.vconcat(*chart_list)
is the same as alt.vconcat(chart_list[0], chart_list[1], ..., chart_list[12])
.)
total_chart = alt.vconcat(*chart_list)
Display this
vconcat
chart (just evaluatetotal_chart
in its own cell) and make sure it looks reasonable. You should see 13 distinct Altair charts, arranged vertically.
Sample quiz or midterm questions. (You don’t need to answer these, but they are good practice questions.)
Why does one of the charts show just a diagonal line?
Why do the colors in one of the charts look so much more regular than in the others?
If you wanted to know which song was represented at the very top-left of the “Speechiness” chart, how would you find it?
What is a major difference between the chart produced above and a facet chart?
Submission#
Save the chart as a json file using the following code and upload that json file to Canvas.
with open("total_chart_file.json", "w") as f:
f.write(total_chart.to_json())
If you want to check that the json file is correct, open it locally on your computer (it is too big to open on Deepnote) and copy-paste the contents into the Vega Editor then click
Run
at the top.
Reminder#
Every group member needs to submit this on Canvas (even if you all submit the same file).