Worksheet 7

Worksheet 7#

You are encouraged to work in groups of up to 3 total students, but each student should make their own submission on Canvas. (It’s fine for everyone in the group to have the same upload.)

Preparing the data#

Import the attached Spotify dataset as df using the pandas function read_csv.

Look at the first few rows in the DataFrame. Which columns look like they should be numeric?

Evaluate df.dtypes, and notice that only a few of these columns are actually numeric.

The problem is that missing values are specified in an unusual way in this dataset, and pandas does not recognize them as missing.

Evaluate the function pd.to_numeric on the “Energy” column in df. Look at the error message. It is telling you what the problematic values are.

(There is a better way to do this, using df.replace, but the approach suggested here gives some practice with applymap, with lambda functions, and with the Python operator ??? if ??? else ???.)

Using the following template, replace these problematic values with the Not a Number value defined in NumPy. If the input x is not missing, it should be left unchanged.

df2 = df.applymap(lambda x: ??? if x == ??? else ???)

Define first_col = "Popularity" and last_col = "Valence".

Using apply and a suitable axis argument, apply the pd.to_numeric function to each column from first_col to last_col.

(Comment 0. Use df2.loc[???] = df2.loc[???].apply(???, axis=???) so that the values in df2 get changed. Comment 1. Slicing with loc is one of the few examples in Python where the right endpoint is included. Comment 2. Neither the column labels nor the row labels are getting changed in this case, so my explanation for whether to use axis=0 or axis=1 does not quite apply. But imagine we were applying sum instead of pd.to_numeric, that would collapse the whole column down to a single number, and so the row labels would be changing.)

Import the is_numeric_dtype and is_string_dtype functions from pandas.api.types. (You can do these both on the same line, separating the function names with commas.)

Using list comprehension and the function is_numeric_dtype, make a list of all the column names in df2 which have a numeric dtype now. Name this list num_list. (This list num_list should be a list of strings, not a list of pandas Series.)

Check your answer: num_list should have length 13 and the type of num_list[3] should be str.

Drop the rows which contain missing values, using the code df3 = df2.dropna(axis=???).

Check your answer. The DataFrame df3 should have 1545 rows and 23 columns.

Plotting the data#

Write a function make_chart which takes as input the name of a column c in df3, and as output returns an Altair scatter plot chart using "Energy" for the x-axis, using column c for the y-axis, and using "Danceability" for color.

Adjust the color scale to something more colorful in your definition of make_chart. You can either use the old syntax scale=alt.Scale(scheme=???) or the new syntax .scale(scheme=???), but if you want to use the new syntax, make sure you pip install the most recent version of Altair before importing Altair (you can always restart the notebook if you need to). Options for color scheme

Using list comprehension and the list num_list you made above, make a new list chart_list which contains make_chart(c) for every c in num_list.

Use the following code to make a new Altair chart, which has all of the previous charts vertically concatenated together.

(Comment. This * before chart_list is performing what is called “list unpacking”. Rather than giving alt.vconcat a single input value, a list, we are giving it 13 separate input values. In other words, alt.vconcat(*chart_list) is the same as alt.vconcat(chart_list[0], chart_list[1], ..., chart_list[12]).)

total_chart = alt.vconcat(*chart_list)

Display this vconcat chart (just evaluate total_chart in its own cell) and make sure it looks reasonable. You should see 13 distinct Altair charts, arranged vertically.

Sample quiz or midterm questions. (You don’t need to answer these, but they are good practice questions.)

Why does one of the charts show just a diagonal line?
Why do the colors in one of the charts look so much more regular than in the others?
If you wanted to know which song was represented at the very top-left of the “Speechiness” chart, how would you find it?
What is a major difference between the chart produced above and a facet chart?

Submission#

Save the chart as a json file using the following code and upload that json file to Canvas.

with open("total_chart_file.json", "w") as f:
    f.write(total_chart.to_json())

If you want to check that the json file is correct, open it locally on your computer (it is too big to open on Deepnote) and copy-paste the contents into the Vega Editor then click Run at the top.

Reminder#

Every group member needs to submit this on Canvas (even if you all submit the same file).