Worksheet 8

Worksheet 8#

You are encouraged to work in groups of up to 3 total students, but each student should make their own submission on Canvas. (It’s fine for everyone in the group to have the same upload.)

Overview#

This project includes a csv file unemployment.csv which includes US unemployment rates for various industries between 2000 and 2010.

Load that csv file using pd.read_csv and save the resulting DataFrame as df.

The background question for this homework is,

How does the time of year affect unemployment in different industries?

Cleaning the data#

Evaluate df.dtypes. Notice that this “date” column is not being recognized as containing datetime values.

Convert this column to datetime format using pd.to_datetime.

Evaluate df.dtypes and df.dtypes["date"] to make sure the change has actually occurred within df.

Less good strategy, just for practice with map and lambda functions.

Using the pandas Series method map and a lambda function, make a new column "month0" which contains the numerical month for each date. (For example, 4 if the date is in April.)

Side question: why don’t you need to use the dt accessor here?

Better strategy.

Make a new column "month" in the DataFrame which contains the numerical month for each date. Use the dt accessor but not map.

Verify that the “month” and “month0” columns contain the same values. First create a Boolean Series, then call the all method.

Evaluate the following. What is it telling us about the presence of missing values in df?

df.isna().any(axis=1).any()

Normalizing the data#

Make a pandas Series mean_ser containing the average unemployment rate for each industry, using the following code.

mean_ser = df.groupby("industry")["rate"].mean()

Make the analogous pandas Series for standard deviation, and name it std_ser.

Write a function make_norm which takes as input a row of df (not a row label but the whole row as a pandas Series) and as output returns the normalized unemployment rate, where by “normalized”, mean that you should subtract the mean for that industry and divide by the standard deviation for that industry.

For example, if rate is 7, the mean for the industry is 1.2 and the standard deviation for the industry is 4.3, then the function should return (7 - 1.2)/4.3.

Using apply, the above function make_norm, and a suitable axis argument, for each row in df, normalize the unemployment rate (so that the mean becomes 1 and the standard deviation becomes 1). We need to use apply and not map here, because we need to know the industry. Put the result in a new column in df called "norm_rate".

Using groupby, check that the means for the various industries of this new "norm_rate" column are all very close to 0, and the standard deviations are all close to 1.

Plotting the data#

Make an Altair chart of this data, using mark_line, using “date” for the x-channel, using “norm_rate” for the y-channel, and using “industry” for the color.

Side question: can you recognize the impact of the 2008 financial crisis?

Make the same Altair chart, but change from “date” to “month” for the x-channel.

(It will look like a mess because each industry has the same month repeated many times, corresponding to different years.)

Make the same chart as in the previous cell, but change from y="norm_rate" to y="mean(norm_rate)". This will replace the many-points-per-industry with a single point per industry. Store the chart with the variable name c, and display this chart.

Notice that “Government” appears to have the highest average normalized unemployment in July. Using Boolean indexing and mean, compute this average directly using pandas, and make sure it matches the value you see in Altair. (It should be approximately 1.3.)

Submission#

Save the chart as a json file using the following code and upload that json file to Canvas.

with open("chart.json", "w") as f:
    f.write(c.to_json())