Worksheet 8
You are encouraged to work in groups of up to 3 total students, but each student should make their own submission on Canvas. (It’s fine for everyone in the group to have the same upload.)
Overview
This project includes a csv file, unemployment.csv, which contains US unemployment rates for various industries between 2000 and 2010.
Load that csv file using pd.read_csv and save the resulting DataFrame as df.
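As a rough illustration of the loading step, here is a minimal sketch that reads a small in-memory stand-in for the csv file. The three rows and their values are made up for illustration, and the column names "industry", "date", and "rate" are assumed from how they are used later in this worksheet; in your own notebook you would pass the path "unemployment.csv" instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for unemployment.csv (made-up values, assumed columns).
csv_text = """industry,date,rate
Government,2000-01-01,2.1
Government,2000-02-01,2.0
Construction,2000-01-01,9.7
"""

# pd.read_csv accepts a file path or a file-like object.
# In the worksheet you would write pd.read_csv("unemployment.csv").
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)
print(df.head())
```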
The background question for this homework is,
How does the time of year affect unemployment in different industries?
Cleaning the data
Evaluate df.dtypes. Notice that the “date” column is not being recognized as containing datetime values.
Convert this column to datetime format using pd.to_datetime.
Evaluate df.dtypes and df.dtypes["date"] to make sure the change has actually occurred within df.
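The conversion can be sketched on a toy DataFrame as follows. The date strings here are made up, and the real file may use a different date format. The key point is that pd.to_datetime returns a new Series, so the result has to be assigned back to the column for df itself to change.

```python
import pandas as pd

# Toy data (made-up values); the dates start out as plain strings.
df = pd.DataFrame({"date": ["2000-01-01", "2000-04-01"], "rate": [2.1, 1.9]})
print(df.dtypes["date"])  # not a datetime dtype yet

# pd.to_datetime does not modify df in place, so assign the result back.
df["date"] = pd.to_datetime(df["date"])
print(df.dtypes["date"])  # now a datetime64 dtype

is_datetime = pd.api.types.is_datetime64_any_dtype(df["date"])
print(is_datetime)
```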
Less good strategy, just for practice with map and lambda functions.
Using the pandas Series method map and a lambda function, make a new column "month0" which contains the numerical month for each date. (For example, 4 if the date is in April.)
Side question: why don’t you need to use the dt accessor here?
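Here is one possible sketch of the map approach on made-up dates. The function passed to map is called once per entry, and each entry of a datetime column is a single pandas Timestamp object, which has its own month attribute.

```python
import pandas as pd

# Toy data (made-up dates), already converted to datetime values.
df = pd.DataFrame(
    {"date": pd.to_datetime(["2000-01-01", "2000-04-01", "2009-12-01"])}
)

# map calls the lambda on one Timestamp at a time.
df["month0"] = df["date"].map(lambda ts: ts.month)

print(df["month0"].tolist())  # [1, 4, 12]
```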
Better strategy.
Make a new column "month" in the DataFrame which contains the numerical month for each date. Use the dt accessor but not map.
Verify that the “month” and “month0” columns contain the same values. First create a Boolean Series, then call the all method.
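The two strategies and the comparison can be sketched together on toy data (made-up dates). The variable name same is only an illustrative choice.

```python
import pandas as pd

# Toy data (made-up dates).
df = pd.DataFrame(
    {"date": pd.to_datetime(["2000-01-01", "2000-04-01", "2009-12-01"])}
)

# Elementwise version with map, and vectorized version with the dt accessor.
df["month0"] = df["date"].map(lambda ts: ts.month)
df["month"] = df["date"].dt.month

# Comparing two columns with == gives a Boolean Series, one entry per row.
same = df["month"] == df["month0"]

# all() is True only if every entry of the Boolean Series is True.
print(same.all())  # True
```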
Evaluate the following. What is it telling us about the presence of missing values in df?
df.isna().any(axis=1).any()
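To see what each step of that expression does, it can help to try it on a tiny DataFrame that deliberately contains one missing value. The DataFrame toy below is hypothetical and unrelated to the unemployment data.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with a single missing value in the second row.
toy = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})

# isna() gives a Boolean DataFrame; any(axis=1) collapses each row to one Boolean.
row_has_na = toy.isna().any(axis=1)
print(row_has_na.tolist())  # [False, True]

# The final any() asks whether at least one row contains a missing value.
any_na = row_has_na.any()
print(any_na)  # True
```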
Normalizing the data
Make a pandas Series mean_ser containing the average unemployment rate for each industry, using the following code.
mean_ser = df.groupby("industry")["rate"].mean()
Make the analogous pandas Series for standard deviation, and name it std_ser.
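A small sketch of both groupby computations on toy data follows. The industry names "A" and "B" and the rates are made up. Note that the pandas std method uses the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Toy data (made-up industries and rates).
df = pd.DataFrame(
    {"industry": ["A", "A", "B", "B"], "rate": [1.0, 3.0, 4.0, 8.0]}
)

# One value per industry; the index of each Series is the industry name.
mean_ser = df.groupby("industry")["rate"].mean()
std_ser = df.groupby("industry")["rate"].std()

print(mean_ser)
print(std_ser)
```

Because both Series are indexed by industry, a value can be looked up by name, for example mean_ser["A"].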
Write a function make_norm which takes as input a row of df (not a row label but the whole row as a pandas Series) and as output returns the normalized unemployment rate, where by “normalized” we mean that you should subtract the mean for that industry and divide by the standard deviation for that industry.
For example, if rate is 7, the mean for the industry is 1.2, and the standard deviation for the industry is 4.3, then the function should return (7 - 1.2)/4.3.
Using apply, the above function make_norm, and a suitable axis argument, for each row in df, normalize the unemployment rate (so that, within each industry, the mean becomes 0 and the standard deviation becomes 1). We need to use apply and not map here, because we need to know the industry, and that requires access to the whole row rather than just the rate. Put the result in a new column in df called "norm_rate".
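One way this could look on toy data is sketched below. The industries "A" and "B" and the rates are made up, and this is only one possible way to write make_norm, not necessarily the intended solution.

```python
import pandas as pd

# Toy data (made-up industries and rates).
df = pd.DataFrame(
    {
        "industry": ["A", "A", "A", "B", "B", "B"],
        "rate": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    }
)
mean_ser = df.groupby("industry")["rate"].mean()
std_ser = df.groupby("industry")["rate"].std()


def make_norm(row):
    # row is a whole row of df, as a pandas Series indexed by column name.
    ind = row["industry"]
    return (row["rate"] - mean_ser[ind]) / std_ser[ind]


# axis=1 passes one row at a time to make_norm (axis=0 would pass columns).
df["norm_rate"] = df.apply(make_norm, axis=1)
print(df)
```

In this toy example each industry has mean 2 or 20 and standard deviation 1 or 10, so the normalized rates come out as -1, 0, 1 within each industry.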
Using groupby, check that the means for the various industries of this new "norm_rate" column are all very close to 0, and the standard deviations are all close to 1.
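The check might be sketched as follows, using a toy "norm_rate" column that is hard-coded (made-up values) so the example stands alone. Using agg with a list of method names is just one convenient option; calling mean and std separately works equally well. With real data the means will typically differ from 0 by tiny floating-point errors, which is why a tolerance-based comparison such as np.allclose is used here.

```python
import numpy as np
import pandas as pd

# Toy data with an already-normalized column (made-up values).
df = pd.DataFrame(
    {
        "industry": ["A", "A", "A", "B", "B", "B"],
        "norm_rate": [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0],
    }
)

# One row per industry, with a "mean" column and a "std" column.
check = df.groupby("industry")["norm_rate"].agg(["mean", "std"])
print(check)

means_ok = np.allclose(check["mean"], 0)
stds_ok = np.allclose(check["std"], 1)
print(means_ok, stds_ok)
```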
Plotting the data
Make an Altair chart of this data, using mark_line, using “date” for the x-channel, using “norm_rate” for the y-channel, and using “industry” for the color.
Side question: can you recognize the impact of the 2008 financial crisis?
Make the same Altair chart, but change from “date” to “month” for the x-channel.
(It will look like a mess because each industry has the same month repeated many times, corresponding to different years.)
Make the same chart as in the previous cell, but change from y="norm_rate" to y="mean(norm_rate)". This will replace the many-points-per-industry with a single point per industry for each month. Store the chart with the variable name c, and display this chart.
Notice that “Government” appears to have the highest average normalized unemployment in July. Using Boolean indexing and mean, compute this average directly using pandas, and make sure it matches the value you see in Altair. (It should be approximately 1.3.)
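The Boolean indexing pattern can be sketched on toy data as follows. All values here are made up (so the result is not the 1.3 expected from the real data), and "Finance" is only a placeholder industry name. Two conditions are combined with &, and each condition needs its own parentheses.

```python
import pandas as pd

# Toy data (made-up values).
df = pd.DataFrame(
    {
        "industry": ["Government", "Government", "Government", "Finance"],
        "month": [7, 7, 1, 7],
        "norm_rate": [1.0, 2.0, -0.5, 0.3],
    }
)

# Boolean Series that is True only for Government rows in July.
mask = (df["industry"] == "Government") & (df["month"] == 7)

# Keep the matching rows of the norm_rate column, then average them.
july_gov_mean = df.loc[mask, "norm_rate"].mean()
print(july_gov_mean)  # 1.5
```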
Submission
Save the chart as a json file using the following code and upload that json file to Canvas.
with open("chart.json", "w") as f:
    f.write(c.to_json())