Worksheet 4
Contents
Worksheet 4#
This worksheet is due Tuesday of Week 3, before discussion section. You are encouraged to work in groups of up to 3 total students, but each student should submit their own file. (It’s fine for everyone in the group to upload the same file.)
Recommendation. Follow the Worksheet 0 instructions to form a free Pro workspace for you and your groupmates, so you can all work on the same file.
These questions refer to the attached stock market-related dataset, indexData.csv.
Hint/Warning 1. We’re less explicit than usual in this file about what names to give the DataFrames. You can either rename each new version of the DataFrame, or make the changes “inplace”.
Hint/Warning 2. You will probably get some warnings showing up, saying something like the following.
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame…
To avoid those warnings, you should add a .copy()
to the end of an expression where you are removing data from a DataFrame. (Lots of times, the expression needing .copy()
is earlier in your code, not the particular cell that raised the warning.)
Questions#
The goal of this worksheet is to create a csv file containing some normalized data from the indexData.csv file. Load that file.
Rename the “Name” column to “Abbreviation”.
Make a column of datetime values from the “Date” column, by using the same procedure as in Worksheet 3. (It’s not as simple as calling
pd.to_datetime(df["Date"])
, because in this particular case, a date like 1/1/65 will get interpreted as being in the year 2065.)
Remove all rows from before “1/1/2000” using Boolean indexing.
You can create a Boolean Series indicating which rows have missing data in a DataFrame
df
by evaluatingdf.isna().any(axis=1)
. (Theaxis=1
keyword argument says to look at each row individually. If instead you usedaxis=0
, it would look at each column individually.) Using Boolean indexing and the elementwise negation of the previous Boolean Series, remove the rows which have missing data.
Keep only 3 columns in the DataFrame: The “Abbreviation” column, the column with the datetime values, and the “Open” column.
Check your answer. The current DataFrame should have 69540 rows and 3 columns.
Create a pandas GroupBy object from this DataFrame, grouping by the “Abbreviation” column. Call the
first
method on this GroupBy object and display the resulting DataFrame. (Ask yourself: what does the displayed DataFrame represent?)
One of the stock exchanges appears to be an outlier, in the sense that its data starts much later than the data of the other stock exchanges. Remove that stock exchange from our DataFrame using Boolean indexing.
Check your answer. The new DataFrame should have 67194 rows and 3 columns.
An issue with our data is that different stock exchanges are difficult to compare. (For example, the currencies are different.) Create a new column called “NormOpen” that initially has the same values as the “Open” column.
(More difficult.) For each stock exchange, rescale the “NormOpen” column (multiply by a suitable constant) so that the initial value is (up to numerical precision issues) 1000. (So you should multiply by a different constant for each stock exchange. Use the
groupby
call that you made above to find that constant. It’s fine to type1000
, but be sure you’re not typing out the rescaling constants explicitly.)
Check your answer. If you call
.groupby("Abbreviation").mean()
on your DataFrame, the Shanghai stock exchange, with abbreviation “000001.SS”, should have an “Open” mean of approximately 2510 and a “NormOpen” mean of approximately 1834. The “NormOpen” mean should be interpreted as being approximately 83% higher than in January 2000.
Save your DataFrame to a csv file using the DataFrame method
to_csv
. Use the keyword argumentindex=False
so that the index of the DataFrame is not included in the csv file.
Upload this csv file to Canvas as your Worksheet 4 submission.