Worksheet 3
Contents
Worksheet 3#
Authors (3 maximum): BLANK
This worksheet is due Tuesday of Week 3, before discussion section. You are encouraged to work in groups of up to 3 total students, but each student should submit their own file. (It’s fine for everyone in the group to upload the same file.)
Recommendation. Follow the Worksheet 0 instructions to form a free Pro workspace for you and your groupmates, so you can all work on the same file.
These questions refer to the two attached stock market-related datasets, indexData.csv and indexInfo.csv.
Importing the data#
Import the attached “indexInfo.csv” file and name the result
df_pre
.Import the attached “indexData.csv” file and name the result
df
.Look at the top few rows of these DataFrames to get a sense for their contents.
Adjusting the DataFrame from indexInfo.csv#
Define a dictionary
d
so that the following code will rename the “Index” column as “Abbreviation” and will rename the “Exchange” column as “Full Name”, and save the resulting DataFrame with the variable namedf_info
. Theaxis=1
part is saying to change a column label, not a row label. Documentation
df_info = df_pre.rename(d, axis=1)
Use the
set_index
method (documentation) to set the “Abbreviation” column as the new index of this DataFrame.Use the
inplace
keyword argument to make the change todf_info
itself (rather than outputting a new DataFrame). (Warning. You can’t execute this cell twice, because the “Abbreviation” column will disappear when you move it to the index.)
Using a
lambda
function, write a functionfull_name
which takes as input an abbreviation string like “N225” and as output returns the full name, like “Tokyo Stock Exchange”. As another example,full_name("NYA")
should be equal to “New York Stock Exchange”. (Hint. Figure out how to usedf_info
. This question will be much easier because you set theindex
in the previous part.)
Processing dates#
The “Date” column of df
contains strings like “12/31/65”, which represents “12/31/1965”, as well as strings like “6/2/21”, which represents “6/2/2021”.
Write a function
fix_date
which takes as input strings like “12/31/65” and “6/2/21”, and as output returns strings like, for example, “12/31/1965” and “6/2/2021”. (Hint. Ifs
is a string, thens[-2:]
represents the last two characters, ands[:-2]
represents everything but the last two characters. Notice that there are no dates in the “Date” column from before 1960, so if something like “1/1/10” appears, you can be sure that it stands for 2010 and not 1910. Try to make the code for yourfix_date
function DRY (Don’t Repeat Yourself), including as little repetition as possible. Don’t force yourself to use alambda
function.)
Use the pandas Series method
map
to applyfix_date
(you don’t need a lambda function, becausefix_date
is already a function) to all of the entries in the “Date” column, and then applypd.to_datetime
to the resulting Series. (Make surepd.to_datetime
is outside of themap
… we want to apply it to the whole Series, not to the individual entries.) Insert the result as a new column indf
, named “Date2”.
Check your work. If you evaluate
df.dtypes
, you should see that the “Date” column has “object” as its dtype (which in this case means string) and that the “Date2” column hasdatetime64[ns]
(or something similar) as its dtype.
If we instead try to use a for loop, this will be much slower.
First create a new column named “Date3” filled with today’s date (as a datetime or Timestamp object, not as a string) in every position.
Now run the following code (replacing the
???
with an appropriate expression). Even though this is only setting 5000 values, as opposed to the previous step, which set over 100,000 values, this will still take a few seconds. Time how long this code takes to execute by putting%%time
at the top of the Deepnote cell.
for i in ???[:5000]:
df.loc[i, "Date3"] = pd.to_datetime(fix_date(df.loc[i, "Date"]))
Printing information from indexData.csv#
Using a for-loop and an f-string, for each distinct value in the “Name” column of
df
, print the analogue of the following string (so you should print exactly this string for"NYA"
):
There are 13948 rows for the New York Stock Exchange (abbreviation: NYA) in the dataset. The earliest date for the New York Stock Exchange in the dataset is Friday, 1965-12-31.
Comments: Include a blank line between the strings for different stock exchanges to make it easier to read. If you want to have a Python string that includes multiple lines, you can use triple apostrophes, like '''Here could be a string with line breaks'''
. Use Boolean indexing to get the corresponding sub-DataFrame (later we’ll see how to do the same thing using groupby
, but it’s important to be able to also do this directly using Boolean indexing). Be sure to include the day of the week, the full name of the stock exchange, and the number of rows. (Use the full_name
you wrote above to get the full names of the stock exchanges.)
Reminder#
Every group member needs to submit this on Canvas (even if you all submit the same link).
Be sure you’ve included the names of you and your group members at the top after “Authors”. You might consider restarting this notebook (the Deepnote refresh icon, not your browser refresh icon) and running it again to make sure it’s working.
Submission#
Using the Share & publish menu at the top right, enable public sharing in the “Share project” section, and enable Comment privileges. Then submit that link on Canvas.
Created in Deepnote