Worksheet 3#

Authors (3 maximum): BLANK

This worksheet is due Tuesday of Week 3, before discussion section. You are encouraged to work in groups of up to 3 total students, but each student should submit their own file. (It’s fine for everyone in the group to upload the same file.)

Recommendation. Follow the Worksheet 0 instructions to form a free Pro workspace for you and your groupmates, so you can all work on the same file.

These questions refer to the two attached stock market-related datasets, indexData.csv and indexInfo.csv.

Importing the data#

  • Import the attached “indexInfo.csv” file and name the result df_pre.

  • Import the attached “indexData.csv” file and name the result df.

  • Look at the top few rows of these DataFrames to get a sense for their contents.

Adjusting the DataFrame from indexInfo.csv#

  • Define a dictionary d so that the following code will rename the “Index” column as “Abbreviation” and will rename the “Exchange” column as “Full Name”, and save the resulting DataFrame with the variable name df_info. The axis=1 part is saying to change a column label, not a row label. Documentation

df_info = df_pre.rename(d, axis=1)
  • Use the set_index method (documentation) to set the “Abbreviation” column as the new index of this DataFrame.

  • Use the inplace keyword argument to make the change to df_info itself (rather than outputting a new DataFrame). (Warning. You can’t execute this cell twice, because the “Abbreviation” column will disappear when you move it to the index.)

  • Using a lambda function, write a function full_name which takes as input an abbreviation string like “N225” and as output returns the full name, like “Tokyo Stock Exchange”. As another example, full_name("NYA") should be equal to “New York Stock Exchange”. (Hint. Figure out how to use df_info. This question will be much easier because you set the index in the previous part.)

Processing dates#

The “Date” column of df contains strings like “12/31/65”, which represents “12/31/1965”, as well as strings like “6/2/21”, which represents “6/2/2021”.

  • Write a function fix_date which takes as input strings like “12/31/65” and “6/2/21”, and as output returns strings like, for example, “12/31/1965” and “6/2/2021”. (Hint. If s is a string, then s[-2:] represents the last two characters, and s[:-2] represents everything but the last two characters. Notice that there are no dates in the “Date” column from before 1960, so if something like “1/1/10” appears, you can be sure that it stands for 2010 and not 1910. Try to make the code for your fix_date function DRY (Don’t Repeat Yourself), including as little repetition as possible. Don’t force yourself to use a lambda function.)

  • Use the pandas Series method map to apply fix_date (you don’t need a lambda function, because fix_date is already a function) to all of the entries in the “Date” column, and then apply pd.to_datetime to the resulting Series. (Make sure pd.to_datetime is outside of the map… we want to apply it to the whole Series, not to the individual entries.) Insert the result as a new column in df, named “Date2”.

  • Check your work. If you evaluate df.dtypes, you should see that the “Date” column has “object” as its dtype (which in this case means string) and that the “Date2” column has datetime64[ns] (or something similar) as its dtype.

If we instead try to use a for loop, this will be much slower.

  • First create a new column named “Date3” filled with today’s date (as a datetime or Timestamp object, not as a string) in every position.

  • Now run the following code (replacing the ??? with an appropriate expression). Even though this is only setting 5000 values, as opposed to the previous step, which set over 100,000 values, this will still take a few seconds. Time how long this code takes to execute by putting %%time at the top of the Deepnote cell.

for i in ???[:5000]:
    df.loc[i, "Date3"] = pd.to_datetime(fix_date(df.loc[i, "Date"]))

Printing information from indexData.csv#

  • Using a for-loop and an f-string, for each distinct value in the “Name” column of df, print the analogue of the following string (so you should print exactly this string for "NYA"):

There are 13948 rows for the New York Stock Exchange (abbreviation: NYA) in the dataset. The earliest date for the New York Stock Exchange in the dataset is Friday, 1965-12-31.

Comments: Include a blank line between the strings for different stock exchanges to make it easier to read. If you want to have a Python string that includes multiple lines, you can use triple apostrophes, like '''Here could be a string with line breaks'''. Use Boolean indexing to get the corresponding sub-DataFrame (later we’ll see how to do the same thing using groupby, but it’s important to be able to also do this directly using Boolean indexing). Be sure to include the day of the week, the full name of the stock exchange, and the number of rows. (Use the full_name you wrote above to get the full names of the stock exchanges.)

Reminder#

Every group member needs to submit this on Canvas (even if you all submit the same link).

Be sure you’ve included the names of you and your group members at the top after “Authors”. You might consider restarting this notebook (the Deepnote refresh icon, not your browser refresh icon) and running it again to make sure it’s working.

Submission#

Using the Share & publish menu at the top right, enable public sharing in the “Share project” section, and enable Comment privileges. Then submit that link on Canvas.

Created in deepnote.com Created in Deepnote