Exploratory Data Analysis¶

In this worksheet, we’ll investigate the indexData.csv file using the techniques from class so far as well as some new methods. These questions are designed to help with Homework 1 (due Tuesday night).

Start by importing the pandas library and abbreviating it by pd.
Read in the contents of indexData.csv and save the result with the variable name df.

import pandas as pd
df = pd.read_csv("../data/indexData.csv")

describe and info¶

Change the title of this section from “Section 1” to “describe and info”. Be sure it still remains a heading.
Evaluate df.describe().

df.describe()

	Open	High	Low	Close	Adj Close	Volume
count	110253.000000	110253.000000	110253.000000	110253.000000	110253.000000	1.102530e+05
mean	7658.515222	7704.372961	7608.000422	7657.545872	7657.351729	1.273975e+09
std	9011.478913	9066.638548	8954.506981	9011.510444	9011.608900	4.315783e+09
min	54.869999	54.869999	54.869999	54.869999	54.869999	0.000000e+00
25%	1855.030029	1864.510010	1843.979980	1855.060059	1854.179565	0.000000e+00
50%	5194.089844	5226.279785	5154.049805	5194.750000	5194.750000	4.329000e+05
75%	10134.299810	10207.820310	10060.349610	10134.830080	10134.830080	1.734314e+08
max	68775.062500	69403.750000	68516.992190	68775.062500	68775.062500	9.440374e+10

This df.describe() isn’t very useful in this case because it combines values for many different stock exchanges.

Make a new DataFrame dfny containing only the New York stock exchange info “NYA”. (Use Boolean indexing.)
Check your answer: the shape of dfny should be (13948, 8).
Call the describe method on dfny.
What is the mean value in the “Close” column? What is the median value in the “Close” column? (Do you find the difference surprising?)

dfny = df[df["Name"]=="NYA"]
dfny.shape

(13948, 8)

dfny.describe()

	Open	High	Low	Close	Adj Close	Volume
count	13947.000000	13947.000000	13947.000000	13947.000000	13947.000000	1.394700e+04
mean	4451.778151	4468.236552	4433.695014	4452.174711	4452.174711	1.215565e+09
std	4074.953735	4095.166918	4052.801658	4075.459366	4075.459366	1.834155e+09
min	347.769989	347.769989	347.769989	347.769989	347.769989	0.000000e+00
25%	654.940002	654.940002	654.940002	654.940002	654.940002	0.000000e+00
50%	2631.800049	2631.800049	2631.800049	2631.800049	2631.800049	0.000000e+00
75%	7339.489990	7375.719971	7276.889893	7337.935059	7337.935059	2.681975e+09
max	16590.429690	16685.890630	16531.949220	16590.429690	16590.429690	1.145623e+10

We can see that the mean value of “Close” is approximately 4452 and the median value is approximately 2631. This is an example where mean and median are very different.

I personally find this describe info easier to read if the DataFrame is transposed. Use .T to transpose the data.

dfny.describe().T

	count	mean	std	min	25%	50%	75%	max
Open	13947.0	4.451778e+03	4.074954e+03	347.769989	654.940002	2631.800049	7.339490e+03	1.659043e+04
High	13947.0	4.468237e+03	4.095167e+03	347.769989	654.940002	2631.800049	7.375720e+03	1.668589e+04
Low	13947.0	4.433695e+03	4.052802e+03	347.769989	654.940002	2631.800049	7.276890e+03	1.653195e+04
Close	13947.0	4.452175e+03	4.075459e+03	347.769989	654.940002	2631.800049	7.337935e+03	1.659043e+04
Adj Close	13947.0	4.452175e+03	4.075459e+03	347.769989	654.940002	2631.800049	7.337935e+03	1.659043e+04
Volume	13947.0	1.215565e+09	1.834155e+09	0.000000	0.000000	0.000000	2.681975e+09	1.145623e+10

For example, this tells us that 75% of the values in the “Close” column are below 7337.935, and 25% are above. Check this using Boolean indexing and then finding the length len(???) of the resulting DataFrame (len gives the number of rows).

len(dfny[dfny["Close"]<7337.935])/len(dfny)

0.7499283051333524

What is some information visible in the result of df.info()?

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112457 entries, 0 to 112456
Data columns (total 8 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   Name       112457 non-null  object 
 1   Date       112457 non-null  object 
 2   Open       110253 non-null  float64
 3   High       110253 non-null  float64
 4   Low        110253 non-null  float64
 5   Close      110253 non-null  float64
 6   Adj Close  110253 non-null  float64
 7   Volume     110253 non-null  float64
dtypes: float64(6), object(2)
memory usage: 6.9+ MB

dfny shows lots of information. For example, it says which columns have data type float.

Do you see how to count the number of missing values (or null values) in the “Open” column using the result of df.info()? Does your result match the result of df.Open.isna().sum(), which is the “usual” way to count missing values?

# Missing values
112457-110253

df.Open.isna().sum()

value_counts¶

Define a new DataFrame dfhk containing only the info from the Hong Kong stock index, “HSI”. (Check: the shape of dfhk should be 8750 rows by 8 columns.)

dfhk = df[df["Name"]=="HSI"].copy()

Evaluate dfhk["Close"].value_counts(). This will tell you, for example, that 3066.00000 occurred 5 times in the “Close” column, and no other value occurred more than 3 times.

dfhk["Close"].value_counts()

000000     5
500000     3
000000     3
000000     3
000000     3
               ..
120117     1
389648     1
419922     1
799805     1
800780    1
Name: Close, Length: 8336, dtype: int64

It might seem difficult to use value_counts on this data, because there isn’t much repetition. Here is a nice way.

Create a new column using the line dfhk["Diff"] = dfhk["High"] - dfhk["Low"]. If you get a pandas warning “A value is trying to be set on a copy of a slice from a DataFrame”, go back to the line where you defined dfhk and add a .copy() at the end of the line.

dfhk["Diff"] = dfhk["High"] - dfhk["Low"]

Look at the top 4 rows of dfhk, and notice how there is now a ninth column on the right. Try to access these rows two different ways: using iloc[:4] (this is called slicing) and using head(4).

dfhk.iloc[:4]

	Name	Date	Open	High	Low	Close	Adj Close	Volume	Diff
26638	HSI	12/31/86	2568.300049	2568.300049	2568.300049	2568.300049	2568.300049	0.0	0.0
26639	HSI	1/1/87	NaN	NaN	NaN	NaN	NaN	NaN	NaN
26640	HSI	1/2/87	2540.100098	2540.100098	2540.100098	2540.100098	2540.100098	0.0	0.0
26641	HSI	1/5/87	2552.399902	2552.399902	2552.399902	2552.399902	2552.399902	0.0	0.0

dfhk.head(4)

	Name	Date	Open	High	Low	Close	Adj Close	Volume	Diff
26638	HSI	12/31/86	2568.300049	2568.300049	2568.300049	2568.300049	2568.300049	0.0	0.0
26639	HSI	1/1/87	NaN	NaN	NaN	NaN	NaN	NaN	NaN
26640	HSI	1/2/87	2540.100098	2540.100098	2540.100098	2540.100098	2540.100098	0.0	0.0
26641	HSI	1/5/87	2552.399902	2552.399902	2552.399902	2552.399902	2552.399902	0.0	0.0

Make a new column called “Volatility” (I’m not sure if this is the technical definition, so I’m just using the word informally) which is the “Diff” column divided by the “Open” column.

dfhk["Volatility"] = dfhk["Diff"]/dfhk["Open"]

Write a markdown cell explaining in words what the “Volatility” column represents.

This “Volatility” column indicates what proportion of the opening value the stocks swung during the day.

Check the value_counts for this column using dfhk["Volatility"].value_counts(). It looks pretty useless.

dfhk["Volatility"].value_counts()

000000    576
094471      2
007531      2
005545      2
014085      2
           ... 
010021      1
026140      1
010504      1
018395      1
010810      1
Name: Volatility, Length: 7909, dtype: int64

But here’s a nice trick. If we compute dfhk["Volatility"].round(2), what is the result?

This rounds each to two decimal places.

dfhk["Volatility"].round(2)

  0.00
   NaN
  0.00
  0.00
  0.00
         ... 
  0.02
  0.01
  0.01
  0.01
  0.01
Name: Volatility, Length: 8750, dtype: float64

Now apply value_counts to that rounded result.

dfhk["Volatility"].round(2).value_counts()

01    4876
02    1974
00     826
03     533
04     160
05      59
06      28
07      14
08       9
09       7
11       2
13       2
17       1
10       1
Name: Volatility, dtype: int64

Answer the following quesiton in a markdown cell: How many days in the dataset did the Hong Kong stock market move 6%?

According to this information, it seems like on 28 days, the stock market moved 6% during the day.

Created in Deepnote

UC Irvine Math 10 S22

Exploratory Data Analysis

Contents

Exploratory Data Analysis¶

describe and info¶

value_counts¶