Exploratory Data Analysis
Contents
Exploratory Data Analysis¶
In this worksheet, we’ll investigate the indexData.csv file using the techniques from class so far as well as some new methods. These questions are designed to help with Homework 1 (due Tuesday night).
Start by importing the pandas library and abbreviating it by
pd
.Read in the contents of indexData.csv and save the result with the variable name
df
.
import pandas as pd
df = pd.read_csv("../data/indexData.csv")
describe and info¶
Change the title of this section from “Section 1” to “describe and info”. Be sure it still remains a heading.
Evaluate
df.describe()
.
df.describe()
Open | High | Low | Close | Adj Close | Volume | |
---|---|---|---|---|---|---|
count | 110253.000000 | 110253.000000 | 110253.000000 | 110253.000000 | 110253.000000 | 1.102530e+05 |
mean | 7658.515222 | 7704.372961 | 7608.000422 | 7657.545872 | 7657.351729 | 1.273975e+09 |
std | 9011.478913 | 9066.638548 | 8954.506981 | 9011.510444 | 9011.608900 | 4.315783e+09 |
min | 54.869999 | 54.869999 | 54.869999 | 54.869999 | 54.869999 | 0.000000e+00 |
25% | 1855.030029 | 1864.510010 | 1843.979980 | 1855.060059 | 1854.179565 | 0.000000e+00 |
50% | 5194.089844 | 5226.279785 | 5154.049805 | 5194.750000 | 5194.750000 | 4.329000e+05 |
75% | 10134.299810 | 10207.820310 | 10060.349610 | 10134.830080 | 10134.830080 | 1.734314e+08 |
max | 68775.062500 | 69403.750000 | 68516.992190 | 68775.062500 | 68775.062500 | 9.440374e+10 |
This df.describe()
isn’t very useful in this case because it combines values for many different stock exchanges.
Make a new DataFrame
dfny
containing only the New York stock exchange info “NYA”. (Use Boolean indexing.)Check your answer: the
shape
ofdfny
should be (13948, 8).Call the
describe
method ondfny
.What is the mean value in the “Close” column? What is the median value in the “Close” column? (Do you find the difference surprising?)
dfny = df[df["Name"]=="NYA"]
dfny.shape
(13948, 8)
dfny.describe()
Open | High | Low | Close | Adj Close | Volume | |
---|---|---|---|---|---|---|
count | 13947.000000 | 13947.000000 | 13947.000000 | 13947.000000 | 13947.000000 | 1.394700e+04 |
mean | 4451.778151 | 4468.236552 | 4433.695014 | 4452.174711 | 4452.174711 | 1.215565e+09 |
std | 4074.953735 | 4095.166918 | 4052.801658 | 4075.459366 | 4075.459366 | 1.834155e+09 |
min | 347.769989 | 347.769989 | 347.769989 | 347.769989 | 347.769989 | 0.000000e+00 |
25% | 654.940002 | 654.940002 | 654.940002 | 654.940002 | 654.940002 | 0.000000e+00 |
50% | 2631.800049 | 2631.800049 | 2631.800049 | 2631.800049 | 2631.800049 | 0.000000e+00 |
75% | 7339.489990 | 7375.719971 | 7276.889893 | 7337.935059 | 7337.935059 | 2.681975e+09 |
max | 16590.429690 | 16685.890630 | 16531.949220 | 16590.429690 | 16590.429690 | 1.145623e+10 |
We can see that the mean value of “Close” is approximately 4452 and the median value is approximately 2631. This is an example where mean and median are very different.
I personally find this
describe
info easier to read if the DataFrame is transposed. Use.T
to transpose the data.
dfny.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
Open | 13947.0 | 4.451778e+03 | 4.074954e+03 | 347.769989 | 654.940002 | 2631.800049 | 7.339490e+03 | 1.659043e+04 |
High | 13947.0 | 4.468237e+03 | 4.095167e+03 | 347.769989 | 654.940002 | 2631.800049 | 7.375720e+03 | 1.668589e+04 |
Low | 13947.0 | 4.433695e+03 | 4.052802e+03 | 347.769989 | 654.940002 | 2631.800049 | 7.276890e+03 | 1.653195e+04 |
Close | 13947.0 | 4.452175e+03 | 4.075459e+03 | 347.769989 | 654.940002 | 2631.800049 | 7.337935e+03 | 1.659043e+04 |
Adj Close | 13947.0 | 4.452175e+03 | 4.075459e+03 | 347.769989 | 654.940002 | 2631.800049 | 7.337935e+03 | 1.659043e+04 |
Volume | 13947.0 | 1.215565e+09 | 1.834155e+09 | 0.000000 | 0.000000 | 0.000000 | 2.681975e+09 | 1.145623e+10 |
For example, this tells us that 75% of the values in the “Close” column are below 7337.935, and 25% are above. Check this using Boolean indexing and then finding the length
len(???)
of the resulting DataFrame (len
gives the number of rows).
len(dfny[dfny["Close"]<7337.935])/len(dfny)
0.7499283051333524
What is some information visible in the result of
df.info()
?
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112457 entries, 0 to 112456
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 112457 non-null object
1 Date 112457 non-null object
2 Open 110253 non-null float64
3 High 110253 non-null float64
4 Low 110253 non-null float64
5 Close 110253 non-null float64
6 Adj Close 110253 non-null float64
7 Volume 110253 non-null float64
dtypes: float64(6), object(2)
memory usage: 6.9+ MB
dfny shows lots of information. For example, it says which columns have data type float.
Do you see how to count the number of missing values (or null values) in the “Open” column using the result of
df.info()
? Does your result match the result ofdf.Open.isna().sum()
, which is the “usual” way to count missing values?
# Missing values
112457-110253
2204
df.Open.isna().sum()
2204
value_counts¶
Define a new DataFrame
dfhk
containing only the info from the Hong Kong stock index, “HSI”. (Check: the shape ofdfhk
should be 8750 rows by 8 columns.)
dfhk = df[df["Name"]=="HSI"].copy()
Evaluate
dfhk["Close"].value_counts()
. This will tell you, for example, that3066.00000
occurred 5 times in the “Close” column, and no other value occurred more than 3 times.
dfhk["Close"].value_counts()
3066.000000 5
2584.500000 3
3083.000000 3
2990.000000 3
2871.000000 3
..
9254.120117 1
9402.389648 1
9425.419922 1
9146.799805 1
29151.800780 1
Name: Close, Length: 8336, dtype: int64
It might seem difficult to use value_counts
on this data, because there isn’t much repetition. Here is a nice way.
Create a new column using the line
dfhk["Diff"] = dfhk["High"] - dfhk["Low"]
. If you get a pandas warning “A value is trying to be set on a copy of a slice from a DataFrame”, go back to the line where you defineddfhk
and add a.copy()
at the end of the line.
dfhk["Diff"] = dfhk["High"] - dfhk["Low"]
Look at the top 4 rows of
dfhk
, and notice how there is now a ninth column on the right. Try to access these rows two different ways: usingiloc[:4]
(this is called slicing) and usinghead(4)
.
dfhk.iloc[:4]
Name | Date | Open | High | Low | Close | Adj Close | Volume | Diff | |
---|---|---|---|---|---|---|---|---|---|
26638 | HSI | 12/31/86 | 2568.300049 | 2568.300049 | 2568.300049 | 2568.300049 | 2568.300049 | 0.0 | 0.0 |
26639 | HSI | 1/1/87 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
26640 | HSI | 1/2/87 | 2540.100098 | 2540.100098 | 2540.100098 | 2540.100098 | 2540.100098 | 0.0 | 0.0 |
26641 | HSI | 1/5/87 | 2552.399902 | 2552.399902 | 2552.399902 | 2552.399902 | 2552.399902 | 0.0 | 0.0 |
dfhk.head(4)
Name | Date | Open | High | Low | Close | Adj Close | Volume | Diff | |
---|---|---|---|---|---|---|---|---|---|
26638 | HSI | 12/31/86 | 2568.300049 | 2568.300049 | 2568.300049 | 2568.300049 | 2568.300049 | 0.0 | 0.0 |
26639 | HSI | 1/1/87 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
26640 | HSI | 1/2/87 | 2540.100098 | 2540.100098 | 2540.100098 | 2540.100098 | 2540.100098 | 0.0 | 0.0 |
26641 | HSI | 1/5/87 | 2552.399902 | 2552.399902 | 2552.399902 | 2552.399902 | 2552.399902 | 0.0 | 0.0 |
Make a new column called “Volatility” (I’m not sure if this is the technical definition, so I’m just using the word informally) which is the “Diff” column divided by the “Open” column.
dfhk["Volatility"] = dfhk["Diff"]/dfhk["Open"]
Write a markdown cell explaining in words what the “Volatility” column represents.
This “Volatility” column indicates what proportion of the opening value the stocks swung during the day.
Check the
value_counts
for this column usingdfhk["Volatility"].value_counts()
. It looks pretty useless.
dfhk["Volatility"].value_counts()
0.000000 576
0.094471 2
0.007531 2
0.005545 2
0.014085 2
...
0.010021 1
0.026140 1
0.010504 1
0.018395 1
0.010810 1
Name: Volatility, Length: 7909, dtype: int64
But here’s a nice trick. If we compute
dfhk["Volatility"].round(2)
, what is the result?
This rounds each to two decimal places.
dfhk["Volatility"].round(2)
26638 0.00
26639 NaN
26640 0.00
26641 0.00
26642 0.00
...
35383 0.02
35384 0.01
35385 0.01
35386 0.01
35387 0.01
Name: Volatility, Length: 8750, dtype: float64
Now apply
value_counts
to that rounded result.
dfhk["Volatility"].round(2).value_counts()
0.01 4876
0.02 1974
0.00 826
0.03 533
0.04 160
0.05 59
0.06 28
0.07 14
0.08 9
0.09 7
0.11 2
0.13 2
0.17 1
0.10 1
Name: Volatility, dtype: int64
Answer the following quesiton in a markdown cell: How many days in the dataset did the Hong Kong stock market move 6%?
According to this information, it seems like on 28 days, the stock market moved 6% during the day.
Created in Deepnote