Project Title :Stock Price of Famous Companies¶

Course Project, UC Irvine, Math 10, S22

Introduction¶

my dataset is dataset of all the US stock within 5 years but it’s a little big. I make three subset which including three stocks in the stock market, which are AAL, Facebook and Apple. In this project I mainly focs on visualization of stock price and make prediction by using MachineLearning.
There are four parts of this project, first two parts are prediction using linear regression and polynomial regression. Last two parts are visualization by using Plotly and Altair in terms of history stock price of AAL and comparisons between Apple and Facebook

Predict close price of AAL stock¶

Firstly, read the dataset and clean the data,becasue the dataset is too large and altair only shows max row of 5000, here I choose one stock(AAL) sample from original dataset

import numpy as np
import pandas as pd
import seaborn as sns
import altair as alt

df= pd.read_csv("all_stocks_5yr.csv")

df=df.dropna().copy()
df=df[df["Name"]=="AAL"].copy()
df.head()

	date	open	high	low	close	volume	Name
0	2013-02-08	15.07	15.12	14.63	14.75	8407500	AAL
1	2013-02-11	14.89	15.01	14.26	14.46	8882000	AAL
2	2013-02-12	14.45	14.51	14.10	14.27	8126000	AAL
3	2013-02-13	14.30	14.94	14.25	14.66	10259500	AAL
4	2013-02-14	14.94	14.96	13.16	13.99	31879900	AAL

we can see now we drop the missing value and get the ideal AAL dataframe

Secondly, I want to use open, high and low price to predict the close price of a stock

from sklearn.linear_model import LinearRegression 
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

let us visualize the close price history first

plt.figure(figsize=(16,6))
plt.title('Close Price History')
plt.plot(df['close'])
plt.xlabel('index', fontsize=16)
plt.ylabel('Close Price USD ($)', fontsize=13)
plt.show()

reg = LinearRegression()
cols=["open","high","low"]
X_train,X_test,y_train,y_test = train_test_split(df[cols],df["close"],train_size=0.3,random_state=0)

I use linearegression of train set to predict test set here

reg.fit(X_train,y_train)

LinearRegression()

df_sub=X_test
df_sub["Pred"]=reg.predict(X_test)
df_sub["True"]=y_test

df_sub.reset_index()

	index	open	high	low	Pred	True
0	5	13.93	14.6100	13.93	14.474599	14.500
1	494	53.99	54.5200	52.54	53.212523	52.695
2	52	16.55	16.7300	16.19	16.399749	16.220
3	986	46.01	47.3400	45.78	46.869246	47.080
4	186	22.60	23.9000	22.57	23.612153	22.710
...	...	...	...	...	...	...
877	357	42.91	43.2300	42.52	42.831180	42.880
878	353	41.72	41.7200	39.92	40.241059	40.100
879	788	40.90	41.4295	40.77	41.201474	41.350
880	153	18.68	18.8000	18.47	18.602006	18.680
881	566	49.00	50.3000	48.80	49.857651	49.645

882 rows × 6 columns

show how the prediction works

print (f"the predictive closing price for 82.22(open),82.22(high),79.52(low) is {82.22*reg.coef_[0]+82.22*reg.coef_[1]+79.52*reg.coef_[2]+reg.intercept_}")

the predictive closing price for 82.22(open),82.22(high),79.52(low) is 79.9853408195805

we can see here actually the prediction is accurate becasue the difference of value between close price and open price is not too much since it’s just within one day, therefore, I need a new angle for direction

Predict close price based on opening price¶

this part I want to switch a prediction method and I decide to use opening price to predict the close price and compared with true price to compute the error

df["year"]= pd.to_datetime(df["date"]).dt.year

max_deg=5
sub_col=[]
for i in range(1,max_deg+1):
    c=f"open{i}"
    sub_col.append(c)
    df[c]=df["open"]**i
df

	date	open	high	low	close	volume	Name	year	open1	open2	open3	open4	open5
0	2013-02-08	15.07	15.12	14.63	14.75	8407500	AAL	2013	15.07	227.1049	3422.470843	5.157664e+04	7.772599e+05
1	2013-02-11	14.89	15.01	14.26	14.46	8882000	AAL	2013	14.89	221.7121	3301.293169	4.915626e+04	7.319366e+05
2	2013-02-12	14.45	14.51	14.10	14.27	8126000	AAL	2013	14.45	208.8025	3017.196125	4.359848e+04	6.299981e+05
3	2013-02-13	14.30	14.94	14.25	14.66	10259500	AAL	2013	14.30	204.4900	2924.207000	4.181616e+04	5.979711e+05
4	2013-02-14	14.94	14.96	13.16	13.99	31879900	AAL	2013	14.94	223.2036	3334.661784	4.981985e+04	7.443085e+05
...	...	...	...	...	...	...	...	...	...	...	...	...	...
1254	2018-02-01	54.00	54.64	53.59	53.88	3623078	AAL	2018	54.00	2916.0000	157464.000000	8.503056e+06	4.591650e+08
1255	2018-02-02	53.49	53.99	52.03	52.10	5109361	AAL	2018	53.49	2861.1801	153044.523549	8.186352e+06	4.378879e+08
1256	2018-02-05	51.99	52.39	49.75	49.76	6878284	AAL	2018	51.99	2702.9601	140526.895599	7.305993e+06	3.798386e+08
1257	2018-02-06	49.32	51.50	48.79	51.18	6782480	AAL	2018	49.32	2432.4624	119969.045568	5.916873e+06	2.918202e+08
1258	2018-02-07	50.91	51.98	50.89	51.40	4845831	AAL	2018	50.91	2591.8281	131949.968571	6.717573e+06	3.419916e+08

1259 rows × 13 columns

here I set max degree into 5 below for polynomial regression and a “year” column

X_train2,X_test2,y_train2,y_test2 = train_test_split(df[sub_col],df["close"],train_size=0.3,random_state=0)

mse_train_dict = {}

for i in range(1,max_deg+1):
    sub = sub_col[:i]
    reg = LinearRegression()
    reg.fit(X_train2[sub], y_train2)
    mse_train_dict[i] = mean_squared_error(reg.predict(X_train2[sub]), y_train2)
mse_train_dict

{1: 0.613737790029641,
0.6109051382778017,
0.6109024377985044,
0.6106427682807114,
0.6106371158209921}

it looks like pretty similiar becasue in original data, the opening price and the closing price difference are not too much, here we just check 3th to 6th digit since the first two digit are the same.We can see the error keeps going lower so it’s a not bad sign

mse_test_dict = {}

for i in range(1,max_deg+1):
    sub = sub_col[:i]
    reg = LinearRegression()
    reg.fit(X_train2[sub], y_train2)
    mse_test_dict[i] = mean_squared_error(reg.predict(X_test2[sub]), y_test2)
mse_test_dict

{1: 0.5299597972949177,
0.5272501941771051,
0.5272560447897813,
0.5268082561653469,
0.5268494986878971}

here we also keep just 3th to 6th digit position and we can see that after 4th degree, the error goes up a little bit, I assume that 4th degree is relatively perefect but since the overall error is small so I’m not going to add degree and count error here
Overall the relationship between overfitting and underfitting is now obvious here because the low difference of values, but the structure of prediction is ok and it can be effectively used in other datasets.

See stock price by selection of period of time¶

Here I want to show the detailed stock market when selecting certain part of stock market. The whole dataset of AAL is 5 years, but what if I only want to look certain period of stock price given certain year?

df2014=df.loc[df["year"]==2014]

here I just want to show 2014 price because I don’t want the graph of x label to be too messy

brush = alt.selection_interval(encodings=["x"])

c = alt.Chart(df2014).mark_line().encode(
    y='close',
    color=alt.Color ("Name",scale = alt.Scale(scheme = "category20"))
).properties(
    width=600,
    height=100
).add_selection(
    brush
).encode(x='date') 

c_detail = alt.Chart(df2014).mark_line().encode(
    y=alt.Y('close',scale=alt.Scale(zero=False)),
    color=alt.Color ("Name",scale = alt.Scale(scheme = "category20"))
).properties(
    width=600,
    height=400
).transform_filter(
    brush
).encode(x='date') 

c_detail&c

Comparison between two stocks¶

Here I want to see two companies’ history stock price and try to analyze the better one by comparison

alldata=pd.read_csv("all_stocks_5yr.csv")
alldata["date"]=pd.to_datetime(alldata["date"])

alldata.Name.unique()

array(['AAL', 'AAPL', 'AAP', 'ABBV', 'ABC', 'ABT', 'ACN', 'ADBE', 'ADI',
       'ADM', 'ADP', 'ADSK', 'ADS', 'AEE', 'AEP', 'AES', 'AET', 'AFL',
       'AGN', 'AIG', 'AIV', 'AIZ', 'AJG', 'AKAM', 'ALB', 'ALGN', 'ALK',
       'ALLE', 'ALL', 'ALXN', 'AMAT', 'AMD', 'AME', 'AMGN', 'AMG', 'AMP',
       'AMT', 'AMZN', 'ANDV', 'ANSS', 'ANTM', 'AON', 'AOS', 'APA', 'APC',
       'APD', 'APH', 'APTV', 'ARE', 'ARNC', 'ATVI', 'AVB', 'AVGO', 'AVY',
       'AWK', 'AXP', 'AYI', 'AZO', 'A', 'BAC', 'BAX', 'BA', 'BBT', 'BBY',
       'BDX', 'BEN', 'BF.B', 'BHF', 'BHGE', 'BIIB', 'BK', 'BLK', 'BLL',
       'BMY', 'BRK.B', 'BSX', 'BWA', 'BXP', 'CAG', 'CAH', 'CAT', 'CA',
       'CBG', 'CBOE', 'CBS', 'CB', 'CCI', 'CCL', 'CDNS', 'CELG', 'CERN',
       'CFG', 'CF', 'CHD', 'CHK', 'CHRW', 'CHTR', 'CINF', 'CI', 'CLX',
       'CL', 'CMA', 'CMCSA', 'CME', 'CMG', 'CMI', 'CMS', 'CNC', 'CNP',
       'COF', 'COG', 'COL', 'COO', 'COP', 'COST', 'COTY', 'CPB', 'CRM',
       'CSCO', 'CSRA', 'CSX', 'CTAS', 'CTL', 'CTSH', 'CTXS', 'CVS', 'CVX',
       'CXO', 'C', 'DAL', 'DE', 'DFS', 'DGX', 'DG', 'DHI', 'DHR', 'DISCA',
       'DISCK', 'DISH', 'DIS', 'DLR', 'DLTR', 'DOV', 'DPS', 'DRE', 'DRI',
       'DTE', 'DUK', 'DVA', 'DVN', 'DWDP', 'DXC', 'D', 'EA', 'EBAY',
       'ECL', 'ED', 'EFX', 'EIX', 'EL', 'EMN', 'EMR', 'EOG', 'EQIX',
       'EQR', 'EQT', 'ESRX', 'ESS', 'ES', 'ETFC', 'ETN', 'ETR', 'EVHC',
       'EW', 'EXC', 'EXPD', 'EXPE', 'EXR', 'FAST', 'FBHS', 'FB', 'FCX',
       'FDX', 'FE', 'FFIV', 'FISV', 'FIS', 'FITB', 'FLIR', 'FLR', 'FLS',
       'FL', 'FMC', 'FOXA', 'FOX', 'FRT', 'FTI', 'FTV', 'F', 'GD', 'GE',
       'GGP', 'GILD', 'GIS', 'GLW', 'GM', 'GOOGL', 'GOOG', 'GPC', 'GPN',
       'GPS', 'GRMN', 'GS', 'GT', 'GWW', 'HAL', 'HAS', 'HBAN', 'HBI',
       'HCA', 'HCN', 'HCP', 'HD', 'HES', 'HIG', 'HII', 'HLT', 'HOG',
       'HOLX', 'HON', 'HPE', 'HPQ', 'HP', 'HRB', 'HRL', 'HRS', 'HSIC',
       'HST', 'HSY', 'HUM', 'IBM', 'ICE', 'IDXX', 'IFF', 'ILMN', 'INCY',
       'INFO', 'INTC', 'INTU', 'IPG', 'IP', 'IQV', 'IRM', 'IR', 'ISRG',
       'ITW', 'IT', 'IVZ', 'JBHT', 'JCI', 'JEC', 'JNJ', 'JNPR', 'JPM',
       'JWN', 'KEY', 'KHC', 'KIM', 'KLAC', 'KMB', 'KMI', 'KMX', 'KORS',
       'KO', 'KR', 'KSS', 'KSU', 'K', 'LB', 'LEG', 'LEN', 'LH', 'LKQ',
       'LLL', 'LLY', 'LMT', 'LNC', 'LNT', 'LOW', 'LRCX', 'LUK', 'LUV',
       'LYB', 'L', 'MAA', 'MAC', 'MAR', 'MAS', 'MAT', 'MA', 'MCD', 'MCHP',
       'MCK', 'MCO', 'MDLZ', 'MDT', 'MET', 'MGM', 'MHK', 'MKC', 'MLM',
       'MMC', 'MMM', 'MNST', 'MON', 'MOS', 'MO', 'MPC', 'MRK', 'MRO',
       'MSFT', 'MSI', 'MS', 'MTB', 'MTD', 'MU', 'MYL', 'M', 'NAVI', 'NBL',
       'NCLH', 'NDAQ', 'NEE', 'NEM', 'NFLX', 'NFX', 'NI', 'NKE', 'NLSN',
       'NOC', 'NOV', 'NRG', 'NSC', 'NTAP', 'NTRS', 'NUE', 'NVDA', 'NWL',
       'NWSA', 'NWS', 'OKE', 'OMC', 'ORCL', 'ORLY', 'OXY', 'O', 'PAYX',
       'PBCT', 'PCAR', 'PCG', 'PCLN', 'PDCO', 'PEG', 'PEP', 'PFE', 'PFG',
       'PGR', 'PG', 'PHM', 'PH', 'PKG', 'PKI', 'PLD', 'PM', 'PNC', 'PNR',
       'PNW', 'PPG', 'PPL', 'PRGO', 'PRU', 'PSA', 'PSX', 'PVH', 'PWR',
       'PXD', 'PX', 'PYPL', 'QCOM', 'QRVO', 'RCL', 'REGN', 'REG', 'RE',
       'RF', 'RHI', 'RHT', 'RJF', 'RL', 'RMD', 'ROK', 'ROP', 'ROST',
       'RRC', 'RSG', 'RTN', 'SBAC', 'SBUX', 'SCG', 'SCHW', 'SEE', 'SHW',
       'SIG', 'SJM', 'SLB', 'SLG', 'SNA', 'SNI', 'SNPS', 'SO', 'SPGI',
       'SPG', 'SRCL', 'SRE', 'STI', 'STT', 'STX', 'STZ', 'SWKS', 'SWK',
       'SYF', 'SYK', 'SYMC', 'SYY', 'TAP', 'TDG', 'TEL', 'TGT', 'TIF',
       'TJX', 'TMK', 'TMO', 'TPR', 'TRIP', 'TROW', 'TRV', 'TSCO', 'TSN',
       'TSS', 'TWX', 'TXN', 'TXT', 'T', 'UAA', 'UAL', 'UA', 'UDR', 'UHS',
       'ULTA', 'UNH', 'UNM', 'UNP', 'UPS', 'URI', 'USB', 'UTX', 'VAR',
       'VFC', 'VIAB', 'VLO', 'VMC', 'VNO', 'VRSK', 'VRSN', 'VRTX', 'VTR',
       'VZ', 'V', 'WAT', 'WBA', 'WDC', 'WEC', 'WFC', 'WHR', 'WLTW', 'WMB',
       'WMT', 'WM', 'WRK', 'WU', 'WYNN', 'WYN', 'WY', 'XEC', 'XEL',
       'XLNX', 'XL', 'XOM', 'XRAY', 'XRX', 'XYL', 'YUM', 'ZBH', 'ZION',
       'ZTS'], dtype=object)

we can see there are many stocks in the stock market, here we choose two famous one, facebook and apple

fb=alldata[alldata["Name"]=="FB"]
apl=alldata[alldata["Name"]=="AAPL"]
twostk = fb[['date','close']].merge(apl[['date','close']], on = 'date')
twostk.head()

	date	close_x	close_y
0	2013-02-08	28.5450	67.8542
1	2013-02-11	28.2600	68.5614
2	2013-02-12	27.3700	66.8428
3	2013-02-13	27.9075	66.7156
4	2013-02-14	28.5000	66.6556

combine and merge Apple and Facebook stock together, but it now enough and clear for us to see the relationship of them based on time series

import plotly.graph_objs as go

trace0 = go.Scatter(x=apl.date, y=apl.close, name='Apple')
trace1 = go.Scatter(x=fb.date, y=fb.close, name='Facebook')

import matplotlib.pyplot as plt
import plotly.express as px

fig = go.Figure()
fig.add_trace(trace0)
fig.add_trace(trace1)
fig.update_layout(
    title={'text' :"Apple vs Facebook Stock Price"})


fig.show()

We can see here at the 2013, Facebook actually had lower price than apple, but after 2016, Facebook suppressed Apple in terms of price and hold the lead for most of following lwing time

However, we cannot say Facebook has larger scale or larger profitability than Apple, because I think the reason why Apple cannot beat Facebook in terms of stock price is because Apple, as a giant company, has more stock shares so it makes the individual stock share price not too high

Summary¶

In this project I try to used what I learned in math10 about Pandas to find sub dataset and re-incorporate dataset, and use Machinelearning to predict the close price by (open,high low) price and polynomial open price and also compute the error. Also I use the Altair bursh and transform to plot a flexible graph so that I can see the detailed close price for a certain period. Last, I use Plotly as extra topic to plot the history AAL price and compare two famous companies : Apple and Facebook

References¶

What is the source of your dataset(s)?

My source downloaded from Kaggle and the name is S&P 500 stock data https://www.kaggle.com/datasets/camnugent/sandp500/code?datasetId=1908&sortBy=voteCount

Were any portions of the code or ideas taken from another source? List those sources here and say how they were used. the idea of comparing two famous stocks based on time series is adapted from https://www.kaggle.com/code/hakkisimsek/plotly-tutorial-3

The codes about Plotly graphing including go.scatter and matplotlib are learned from
https://plotly.com/python/line-charts/

The idea of predicting close price is adapted from
https://christopherdavisuci.github.io/UCI-Math-10-W22/Proj/StudentProjects/YufeiRen.html

Created in Deepnote

UC Irvine Math 10 S22

Project Title :Stock Price of Famous Companies

Contents