Project Title :Stock Price of Famous Companies

Author: Xuanru Li, xuanrul@uci.edu

Course Project, UC Irvine, Math 10, S22

Introduction

  • my dataset is dataset of all the US stock within 5 years but it’s a little big. I make three subset which including three stocks in the stock market, which are AAL, Facebook and Apple. In this project I mainly focs on visualization of stock price and make prediction by using MachineLearning.

  • There are four parts of this project, first two parts are prediction using linear regression and polynomial regression. Last two parts are visualization by using Plotly and Altair in terms of history stock price of AAL and comparisons between Apple and Facebook

Predict close price of AAL stock

Firstly, read the dataset and clean the data,becasue the dataset is too large and altair only shows max row of 5000, here I choose one stock(AAL) sample from original dataset

import numpy as np
import pandas as pd
import seaborn as sns
import altair as alt
df= pd.read_csv("all_stocks_5yr.csv")
df=df.dropna().copy()
df=df[df["Name"]=="AAL"].copy()
df.head()
date open high low close volume Name
0 2013-02-08 15.07 15.12 14.63 14.75 8407500 AAL
1 2013-02-11 14.89 15.01 14.26 14.46 8882000 AAL
2 2013-02-12 14.45 14.51 14.10 14.27 8126000 AAL
3 2013-02-13 14.30 14.94 14.25 14.66 10259500 AAL
4 2013-02-14 14.94 14.96 13.16 13.99 31879900 AAL

we can see now we drop the missing value and get the ideal AAL dataframe

Secondly, I want to use open, high and low price to predict the close price of a stock

from sklearn.linear_model import LinearRegression 
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

let us visualize the close price history first

plt.figure(figsize=(16,6))
plt.title('Close Price History')
plt.plot(df['close'])
plt.xlabel('index', fontsize=16)
plt.ylabel('Close Price USD ($)', fontsize=13)
plt.show()
../../_images/XuanruLi_11_0.png
reg = LinearRegression()
cols=["open","high","low"]
X_train,X_test,y_train,y_test = train_test_split(df[cols],df["close"],train_size=0.3,random_state=0)

I use linearegression of train set to predict test set here

reg.fit(X_train,y_train)
LinearRegression()
df_sub=X_test
df_sub["Pred"]=reg.predict(X_test)
df_sub["True"]=y_test
df_sub.reset_index()
index open high low Pred True
0 5 13.93 14.6100 13.93 14.474599 14.500
1 494 53.99 54.5200 52.54 53.212523 52.695
2 52 16.55 16.7300 16.19 16.399749 16.220
3 986 46.01 47.3400 45.78 46.869246 47.080
4 186 22.60 23.9000 22.57 23.612153 22.710
... ... ... ... ... ... ...
877 357 42.91 43.2300 42.52 42.831180 42.880
878 353 41.72 41.7200 39.92 40.241059 40.100
879 788 40.90 41.4295 40.77 41.201474 41.350
880 153 18.68 18.8000 18.47 18.602006 18.680
881 566 49.00 50.3000 48.80 49.857651 49.645

882 rows × 6 columns

show how the prediction works

print (f"the predictive closing price for 82.22(open),82.22(high),79.52(low) is {82.22*reg.coef_[0]+82.22*reg.coef_[1]+79.52*reg.coef_[2]+reg.intercept_}")
the predictive closing price for 82.22(open),82.22(high),79.52(low) is 79.9853408195805

we can see here actually the prediction is accurate becasue the difference of value between close price and open price is not too much since it’s just within one day, therefore, I need a new angle for direction

Predict close price based on opening price

this part I want to switch a prediction method and I decide to use opening price to predict the close price and compared with true price to compute the error

df["year"]= pd.to_datetime(df["date"]).dt.year
max_deg=5
sub_col=[]
for i in range(1,max_deg+1):
    c=f"open{i}"
    sub_col.append(c)
    df[c]=df["open"]**i
df
date open high low close volume Name year open1 open2 open3 open4 open5
0 2013-02-08 15.07 15.12 14.63 14.75 8407500 AAL 2013 15.07 227.1049 3422.470843 5.157664e+04 7.772599e+05
1 2013-02-11 14.89 15.01 14.26 14.46 8882000 AAL 2013 14.89 221.7121 3301.293169 4.915626e+04 7.319366e+05
2 2013-02-12 14.45 14.51 14.10 14.27 8126000 AAL 2013 14.45 208.8025 3017.196125 4.359848e+04 6.299981e+05
3 2013-02-13 14.30 14.94 14.25 14.66 10259500 AAL 2013 14.30 204.4900 2924.207000 4.181616e+04 5.979711e+05
4 2013-02-14 14.94 14.96 13.16 13.99 31879900 AAL 2013 14.94 223.2036 3334.661784 4.981985e+04 7.443085e+05
... ... ... ... ... ... ... ... ... ... ... ... ... ...
1254 2018-02-01 54.00 54.64 53.59 53.88 3623078 AAL 2018 54.00 2916.0000 157464.000000 8.503056e+06 4.591650e+08
1255 2018-02-02 53.49 53.99 52.03 52.10 5109361 AAL 2018 53.49 2861.1801 153044.523549 8.186352e+06 4.378879e+08
1256 2018-02-05 51.99 52.39 49.75 49.76 6878284 AAL 2018 51.99 2702.9601 140526.895599 7.305993e+06 3.798386e+08
1257 2018-02-06 49.32 51.50 48.79 51.18 6782480 AAL 2018 49.32 2432.4624 119969.045568 5.916873e+06 2.918202e+08
1258 2018-02-07 50.91 51.98 50.89 51.40 4845831 AAL 2018 50.91 2591.8281 131949.968571 6.717573e+06 3.419916e+08

1259 rows × 13 columns

here I set max degree into 5 below for polynomial regression and a “year” column


X_train2,X_test2,y_train2,y_test2 = train_test_split(df[sub_col],df["close"],train_size=0.3,random_state=0)
mse_train_dict = {}

for i in range(1,max_deg+1):
    sub = sub_col[:i]
    reg = LinearRegression()
    reg.fit(X_train2[sub], y_train2)
    mse_train_dict[i] = mean_squared_error(reg.predict(X_train2[sub]), y_train2)
mse_train_dict
{1: 0.613737790029641,
 2: 0.6109051382778017,
 3: 0.6109024377985044,
 4: 0.6106427682807114,
 5: 0.6106371158209921}
  • it looks like pretty similiar becasue in original data, the opening price and the closing price difference are not too much, here we just check 3th to 6th digit since the first two digit are the same.We can see the error keeps going lower so it’s a not bad sign

mse_test_dict = {}

for i in range(1,max_deg+1):
    sub = sub_col[:i]
    reg = LinearRegression()
    reg.fit(X_train2[sub], y_train2)
    mse_test_dict[i] = mean_squared_error(reg.predict(X_test2[sub]), y_test2)
mse_test_dict
{1: 0.5299597972949177,
 2: 0.5272501941771051,
 3: 0.5272560447897813,
 4: 0.5268082561653469,
 5: 0.5268494986878971}
  • here we also keep just 3th to 6th digit position and we can see that after 4th degree, the error goes up a little bit, I assume that 4th degree is relatively perefect but since the overall error is small so I’m not going to add degree and count error here

  • Overall the relationship between overfitting and underfitting is now obvious here because the low difference of values, but the structure of prediction is ok and it can be effectively used in other datasets.

See stock price by selection of period of time

Here I want to show the detailed stock market when selecting certain part of stock market. The whole dataset of AAL is 5 years, but what if I only want to look certain period of stock price given certain year?

df2014=df.loc[df["year"]==2014]

here I just want to show 2014 price because I don’t want the graph of x label to be too messy

brush = alt.selection_interval(encodings=["x"])
c = alt.Chart(df2014).mark_line().encode(
    y='close',
    color=alt.Color ("Name",scale = alt.Scale(scheme = "category20"))
).properties(
    width=600,
    height=100
).add_selection(
    brush
).encode(x='date') 
c_detail = alt.Chart(df2014).mark_line().encode(
    y=alt.Y('close',scale=alt.Scale(zero=False)),
    color=alt.Color ("Name",scale = alt.Scale(scheme = "category20"))
).properties(
    width=600,
    height=400
).transform_filter(
    brush
).encode(x='date') 
c_detail&c

Comparison between two stocks

Here I want to see two companies’ history stock price and try to analyze the better one by comparison

alldata=pd.read_csv("all_stocks_5yr.csv")
alldata["date"]=pd.to_datetime(alldata["date"])
alldata.Name.unique()
array(['AAL', 'AAPL', 'AAP', 'ABBV', 'ABC', 'ABT', 'ACN', 'ADBE', 'ADI',
       'ADM', 'ADP', 'ADSK', 'ADS', 'AEE', 'AEP', 'AES', 'AET', 'AFL',
       'AGN', 'AIG', 'AIV', 'AIZ', 'AJG', 'AKAM', 'ALB', 'ALGN', 'ALK',
       'ALLE', 'ALL', 'ALXN', 'AMAT', 'AMD', 'AME', 'AMGN', 'AMG', 'AMP',
       'AMT', 'AMZN', 'ANDV', 'ANSS', 'ANTM', 'AON', 'AOS', 'APA', 'APC',
       'APD', 'APH', 'APTV', 'ARE', 'ARNC', 'ATVI', 'AVB', 'AVGO', 'AVY',
       'AWK', 'AXP', 'AYI', 'AZO', 'A', 'BAC', 'BAX', 'BA', 'BBT', 'BBY',
       'BDX', 'BEN', 'BF.B', 'BHF', 'BHGE', 'BIIB', 'BK', 'BLK', 'BLL',
       'BMY', 'BRK.B', 'BSX', 'BWA', 'BXP', 'CAG', 'CAH', 'CAT', 'CA',
       'CBG', 'CBOE', 'CBS', 'CB', 'CCI', 'CCL', 'CDNS', 'CELG', 'CERN',
       'CFG', 'CF', 'CHD', 'CHK', 'CHRW', 'CHTR', 'CINF', 'CI', 'CLX',
       'CL', 'CMA', 'CMCSA', 'CME', 'CMG', 'CMI', 'CMS', 'CNC', 'CNP',
       'COF', 'COG', 'COL', 'COO', 'COP', 'COST', 'COTY', 'CPB', 'CRM',
       'CSCO', 'CSRA', 'CSX', 'CTAS', 'CTL', 'CTSH', 'CTXS', 'CVS', 'CVX',
       'CXO', 'C', 'DAL', 'DE', 'DFS', 'DGX', 'DG', 'DHI', 'DHR', 'DISCA',
       'DISCK', 'DISH', 'DIS', 'DLR', 'DLTR', 'DOV', 'DPS', 'DRE', 'DRI',
       'DTE', 'DUK', 'DVA', 'DVN', 'DWDP', 'DXC', 'D', 'EA', 'EBAY',
       'ECL', 'ED', 'EFX', 'EIX', 'EL', 'EMN', 'EMR', 'EOG', 'EQIX',
       'EQR', 'EQT', 'ESRX', 'ESS', 'ES', 'ETFC', 'ETN', 'ETR', 'EVHC',
       'EW', 'EXC', 'EXPD', 'EXPE', 'EXR', 'FAST', 'FBHS', 'FB', 'FCX',
       'FDX', 'FE', 'FFIV', 'FISV', 'FIS', 'FITB', 'FLIR', 'FLR', 'FLS',
       'FL', 'FMC', 'FOXA', 'FOX', 'FRT', 'FTI', 'FTV', 'F', 'GD', 'GE',
       'GGP', 'GILD', 'GIS', 'GLW', 'GM', 'GOOGL', 'GOOG', 'GPC', 'GPN',
       'GPS', 'GRMN', 'GS', 'GT', 'GWW', 'HAL', 'HAS', 'HBAN', 'HBI',
       'HCA', 'HCN', 'HCP', 'HD', 'HES', 'HIG', 'HII', 'HLT', 'HOG',
       'HOLX', 'HON', 'HPE', 'HPQ', 'HP', 'HRB', 'HRL', 'HRS', 'HSIC',
       'HST', 'HSY', 'HUM', 'IBM', 'ICE', 'IDXX', 'IFF', 'ILMN', 'INCY',
       'INFO', 'INTC', 'INTU', 'IPG', 'IP', 'IQV', 'IRM', 'IR', 'ISRG',
       'ITW', 'IT', 'IVZ', 'JBHT', 'JCI', 'JEC', 'JNJ', 'JNPR', 'JPM',
       'JWN', 'KEY', 'KHC', 'KIM', 'KLAC', 'KMB', 'KMI', 'KMX', 'KORS',
       'KO', 'KR', 'KSS', 'KSU', 'K', 'LB', 'LEG', 'LEN', 'LH', 'LKQ',
       'LLL', 'LLY', 'LMT', 'LNC', 'LNT', 'LOW', 'LRCX', 'LUK', 'LUV',
       'LYB', 'L', 'MAA', 'MAC', 'MAR', 'MAS', 'MAT', 'MA', 'MCD', 'MCHP',
       'MCK', 'MCO', 'MDLZ', 'MDT', 'MET', 'MGM', 'MHK', 'MKC', 'MLM',
       'MMC', 'MMM', 'MNST', 'MON', 'MOS', 'MO', 'MPC', 'MRK', 'MRO',
       'MSFT', 'MSI', 'MS', 'MTB', 'MTD', 'MU', 'MYL', 'M', 'NAVI', 'NBL',
       'NCLH', 'NDAQ', 'NEE', 'NEM', 'NFLX', 'NFX', 'NI', 'NKE', 'NLSN',
       'NOC', 'NOV', 'NRG', 'NSC', 'NTAP', 'NTRS', 'NUE', 'NVDA', 'NWL',
       'NWSA', 'NWS', 'OKE', 'OMC', 'ORCL', 'ORLY', 'OXY', 'O', 'PAYX',
       'PBCT', 'PCAR', 'PCG', 'PCLN', 'PDCO', 'PEG', 'PEP', 'PFE', 'PFG',
       'PGR', 'PG', 'PHM', 'PH', 'PKG', 'PKI', 'PLD', 'PM', 'PNC', 'PNR',
       'PNW', 'PPG', 'PPL', 'PRGO', 'PRU', 'PSA', 'PSX', 'PVH', 'PWR',
       'PXD', 'PX', 'PYPL', 'QCOM', 'QRVO', 'RCL', 'REGN', 'REG', 'RE',
       'RF', 'RHI', 'RHT', 'RJF', 'RL', 'RMD', 'ROK', 'ROP', 'ROST',
       'RRC', 'RSG', 'RTN', 'SBAC', 'SBUX', 'SCG', 'SCHW', 'SEE', 'SHW',
       'SIG', 'SJM', 'SLB', 'SLG', 'SNA', 'SNI', 'SNPS', 'SO', 'SPGI',
       'SPG', 'SRCL', 'SRE', 'STI', 'STT', 'STX', 'STZ', 'SWKS', 'SWK',
       'SYF', 'SYK', 'SYMC', 'SYY', 'TAP', 'TDG', 'TEL', 'TGT', 'TIF',
       'TJX', 'TMK', 'TMO', 'TPR', 'TRIP', 'TROW', 'TRV', 'TSCO', 'TSN',
       'TSS', 'TWX', 'TXN', 'TXT', 'T', 'UAA', 'UAL', 'UA', 'UDR', 'UHS',
       'ULTA', 'UNH', 'UNM', 'UNP', 'UPS', 'URI', 'USB', 'UTX', 'VAR',
       'VFC', 'VIAB', 'VLO', 'VMC', 'VNO', 'VRSK', 'VRSN', 'VRTX', 'VTR',
       'VZ', 'V', 'WAT', 'WBA', 'WDC', 'WEC', 'WFC', 'WHR', 'WLTW', 'WMB',
       'WMT', 'WM', 'WRK', 'WU', 'WYNN', 'WYN', 'WY', 'XEC', 'XEL',
       'XLNX', 'XL', 'XOM', 'XRAY', 'XRX', 'XYL', 'YUM', 'ZBH', 'ZION',
       'ZTS'], dtype=object)

we can see there are many stocks in the stock market, here we choose two famous one, facebook and apple

fb=alldata[alldata["Name"]=="FB"]
apl=alldata[alldata["Name"]=="AAPL"]
twostk = fb[['date','close']].merge(apl[['date','close']], on = 'date')
twostk.head()
date close_x close_y
0 2013-02-08 28.5450 67.8542
1 2013-02-11 28.2600 68.5614
2 2013-02-12 27.3700 66.8428
3 2013-02-13 27.9075 66.7156
4 2013-02-14 28.5000 66.6556

combine and merge Apple and Facebook stock together, but it now enough and clear for us to see the relationship of them based on time series


import plotly.graph_objs as go

trace0 = go.Scatter(x=apl.date, y=apl.close, name='Apple')
trace1 = go.Scatter(x=fb.date, y=fb.close, name='Facebook')
import matplotlib.pyplot as plt
import plotly.express as px
fig = go.Figure()
fig.add_trace(trace0)
fig.add_trace(trace1)
fig.update_layout(
    title={'text' :"Apple vs Facebook Stock Price"})


fig.show()

We can see here at the 2013, Facebook actually had lower price than apple, but after 2016, Facebook suppressed Apple in terms of price and hold the lead for most of following lwing time

However, we cannot say Facebook has larger scale or larger profitability than Apple, because I think the reason why Apple cannot beat Facebook in terms of stock price is because Apple, as a giant company, has more stock shares so it makes the individual stock share price not too high

Summary

In this project I try to used what I learned in math10 about Pandas to find sub dataset and re-incorporate dataset, and use Machinelearning to predict the close price by (open,high low) price and polynomial open price and also compute the error. Also I use the Altair bursh and transform to plot a flexible graph so that I can see the detailed close price for a certain period. Last, I use Plotly as extra topic to plot the history AAL price and compare two famous companies : Apple and Facebook

References

  • What is the source of your dataset(s)?

My source downloaded from Kaggle and the name is S&P 500 stock data https://www.kaggle.com/datasets/camnugent/sandp500/code?datasetId=1908&sortBy=voteCount

The codes about Plotly graphing including go.scatter and matplotlib are learned from
https://plotly.com/python/line-charts/

The idea of predicting close price is adapted from
https://christopherdavisuci.github.io/UCI-Math-10-W22/Proj/StudentProjects/YufeiRen.html

Created in deepnote.com Created in Deepnote