Netflix Stock Price#

Author: Huangxiao Zhang

Course Project, UC Irvine, Math 10, F22

Introduction#

In this project, I predict the Netflix stock price, since I am a big fan of the company. The data are the Netflix stock prices from 2002 to 2021. I fit a linear regression model and a decision tree model, and compare their predicted values with the true values.

Importing data#

import numpy as np
import pandas as pd
import altair as alt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import seaborn as sns
df = pd.read_csv('netflix.csv')

Sorting Data#

Original Dataset#

df
Date High Low Open Close Volume Adj Close
0 2002-05-23 1.242857 1.145714 1.156429 1.196429 104790000.0 1.196429
1 2002-05-24 1.225000 1.197143 1.214286 1.210000 11104800.0 1.210000
2 2002-05-28 1.232143 1.157143 1.213571 1.157143 6609400.0 1.157143
3 2002-05-29 1.164286 1.085714 1.164286 1.103571 6757800.0 1.103571
4 2002-05-30 1.107857 1.071429 1.107857 1.071429 10154200.0 1.071429
... ... ... ... ... ... ... ...
4876 2021-10-05 640.390015 606.890015 606.940002 634.809998 9534300.0 634.809998
4877 2021-10-06 639.869995 626.359985 628.179993 639.099976 4580400.0 639.099976
4878 2021-10-07 646.840027 630.450012 642.229980 631.849976 3556900.0 631.849976
4879 2021-10-08 643.799988 630.859985 634.169983 632.659973 3271100.0 632.659973
4880 2021-10-11 639.419983 626.780029 633.200012 627.039978 2861200.0 627.039978

4881 rows × 7 columns

Rename Adj Close to Adjusted Closing Price. (The adjusted closing price is the closing price adjusted for corporate actions such as stock splits and dividends.)

df = df.rename(columns={'Adj Close' : 'Adjusted Closing Price'}) 

Change the type of the Date column from object to datetime64[ns]

df["Date"] = pd.to_datetime(df["Date"])

Drop rows with null values

df.dropna(inplace=True)
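
Before dropping rows, it can help to check how many values are actually missing. In the tables shown here the row count is 4881 both before and after cleaning, so the call appears to have removed nothing. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the Netflix data) to show the same two cleaning steps together with a null count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Netflix data
toy = pd.DataFrame({
    "Date": ["2002-05-23", "2002-05-24", "2002-05-28"],
    "Open": [1.156429, np.nan, 1.213571],
    "Volume": [104790000.0, 11104800.0, 6609400.0],
})

# Same conversion as above: object strings become datetimes
toy["Date"] = pd.to_datetime(toy["Date"])

# Count missing values per column before removing anything
null_counts = toy.isna().sum()

# Drop every row that contains at least one missing value
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)

print(null_counts)
print("rows dropped:", rows_dropped)
```

Running `df.isna().sum()` on the real dataset in the same way would show whether `dropna` has any effect.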

Clean Dataset#

df
Date High Low Open Close Volume Adjusted Closing Price
0 2002-05-23 1.242857 1.145714 1.156429 1.196429 104790000.0 1.196429
1 2002-05-24 1.225000 1.197143 1.214286 1.210000 11104800.0 1.210000
2 2002-05-28 1.232143 1.157143 1.213571 1.157143 6609400.0 1.157143
3 2002-05-29 1.164286 1.085714 1.164286 1.103571 6757800.0 1.103571
4 2002-05-30 1.107857 1.071429 1.107857 1.071429 10154200.0 1.071429
... ... ... ... ... ... ... ...
4876 2021-10-05 640.390015 606.890015 606.940002 634.809998 9534300.0 634.809998
4877 2021-10-06 639.869995 626.359985 628.179993 639.099976 4580400.0 639.099976
4878 2021-10-07 646.840027 630.450012 642.229980 631.849976 3556900.0 631.849976
4879 2021-10-08 643.799988 630.859985 634.169983 632.659973 3271100.0 632.659973
4880 2021-10-11 639.419983 626.780029 633.200012 627.039978 2861200.0 627.039978

4881 rows × 7 columns

Seaborn and Altair Chart#

This line chart, drawn with sns.lineplot, shows the Adjusted Closing Price increasing over time.

sns.set_theme(style="whitegrid")
sns.lineplot(x="Date", y="Adjusted Closing Price", data=df).set(title='Date and Adjusted Closing Price')
[Text(0.5, 1.0, 'Date and Adjusted Closing Price')]
../../_images/HuangxiaoZhang_18_1.png

This is a bar chart of the Open price against Date. Note that the chart encodes Open on the y-axis, so it shows the opening price rather than the trading volume.

chart = alt.Chart(df).mark_bar().encode(
    x='Date',
    y='Open',
).properties(
    title='Date and Open Price'
)
chart

Build Training and Test Set#

Split the dataset into 2 parts: X includes the Highest Price, Lowest Price, Opening Price, and Volume; y is the Adjusted Closing Price.

X = df.loc[:, ["High", "Low", "Open", "Volume"]]
y = df.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)
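
As a quick check of what `test_size=0.2` means for a dataset of this size, the sketch below splits placeholder arrays with the same number of rows (4881). `X_demo` and `y_demo` are made-up stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 4881  # row count reported for the cleaned dataset

# Placeholder arrays with the same shape as X (4 features) and y
X_demo = np.arange(n_rows * 4).reshape(n_rows, 4)
y_demo = np.arange(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100
)

# Roughly 80% of the rows go to training and 20% to testing
print(X_tr.shape, X_te.shape)
```

Because the rows are daily prices, a random split places test days in between training days. A chronological split (for example, passing `shuffle=False`) is a stricter alternative, though it is not what this project uses.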

Linear Regression#

Train the model

reg = LinearRegression()
reg.fit(X_train, y_train)
LinearRegression()

Predict values with Linear Regression

linear_pred = reg.predict(X_test)

Find the mean squared error for Linear Regression

linear_mse = mean_squared_error(y_test, linear_pred)
linear_mse
2.0652666805811815
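
For reference, the mean squared error is the average of the squared differences between the true and predicted values. The sketch below computes it by hand on a few made-up numbers (`y_true` and `y_hat` are illustrative, not project data) and checks that it agrees with `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values, only to illustrate the formula
y_true = np.array([1.0, 2.0, 4.0])
y_hat = np.array([1.5, 2.0, 3.0])

# Average of squared residuals: (0.25 + 0 + 1) / 3
manual_mse = np.mean((y_true - y_hat) ** 2)

# Same quantity computed by scikit-learn
sk_mse = mean_squared_error(y_true, y_hat)

print(manual_mse, sk_mse)
```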

Decision Tree Regressor#

Train the model

dt = DecisionTreeRegressor()
dt.fit(X_train, y_train)
DecisionTreeRegressor()
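
The tree above is created without a `random_state`. scikit-learn randomly permutes the features at each split, so when several splits are equally good the fitted tree, and therefore its error, can change slightly from run to run. A minimal sketch of making the fit repeatable, using synthetic data (`X_demo` and `y_demo` are assumptions for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the training set
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo[:, 0] + 0.1 * rng.normal(size=200)

# Fixing random_state makes tie-breaking between splits deterministic
tree_a = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)
tree_b = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)

# Two trees built with the same seed give identical predictions
same = np.array_equal(tree_a.predict(X_demo), tree_b.predict(X_demo))
print(same)
```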

Predict values with the Decision Tree

dt_pred = dt.predict(X_test)

Find the mean squared error for the Decision Tree

dt_mse = mean_squared_error(y_test, dt_pred)
dt_mse
8.674851664178735
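
Mean squared error is in squared price units, which makes it hard to read directly. Taking the square root gives a root mean squared error in the same units as the price. The sketch below simply reuses the two values printed above.

```python
import math

# MSE values copied from the outputs above
linear_mse = 2.0652666805811815
dt_mse = 8.674851664178735

# RMSE is in the same units as the Adjusted Closing Price
linear_rmse = math.sqrt(linear_mse)
dt_rmse = math.sqrt(dt_mse)

# How many times larger the tree's MSE is than the linear model's
ratio = dt_mse / linear_mse

print(round(linear_rmse, 2), round(dt_rmse, 2), round(ratio, 1))
```

On this run the typical error of the linear model is therefore under 1.5 price units and that of the tree under 3, against prices that range from about 1 to over 600.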

Result#

Compare the true values and the predicted values of both models in one chart, using sns.scatterplot.

sns.set_theme()
results = pd.DataFrame({
    "Type": ["Linear"]*y_test.shape[0] + ["DT"]*y_test.shape[0], 
    "Truth": y_test.tolist() * 2,
    "Pred": linear_pred.tolist() + dt_pred.tolist()
})

sns.scatterplot(x="Truth", y="Pred", hue="Type", data=results, alpha=0.5, s=9).set(title='Truth vs. Pred for Linear and DT')
[Text(0.5, 1.0, 'Truth vs. Pred for Linear and DT')]
../../_images/HuangxiaoZhang_40_1.png
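
The long-format frame built for the scatterplot can also be used to summarize the error of each model. The sketch below uses a tiny made-up frame (`results_demo`, with the same three columns as `results`) and a hypothetical helper column `SqErr`.

```python
import pandas as pd

# Made-up frame with the same columns as `results`
results_demo = pd.DataFrame({
    "Type": ["Linear", "Linear", "DT", "DT"],
    "Truth": [10.0, 20.0, 10.0, 20.0],
    "Pred": [11.0, 19.0, 13.0, 20.0],
})

# Squared error of each prediction
results_demo["SqErr"] = (results_demo["Pred"] - results_demo["Truth"]) ** 2

# Mean squared error per model type
mse_by_model = results_demo.groupby("Type")["SqErr"].mean()
print(mse_by_model)
```

Applied to the real `results` frame, the same grouping should reproduce `linear_mse` and `dt_mse`.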

Summary#

At the beginning, I plot two charts. The first is a line chart that shows the Adjusted Closing Price increasing over time, and the second is a bar chart of the opening price over time. Next, I split the dataset into 2 parts: a random 80% of the samples as the training set and the remaining 20% as the test set. I fit a linear regression model and a decision tree model on the training set and evaluate both on the test set. The mean squared error of the linear model is 2.0652666805811815, while the mean squared error of the decision tree is 8.674851664178735 in the run shown above; because the tree is not seeded, this value changes slightly between runs. Finally, I draw a scatterplot in which the x-axis is the true Adjusted Closing Price in the test set and the y-axis is the prediction of each model. The scatterplot shows that both models perform relatively well, with the linear model having the lower error.

References#

  • Seaborn documentation: sns.lineplot, sns.scatterplot

  • Source of the dataset: Kaggle
