# Netflix Stock Price Prediction

Author: Jiayu Wang

Student ID: 74613921

Course Project, UC Irvine, Math 10, W22

## Introduction

Stock market prediction attempts to determine the future value of a company's stock. This project applies linear regression to historical data to help predict future stock values.



## Section 1: Clean Dataset


In [None]:
import pandas as pd
import altair as alt 

In [None]:
nfstocks = pd.read_csv("/work/netflix_stock.csv")
nfstocks.dropna(inplace=True)

In [None]:
nfstocks

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2002-05-23,1.156429,1.242857,1.145714,1.196429,1.196429,104790000
1,2002-05-24,1.214286,1.225000,1.197143,1.210000,1.210000,11104800
2,2002-05-28,1.213571,1.232143,1.157143,1.157143,1.157143,6609400
3,2002-05-29,1.164286,1.164286,1.085714,1.103571,1.103571,6757800
4,2002-05-30,1.107857,1.107857,1.071429,1.071429,1.071429,10154200
...,...,...,...,...,...,...,...
4940,2022-01-05,592.000000,592.840027,566.880005,567.520020,567.520020,4148700
4941,2022-01-06,554.340027,563.359985,542.010010,553.289978,553.289978,5711800
4942,2022-01-07,549.460022,553.429993,538.219971,541.059998,541.059998,3381700
4943,2022-01-10,538.489990,543.690002,526.320007,539.849976,539.849976,4486100


We only need the "Date" and "Adj Close" columns.

In [None]:
nfstocks['Date'] = pd.to_datetime(nfstocks['Date'])

Here we use the adjusted closing price, "Adj Close." The adjusted closing price amends a stock's closing price to reflect that stock's value after accounting for any corporate actions. The closing price is the raw price: the cash value of the last transacted price before the market closes.

We use adjusted closing prices because they fully incorporate any splits, dividends, spin-offs, and other distributions made by the company.
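As a rough illustration of what "adjusted" means, the sketch below back-adjusts raw closing prices for a single stock split. The prices, the split ratio, and the helper name `back_adjust_for_split` are hypothetical and are not taken from the Netflix dataset. Real adjusted-close series also account for dividends and other distributions, which this sketch ignores.

```python
def back_adjust_for_split(closes, split_index, ratio):
    """Divide every close before `split_index` by `ratio` so the series
    is comparable across the split (hypothetical helper, splits only)."""
    return [
        price / ratio if i < split_index else price
        for i, price in enumerate(closes)
    ]

# Hypothetical raw closes: a 2-for-1 split takes effect at index 2,
# so the raw price roughly halves even though the company's value did not.
raw_closes = [100.0, 104.0, 53.0, 54.0]
adjusted = back_adjust_for_split(raw_closes, split_index=2, ratio=2)
print(adjusted)  # [50.0, 52.0, 53.0, 54.0]
```

Without the adjustment, the drop from 104 to 53 would look like a large loss, which would distort any trend fitted to the raw prices.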

In [None]:
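# .copy() gives an independent DataFrame, so adding columns later does not
# trigger pandas' SettingWithCopyWarning
nfstocks1 = nfstocks[['Date','Adj Close']].copy()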

Before using linear regression to predict the future trend, we want to see how the stock price has moved over the past 20 years.

In [None]:
from altair import Chart
import numpy as np

In [None]:
nearest = alt.selection(type='single', nearest=True, on='mouseover',
                        fields=["Date"], empty='none')

# The basic line
line = alt.Chart().mark_line(interpolate='basis').encode(
    alt.X('Date:T', axis=alt.Axis(title='')),
    alt.Y('Adj Close:Q', axis=alt.Axis(title='', format='$f'))
).properties(title="Stock Price")

selectors = alt.Chart().mark_point().encode(
    x="Date:T",
    opacity=alt.value(0),
).add_selection(
    nearest
)

# Draw points 
points = line.mark_point().encode(
    opacity=alt.condition(nearest, alt.value(1), alt.value(0))
)

text = line.mark_text(align='left', dx=5, dy=-5).encode(
    text=alt.condition(nearest, 'Adj Close:Q', alt.value(' '))
)

# Draw a rule
rules = alt.Chart().mark_rule(color="gray").encode(
    x="Date:T",
).transform_filter(
    nearest
)


# The selection is already attached to `selectors`, so it is not added again here
stockChart = alt.layer(selectors, line, points, rules, text, data=nfstocks1)

In [None]:
stockChart

This graph shows the price change over a span of about 20 years, from about 1.20 dollars in May 2002 to about 540 dollars in January 2022. We can see a clear long-term upward trend in the price.

Next, we will use linear regression to help us predict the future price trend based on historical data.

## Section 2: Exponential Moving Average (EMA)

In predicting stock prices, technical analysis is crucial.

* Technical analysis indicators help determine support and resistance levels, which helps indicate whether prices have dropped lower or climbed higher.

* Technical indicators are heuristic or pattern-based signals produced by the price, volume, and/or open interest of a security or contract, used by traders who follow technical analysis.

Based on some online research, we will add an exponential moving average to our existing dataset.

* The exponential moving average is a type of moving average that places greater weight and significance on the most recent data points.
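The weighting described above can be sketched in plain Python. This is a minimal illustration, assuming the recursive form that pandas documents for `ewm(span=..., adjust=False)`: the smoothing factor is `alpha = 2 / (span + 1)`, the first EMA value equals the first price, and each later value blends the new price with the previous EMA. The function name `ema` and the sample prices are made up for this example.

```python
def ema(prices, span):
    """Exponential moving average of a non-empty list of prices.

    ema[0] = prices[0]
    ema[t] = alpha * prices[t] + (1 - alpha) * ema[t - 1]
    with alpha = 2 / (span + 1).
    """
    alpha = 2 / (span + 1)
    result = [prices[0]]
    for price in prices[1:]:
        result.append(alpha * price + (1 - alpha) * result[-1])
    return result

prices = [10.0, 11.0, 12.0]          # hypothetical prices
ema_values = ema(prices, span=3)     # span=3 gives alpha = 0.5
print(ema_values)  # [10.0, 10.5, 11.25]
```

A smaller `span` gives a larger `alpha`, so the most recent price has more influence on the result.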

In [None]:
import pandas as pd 

# 20-day exponential moving average of the adjusted close
nfstocks1['EMA'] = nfstocks1['Adj Close'].ewm(span=20, min_periods=0, adjust=False, ignore_na=False).mean()



Short-term traders usually rely on 12- to 26-day EMAs. The EMA reacts more quickly to price swings than the simple moving average (SMA), but it still lags quite a bit over longer periods.
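To make the comparison with the SMA concrete, here is a small sketch on made-up prices that jump from 10 to 20. The helper names `sma` and `ema`, the window of 3, and the prices are assumptions chosen for illustration. The EMA uses the same recursive form as `ewm(adjust=False)` in pandas.

```python
def sma(prices, window):
    """Simple moving average; positions without a full window are None."""
    out = []
    for i in range(len(prices)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(prices[i + 1 - window : i + 1]) / window)
    return out

def ema(prices, span):
    """Recursive exponential moving average with alpha = 2 / (span + 1)."""
    alpha = 2 / (span + 1)
    out = [prices[0]]
    for price in prices[1:]:
        out.append(alpha * price + (1 - alpha) * out[-1])
    return out

prices = [10.0, 10.0, 10.0, 20.0, 20.0]  # hypothetical jump at index 3
sma_values = sma(prices, window=3)
ema_values = ema(prices, span=3)

jump = 3
# On the day of the jump, the EMA has moved further toward the new price.
print(sma_values[jump], ema_values[jump])
```

On the day of the jump the EMA is 15.0 while the 3-day SMA is about 13.33, so the EMA has covered more of the distance to the new price level.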

To check the accuracy of the EMA calculation, we looked up the EMA for Netflix on 2021-06-03: the MarketWatch website shows 499.77 USD, which matches our calculation.

In [None]:
c1 = Chart(nfstocks1).mark_line().encode(
    x = "Date",
    y = "Adj Close"
).properties(title="Stock Price")

In [None]:
c2 = Chart(nfstocks1).mark_line(color ="red").encode(
    x = "Date",
    y = "EMA"
)

In [None]:
c1 + c2

Now we are going to develop our regression model and see how effective the EMA is at predicting the price of the stock.

We first use an 80/20 split to divide our data into training and test sets.
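For time-ordered data such as stock prices, one alternative to a shuffled 80/20 split is a chronological split, where the test set is strictly later than the training set. The sketch below is not the split used in this project (scikit-learn's `train_test_split` shuffles rows by default). It is a plain-Python illustration with a hypothetical helper `chronological_split` and made-up rows.

```python
def chronological_split(rows, test_fraction=0.2):
    """Split time-ordered rows so that the last `test_fraction` of them
    form the test set (hypothetical helper, no shuffling)."""
    n_test = round(len(rows) * test_fraction)
    split_at = len(rows) - n_test
    return rows[:split_at], rows[split_at:]

rows = list(range(10))               # stand-in for 10 trading days, oldest first
train_rows, test_rows = chronological_split(rows)
print(train_rows)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(test_rows)   # [8, 9]
```

With scikit-learn, a similar effect can be obtained by passing `shuffle=False` to `train_test_split`, assuming the rows are already sorted by date.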

## Section 3: Linear Regression

In [None]:
from sklearn.model_selection import train_test_split


In [None]:
# Feature: adjusted close; target: EMA. 80% of the rows train the model, 20% test it.
X_train, X_test, y_train, y_test = train_test_split(
    nfstocks1[['Adj Close']], nfstocks1[['EMA']], test_size=0.2
)

In [None]:
#test set
print(X_test.describe())

        Adj Close
count  989.000000
mean   120.458023
std    174.670201
min      0.459286
25%      3.940000
50%     25.214287
75%    163.070007
max    691.690002


In [None]:
#training set
print(X_train.describe())

         Adj Close
count  3956.000000
mean    110.194928
std     165.076349
min       0.372857
25%       3.920893
50%      23.812143
75%     130.042499
max     682.609985


Training Model

In [None]:
from sklearn.linear_model import LinearRegression 

reg = LinearRegression()

In [None]:
reg.fit(X_train, y_train)

LinearRegression()
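With a single feature, an ordinary least-squares fit reduces to a closed-form slope and intercept. The sketch below works through that formula in plain Python on made-up points. The helper name `fit_line` and the data are hypothetical, and this is only an illustration of the idea, not scikit-learn's internal implementation.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one feature.

    slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
    intercept = mean_y - slope * mean_x
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1.0, 2.0, 3.0, 4.0]            # hypothetical feature values
ys = [3.0, 5.0, 7.0, 9.0]            # generated from y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

After fitting, the corresponding values for the model above can be read from `reg.coef_` and `reg.intercept_`.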

In [None]:
y_pred = reg.predict(X_test)

Now we use the mean absolute error and the coefficient of determination to examine how well this model fits.

In [None]:
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score

In [None]:
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))
print("Coefficient of Determination:", r2_score(y_test, y_pred))

Mean Absolute Error: 4.544997954933673
Coefficient of Determination: 0.9969900789763098


The mean absolute error (MAE) is the sum of the absolute errors over all observed values divided by the total number of observations. Therefore, the lower the MAE, the better.

For the coefficient of determination (R-squared), a value of 1 or 0 indicates that the regression line represents all or none of the data, respectively. Therefore, we want our coefficient to be higher (closer to 1.0), since that indicates a better fit to the observations.
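Both metrics are simple enough to compute by hand, which can help in checking the intuition. The sketch below uses made-up values and hypothetical helper names (`mae_manual`, `r2_manual`). R-squared is computed as one minus the ratio of the residual sum of squares to the total sum of squares.

```python
def mae_manual(y_true, y_pred):
    """Mean absolute error: average of |observed - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0]             # hypothetical observed values
y_pred = [1.0, 2.0, 4.0]             # hypothetical predictions, one off by 1

mae = mae_manual(y_true, y_pred)     # (0 + 0 + 1) / 3
r2 = r2_manual(y_true, y_pred)       # 1 - 1 / 2
print(mae, r2)
```

Here a single miss of 1 unit gives an MAE of one third and an R-squared of 0.5. A perfect prediction would give 0 and 1.0.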

Based on the output above, both the MAE and the R-squared indicate that our regression is a good fit.

Now we want to use a graph to show the observed values and the predicted values.

In [None]:
df1 = pd.DataFrame(nfstocks1["Adj Close"].iloc[500:510])

In [None]:
# Assign the Series directly; pandas aligns it with df1 on the index
df1["Date"] = nfstocks1['Date'].iloc[500:510]

In [None]:

c3 = alt.Chart(df1).mark_circle().encode(
    x = alt.X("Date", scale=alt.Scale(zero=False)),
    y = alt.Y("Adj Close", scale=alt.Scale(zero=False))
)

In [None]:
nfstocks1["pred"] = reg.predict(nfstocks1[["Adj Close"]])



In [None]:
# Assign the Series directly; pandas aligns it with df1 on the index
df1["prediction"] = nfstocks1['pred'].iloc[500:510]

In [None]:
c4 = alt.Chart(df1).mark_line(color="red").encode(
    x = alt.X("Date", scale=alt.Scale(zero=False)),
    y = alt.Y("prediction", scale=alt.Scale(zero=False))
)

In [None]:
c3+c4

## Section 4: Simulation to Test the Model

We have already developed and trained the model on historical pricing data. Now we want to use the model with the EMA of any given day to predict the close price.

We also want to use a simple trading strategy: if our predicted value of the stock is higher than its open value, we will consider trading. However, if our predicted stock price is equal to or lower than the open value, we will not trade.

We input some data that we already have to test this. (Here I chose data from consecutive days, since the prediction of the current stock price is based on the EMA value from the day before.)

In [None]:
df2 = pd.DataFrame(nfstocks["Date"].iloc[4900:4944])
df2['Open'] = nfstocks['Open'].iloc[4900:4944]
df2['Adj Close'] = nfstocks['Adj Close'].iloc[4900:4944]
df2['EMA'] = nfstocks1['EMA'].iloc[4900:4944]

Predicted Value

In [None]:
df2['predict'] = reg.predict(df2[["EMA"]].values)


In [None]:
# Build the trade conditions (approach from the "Conditions" link in References)
conditions = [df2['predict'] > df2['Open'], df2['predict'] < df2['Open']]
choices = ['Trade', 'Not Trade']

In [None]:
df2['Trade Decision'] = np.select(conditions, choices, default='Not Trade')

Now we want to create a separate column with the potential loss or earning that we can make.

In [None]:
df2['Earning'] = df2["predict"]-df2["Open"]

Before providing a direct view of potential earnings, you can use the code below to look up the close price for a given open price.

You can follow the steps below:

In [None]:
# First, enter the open value of the day; here we take 598.179993
# as an example (named open_price to avoid shadowing the built-in open)
open_price = 598.179993

In [None]:
# Second, run this block to look up the adjusted close price recorded in df2 for that open price
close = df2['Adj Close'].where(df2['Open'] == open_price).dropna().values[0]
print(f'If you have an open price of ${open_price}, the adjusted close price in the data is ${close}')

If you have an open price of $598.179993, the adjusted close price in the data is $605.039978


In [None]:
import matplotlib as mpl 

To give a more direct view of potential earnings, we highlight each earning based on its value: negative earnings are shown in red.

In [None]:
def style_negative(v, props='color:red;'):
    return props if v < 0 else None

In [None]:
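# Keep the styled table in its own variable so that df2 remains a DataFrame
styled = df2.style.applymap(style_negative, subset=["Earning"])

In [None]:
def highlight_max(s, props = ''):
    # Return `props` for the cell(s) holding the column maximum, '' elsewhere
    return np.where(s == np.nanmax(s.values), props, '')

In [None]:
styled.apply(highlight_max, props='color:white;background-color:darkblue')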

Unnamed: 0,Date,Open,Adj Close,EMA,predict,Trade Decision,Earning
4900,2021-11-08 00:00:00,650.289978,651.450012,654.898466,648.444913,Not Trade,-1.845065
4901,2021-11-09 00:00:00,653.700012,655.98999,655.002421,648.547848,Not Trade,-5.152164
4902,2021-11-10 00:00:00,653.01001,646.909973,654.231712,647.784696,Not Trade,-5.225314
4903,2021-11-11 00:00:00,650.23999,657.580017,654.550598,648.100455,Not Trade,-2.139535
4904,2021-11-12 00:00:00,660.01001,682.609985,657.222921,650.746573,Not Trade,-9.263437
4905,2021-11-15 00:00:00,681.23999,679.330017,659.328358,652.831365,Not Trade,-28.408625
4906,2021-11-16 00:00:00,678.27002,687.400024,662.00185,655.478641,Not Trade,-22.791379
4907,2021-11-17 00:00:00,690.0,691.690002,664.829293,658.278359,Not Trade,-31.721641
4908,2021-11-18 00:00:00,691.609985,682.02002,666.466505,659.899517,Not Trade,-31.710468
4909,2021-11-19 00:00:00,692.349976,678.799988,667.641123,661.062616,Not Trade,-31.28736


Based on the table above, we can see the highest Netflix stock price in the selected 44 days highlighted. We also highlighted the earnings to show which days are recommended for trading, along with their potential earnings.

## Summary

In this project, we first cleaned and organized the dataset. With the help of the graph, we were able to see the trend of the stock's performance from 2002 to early 2022. Later, we added another factor, the exponential moving average (EMA), to our dataset and used linear regression to train a model that helps us predict future stock prices. In the end, we used some data from the original dataset and ran simulations to test the effectiveness and application of this model.

## References
* Dataset: Kaggle Dataset

* EMA: https://stackoverflow.com/questions/48775841/pandas-ema-not-matching-the-stocks-ema

* Conditions: https://www.statology.org/compare-two-columns-in-pandas/

* altair chart: https://altair-viz.github.io/gallery/multiline_tooltip.html
