Energy Consumption#
Author: Adrian Guerra
Course Project, UC Irvine, Math 10, S23
Introduction#
In my project I'll be using Altair, linear regression, and pandas to take in different datasets about hourly energy consumption. I will use older data to try to predict the MW usage for more recent years, and I will also test the prediction by splitting the data between two different time periods and seeing how the prediction compares to the real values. I think this will be an interesting idea since the prediction can't take into account things like COVID and how that shifts energy consumption.
Energy consumption#
import pandas as pd
from sklearn.linear_model import LinearRegression
import altair as alt
import numpy as np
from sklearn.model_selection import train_test_split
csv_files = ['AEP_hourly.csv', 'COMED_hourly.csv', 'DAYTON_hourly.csv','DEOK_hourly.csv','DOM_hourly.csv','DUQ_hourly.csv','EKPC_hourly.csv','FE_hourly.csv','NI_hourly.csv','pjm_hourly_est.csv','PJM_Load_hourly.csv','PJME_hourly.csv','PJMW_hourly.csv']
# DataFrame.append was removed in pandas 2.0, so collect the frames and concat instead
dfs_list = []
for file in csv_files:
    df = pd.read_csv(file)
    dfs_list.append(df)
dfs = pd.concat(dfs_list, ignore_index=True)
dfs['Datetime'] = pd.to_datetime(dfs['Datetime'])
dfs['Year'] = dfs['Datetime'].dt.year
In the previous cells I was importing and combining the different CSVs in order to get everything in one place for simplicity's sake. Here I am preparing the DataFrame, using pandas to separate the date into a 'Year' column for later.
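As a quick sanity check of the date handling above, here is a minimal sketch on a toy frame (the column names are made up to resemble the real CSVs, not taken from them):

```python
import pandas as pd

# Two fake hourly readings spanning a year boundary
toy = pd.DataFrame({
    "Datetime": ["2014-12-31 23:00:00", "2015-01-01 00:00:00"],
    "AEP_MW": [13478.0, 12865.0],
})
toy["Datetime"] = pd.to_datetime(toy["Datetime"])  # parse strings into timestamps
toy["Year"] = toy["Datetime"].dt.year              # extract the year component
print(toy["Year"].tolist())  # [2014, 2015]
```

This confirms each hourly row picks up the year it belongs to, which is what the groupby in the next cell relies on.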
dfs = dfs.fillna(0)
# numeric_only avoids trying to sum the Datetime column
yearly_mw = dfs.groupby('Year').sum(numeric_only=True)
dfs2 = yearly_mw.reset_index()
tf = dfs2[(dfs2['Year'] >= 1998) & (dfs2['Year'] <= 2014)]
There are various things I'm doing here: filling in the NaN values, grouping what's in dfs by year, and resetting the index so that 'Year' becomes a column in dfs2, which holds the combined MW used in each column for each year. Finally I create a time frame (tf) in order to make a prediction and compare it to the actual usage.
df_melt = tf.melt(id_vars='Year', var_name='CSV File', value_name='MW')
chart = alt.Chart(df_melt).mark_line().encode(
    x='Year',
    y='MW:Q',
    color='CSV File:N',
    tooltip=['MW:Q', 'CSV File:N']
).properties(
    width=600,
    height=400,
    title='Yearly Usage 1998-2014'
)
chart
Here I am melting the time frame in order to properly graph each of the columns (the MW per area), then graphing up through 2014.
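To illustrate what the melt is doing, here is a tiny sketch with made-up numbers (not the real CSV values): it reshapes one column per region into one row per (Year, region) pair, which is the long shape Altair's color encoding needs.

```python
import pandas as pd

# Wide: one column per region
wide = pd.DataFrame({"Year": [1998, 1999], "AEP_MW": [10.0, 11.0], "DOM_MW": [5.0, 6.0]})

# Long: one row per (Year, region) pair
long_df = wide.melt(id_vars="Year", var_name="CSV File", value_name="MW")
print(long_df)  # 4 rows, columns: Year, CSV File, MW
```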
hist = df_melt[df_melt['Year'] <= 2014]
future = dfs2[dfs2['Year'] >= 2015]
X = hist[['Year']]
y = hist['MW']
reg = LinearRegression()
reg.fit(X, y)
pred = reg.predict(future[['Year']])
dfs2.loc[dfs2['Year'] >= 2015, 'MW_Predicted'] = pred
Here I am fitting a linear regression and predicting in order to get predicted MW numbers for the missing years I'm trying to predict, specifically the years after 2014.
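The train_test_split imported at the top could also be used to check how well this kind of fit generalizes before trusting its forecasts. Here is a minimal sketch on synthetic yearly totals (the numbers are invented stand-ins for tf, not the real data):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic yearly totals: a linear trend plus a little noise
rng = np.random.default_rng(0)
years = np.arange(1998, 2015)
demo = pd.DataFrame({
    "Year": years,
    "MW": 1000 + 20 * (years - 1998) + rng.normal(0, 5, len(years)),
})

# Hold out a quarter of the years, fit on the rest, and score on the held-out years
X_train, X_test, y_train, y_test = train_test_split(
    demo[["Year"]], demo["MW"], test_size=0.25, random_state=0
)
reg = LinearRegression().fit(X_train, y_train)
print(f"Held-out R^2: {reg.score(X_test, y_test):.3f}")
```

A high held-out R^2 here only says the fit captures the trend in the data it saw; it cannot anticipate shifts like the one the later charts reveal.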
X_pred = pd.DataFrame({'Year': np.arange(2014, 2019)})
y_pred = reg.predict(X_pred)
years = np.arange(2014, 2019)
pred_df = pd.DataFrame({'Year': years, 'MW_Predicted': y_pred})
combined = pd.concat([hist, pred_df])
combined = combined.fillna(0)  # filling with a string like 'N/A' would make the columns non-numeric
I am creating DataFrames that will then be graphed later. You might notice that X_pred runs from 2014 through 2018; this is to try to make the graph look seamless (even though, as you'll see, it is still a split, non-continuous line).
pred_chart = alt.Chart(pred_df).mark_line(color='red').encode(
    x='Year',
    y='MW_Predicted',
    tooltip=['Year', 'MW_Predicted']
).properties(
    title='Predicted MW Values'
)
average_mw = hist.groupby('Year')['MW'].mean().reset_index()
chart = alt.Chart(average_mw).mark_line().encode(
    x='Year',
    y='MW',
    tooltip=['Year', 'MW']
).properties(
    title='Average MW Usage'
)
chart2=chart + pred_chart
chart2
The last two cells make different charts that are then graphed together. The chart is actually interesting since I use the mean in order to graph. This is because I wanted to combine the different lines from above into one, to get a better idea of what the data might be showing, rather than having 14 different lines and only one prediction line.
dfs2_melt = dfs2.melt(id_vars='Year', var_name='CSV File', value_name='MW')
original_mw = dfs2_melt.groupby('Year')['MW'].mean().reset_index()
chart3 = alt.Chart(original_mw).mark_line(color='purple').encode(
    x='Year',
    y='MW',
    tooltip=['Year', 'MW']
).properties(
    title='Actual MW Usage'
)
chart3
Here I had to do another melt and reset of the index in order to fix the data once again, so that it includes a 'Year' column and collects the data from all the CSV files. These are the real values obtained from the DataFrame.
scatter_plot = alt.Chart(df_melt).mark_circle().encode(
    x='Year:O',
    y='MW',
    color='CSV File',
    size='MW',
    tooltip=['Year', 'MW', 'CSV File']
).properties(
    title='MW Usage by Year Up Until 2014'
)
scatter_plot
This is just a graph I thought would look very interesting, since some places used much more energy than others. I believe we did something like this in class once.
chart4 = chart3+chart2
chart4
This shows the final result. The prediction (red) projected a general increase in energy consumption (which is what I expected as well), but there was a sharp decrease (purple) for some reason I am unaware of. I think it's interesting since I do not remember anything particularly important happening in 2018. (One possibility worth checking is that the dataset does not cover all of 2018, which would make the 2018 yearly sum artificially low.)
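One way to make the gap between the red and purple lines concrete is to compute a percentage error per year. The numbers below are hypothetical, just to show the calculation; the real values would come from dfs2 and pred_df.

```python
import pandas as pd

# Hypothetical predicted vs. actual yearly totals (illustration only)
compare = pd.DataFrame({
    "Year": [2015, 2016, 2017, 2018],
    "MW_Predicted": [1500.0, 1520.0, 1540.0, 1560.0],
    "MW_Actual":    [1490.0, 1510.0, 1530.0, 1100.0],  # sharp drop in the final year
})

# Signed percentage error: positive means the prediction overshoots the actual value
compare["PctError"] = 100 * (compare["MW_Predicted"] - compare["MW_Actual"]) / compare["MW_Actual"]
print(compare[["Year", "PctError"]].round(1))
```

With numbers like these, the early years stay within a couple of percent while the final year's error jumps past 40%, which would quantify exactly where the linear prediction falters.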
Summary#
Here I used Altair, pandas, NumPy, and sklearn to get data about hourly MW usage. From there I limited the data using pandas in order to test the prediction from sklearn. I then used a variety of Altair graphs to show where the prediction may have faltered and to give a possible explanation as to why.
References#
Dataset source (Kaggle): https://www.kaggle.com/datasets/robikscube/hourly-energy-consumption
Other references that I found helpful:

ChatGPT helped me figure out how to get all the CSV files together:

csv_files = ['file1.csv', 'file2.csv', 'file3.csv']
dfs = []
for file in csv_files:
    df = pd.read_csv(file)
    dfs.append(df)

Another helpful snippet from ChatGPT, for getting the years melted:

melted_df = combined_df.melt(id_vars='Year', var_name='CSV File', value_name='MW')

I used this line to get an idea of how to flip the y-axis, since there was an issue with the y values being upside down:

y=alt.Y('Negative_Value:Q', title='Negative Value'),
I also got some ideas about splitting the data and comparing against the real years from Chris, which was very appreciated.