Making Predictions of Underwater Ocean Temperature Using Historical Data#

Author: Pamela Martinez

Course Project, UC Irvine, Math 10, S23

Introduction#

In this project, I will analyze the behavior of ocean temperature based on historical data to understand whether ocean water temperatures are heating up or cooling down. Understanding these temperature fluctuations is important because ocean temperature is what helps the ocean environment thrive, so any change in temperature can cause the death of much marine life. Understanding ocean temperatures also gives us a better understanding of the global climate change that is occurring.

Uploading and Cleaning up data#

import altair as alt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from pandas.api.types import is_numeric_dtype
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("underwater_temperature.csv")[:2000]
df

	ID	Site	Latitude	Longitude	Date	Time	Temp (�C)	Depth
0	1	Ilha Deserta	27.2706	48.331	2013/02/20	11:40:02	24.448	12.0
1	2	Ilha Deserta	27.2706	48.331	2013/02/20	12:00:03	24.448	12.0
2	3	Ilha Deserta	27.2706	48.331	2013/02/20	12:20:04	24.545	12.0
3	4	Ilha Deserta	27.2706	48.331	2013/02/20	12:40:05	24.448	12.0
4	5	Ilha Deserta	27.2706	48.331	2013/02/20	13:00:06	24.351	12.0
...	...	...	...	...	...	...	...	...
1995	1996	Ilha Deserta	27.2706	48.331	2013/03/20	5:13:17	24.351	12.0
1996	1997	Ilha Deserta	27.2706	48.331	2013/03/20	5:33:18	24.351	12.0
1997	1998	Ilha Deserta	27.2706	48.331	2013/03/20	5:53:19	24.351	12.0
1998	1999	Ilha Deserta	27.2706	48.331	2013/03/20	6:13:20	24.351	12.0
1999	2000	Ilha Deserta	27.2706	48.331	2013/03/20	6:33:21	24.351	12.0

2000 rows × 8 columns

#Checking if the dataframe has any missing values
df.isnull().values.any()

False

df.shape

(2000, 8)

Now that we have our dataset, we need to clean it up. I will drop the Date, Time, ID, and Site columns since they will not be needed.

df.drop(["Date", "Time", "ID","Site"], axis=1,inplace=True)
df.dropna(inplace=True)

df.head()

	Latitude	Longitude	Temp (�C)	Depth
0	27.2706	48.331	24.448	12.0
1	27.2706	48.331	24.448	12.0
2	27.2706	48.331	24.545	12.0
3	27.2706	48.331	24.448	12.0
4	27.2706	48.331	24.351	12.0

Now I will rename the "Temp (�C)" column to "Temp" so the dataframe looks cleaner.

df.rename(columns = {"Temp (�C)":"Temp"}, inplace=True)

df.head()

	Latitude	Longitude	Temp	Depth
0	27.2706	48.331	24.448	12.0
1	27.2706	48.331	24.448	12.0
2	27.2706	48.331	24.545	12.0
3	27.2706	48.331	24.448	12.0
4	27.2706	48.331	24.351	12.0

Now that we have our dataframe, we can explore its features. Then plt.hist generates a histogram of the target column we will be working with, in this case “Temp”, using 20 bins to get a better visualization of the fluctuation of temperature.

df.info()
df.describe()
plt.hist(df['Temp'], bins=20)
plt.show()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 1999
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Latitude   2000 non-null   float64
 1   Longitude  2000 non-null   float64
 2   Temp       2000 non-null   float64
 3   Depth      2000 non-null   float64
dtypes: float64(4)
memory usage: 78.1 KB

[Figure: histogram of Temp with 20 bins]
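Before modeling, it can also help to check how many distinct values each column takes, since a column that never changes cannot explain changes in temperature. The sketch below is a minimal illustration and not part of the original analysis: the `sample` dataframe is a hypothetical stand-in built from the first five rows displayed earlier, and `constant_columns` is a name introduced here.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned dataframe (first five rows shown earlier).
sample = pd.DataFrame({
    "Latitude": [27.2706] * 5,
    "Longitude": [48.331] * 5,
    "Temp": [24.448, 24.448, 24.545, 24.448, 24.351],
    "Depth": [12.0] * 5,
})

# Number of distinct values in each column.
distinct_counts = sample.nunique()
print(distinct_counts)

# Columns with a single distinct value carry no information about Temp.
constant_columns = distinct_counts[distinct_counts <= 1].index.tolist()
print("Constant columns:", constant_columns)
```

Running the same two lines on the full `df` would show whether Latitude, Longitude, and Depth actually vary in the rows being modeled.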

Machine Learning#

Linear Regression#

Now that we have our dataset, we need to prepare it for regression.

features = [col for col in df.columns if is_numeric_dtype(df[col]) and col != "Temp"]

X = df[features]
y = df["Temp"]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=9
)

Now that I have split the data into a training set and a test set, we can start predicting ocean temperature values. First we will use LinearRegression to observe the linear relationship between temperature and longitude, latitude, and depth.

lin = LinearRegression()
lin.fit(df[features], df["Temp"])

LinearRegression()

df["Lin_Pred"] = lin.predict(df[features])

pd.Series(lin.coef_, index=lin.feature_names_in_)

Latitude    -6.555406e+26
Longitude   -1.638851e+26
Depth        0.000000e+00
dtype: float64

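Now that we have set up LinearRegression, we can fit the model on the training set and use it to get linear predicted values for temperature on the test set. The coefficients above should not be read as real effects: values on the order of 1e26, together with a Depth coefficient of exactly 0, suggest that the features barely vary in these 2000 rows (the rows displayed earlier all share the same latitude, longitude, and depth).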

lin.fit(X_train, y_train)

# Predict using the test set
y_lin_pred = lin.predict(X_test)

# Calculate the root mean squared error (RMSE)
mse_lin = mean_squared_error(y_test, y_lin_pred)
rmse_lin = np.sqrt(mse_lin)
rmse_lin

0.6389837292969595
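The RMSE is the square root of the average squared difference between actual and predicted values, so it is in the same units as the temperature. As a rough illustration, the sketch below computes it by hand with NumPy. The five actual values are the first five temperatures displayed earlier, while the constant prediction of 24.4 is an assumption made up for this example.

```python
import numpy as np

# First five temperatures from the table shown earlier.
actual = np.array([24.448, 24.448, 24.545, 24.448, 24.351])

# Hypothetical predictions: a model that always answers 24.4.
predicted = np.full(5, 24.4)

# RMSE = sqrt(mean((actual - predicted)^2))
errors = actual - predicted
rmse_by_hand = np.sqrt(np.mean(errors ** 2))
print(rmse_by_hand)
```

This is the same quantity obtained above by taking `np.sqrt` of `mean_squared_error`.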

alt.Chart(df).mark_circle(color="purple").encode(
    x = "Temp",
    y = "Lin_Pred"
)

Based on the Altair chart above, it does not seem like there is much to see; however, comparing against the actual temperatures, we can see that the predicted temperatures fall close to the range of the actual temperatures.
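One way to put an RMSE value in context is to compare it with a baseline that always predicts the mean temperature of the training set. The sketch below shows the idea with scikit-learn's `DummyRegressor`. It runs on synthetic data generated for this example (not the Kaggle dataset), and all variable names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): temperature falls with depth.
rng = np.random.default_rng(0)
depth = rng.uniform(5, 30, size=400)
temp = 27.0 - 0.1 * depth + rng.normal(scale=0.3, size=400)
X_demo = depth.reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, temp, test_size=0.2, random_state=9
)

# Baseline: always predict the mean temperature of the training set.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
model = LinearRegression().fit(X_tr, y_tr)

rmse_baseline = np.sqrt(mean_squared_error(y_te, baseline.predict(X_te)))
rmse_model = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
print("Baseline RMSE:", rmse_baseline)
print("Linear regression RMSE:", rmse_model)
```

If a model's RMSE is not clearly below the baseline's, the features are probably adding little beyond the average temperature.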

Random Forest Regression#

Next, I will use RandomForestRegressor to see if there are any non-linear relationships between our features and the target variable, which is something LinearRegression cannot capture since it only models the linear relationship. Because Random Forest Regression averages the predictions of many decision trees, it is also less prone to overfitting than a single tree and is fairly robust to outliers in our data.
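To make the "many trees" idea concrete, the sketch below checks that a random forest's prediction is the average of its individual trees' predictions. The data is synthetic and the names (`forest_demo`, `X_new`) are hypothetical; it only illustrates the mechanism and is not the model fitted in this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic non-linear data (an assumption for illustration).
rng = np.random.default_rng(1)
X_syn = rng.uniform(0, 30, size=(300, 1))
y_syn = 26 - 2 * np.sin(X_syn[:, 0] / 5) + rng.normal(scale=0.2, size=300)

forest_demo = RandomForestRegressor(n_estimators=20, random_state=14)
forest_demo.fit(X_syn, y_syn)

X_new = np.array([[5.0], [15.0], [25.0]])

# Predict with each fitted tree separately, then average across trees.
per_tree = np.stack([tree.predict(X_new) for tree in forest_demo.estimators_])
manual_average = per_tree.mean(axis=0)

print(manual_average)
print(forest_demo.predict(X_new))
```

The two printed arrays should agree, which is why a forest tends to give smoother predictions than any single tree.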

rfr = RandomForestRegressor(random_state=14)
rfr.fit(df[features], df["Temp"])

RandomForestRegressor(random_state=14)

df["Rfr_pred"] = rfr.predict(df[features])

rfr.fit(X_train, y_train)
# Predict using the test set
y_rfr_pred = rfr.predict(X_test)

# Calculate the root mean squared error (RMSE)
rfr_mse = mean_squared_error(y_test, y_rfr_pred)
rmse_rfr = np.sqrt(rfr_mse)
rmse_rfr

0.6390578648446843

alt.Chart(df).mark_circle().encode(
    x = "Temp",
    y = "Rfr_pred"
)

In comparison to the linear regression model, the random forest regression model is very similar: the test RMSE values (0.6390 and 0.6391) differ only in the fourth decimal place.

K-Nearest Neighbors Regression#

Now that I have used both the Linear Regression and Random Forest Regression models, I will use the K-Nearest Neighbors Regression model. The K-Nearest Neighbors Regression model makes each prediction by averaging the target values of the K nearest neighbors in the training data.
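As a small worked example of that averaging, the sketch below finds the three nearest neighbors by hand and compares the result with `KNeighborsRegressor`. The depths and temperatures are made-up values for illustration, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical training data: depth in meters and temperature in °C.
depths = np.array([[2.0], [4.0], [6.0], [10.0], [14.0], [20.0]])
temps = np.array([25.0, 24.8, 24.5, 24.0, 23.4, 22.9])
query_depth = np.array([[7.5]])
k = 3

# By hand: find the k closest training depths and average their temperatures.
distances = np.abs(depths[:, 0] - query_depth[0, 0])
nearest = np.argsort(distances)[:k]
manual_prediction = temps[nearest].mean()

# Same prediction from scikit-learn (default uniform weights).
knn_demo = KNeighborsRegressor(n_neighbors=k).fit(depths, temps)
sklearn_prediction = knn_demo.predict(query_depth)[0]

print(manual_prediction, sklearn_prediction)
```

Here the three nearest depths are 6, 10, and 4 meters, so the prediction is the mean of 24.5, 24.0, and 24.8.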

knn = KNeighborsRegressor()
knn.fit(df[features], df["Temp"])

KNeighborsRegressor()

df["Knn_pred"] = knn.predict(df[features])

knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict using the test set
y_knn_pred = knn.predict(X_test)

# Calculate the root mean squared error (RMSE)
mse_knn = mean_squared_error(y_test, y_knn_pred)
rmse_knn = np.sqrt(mse_knn)
rmse_knn

0.7402467899288725

alt.Chart(df).mark_circle().encode(
    x = "Temp",
    y = "Knn_pred"
)

In comparison to the Linear Regression and Random Forest Regression charts, the K-Nearest Neighbors Regression chart seems to be the most accurate of the three. However, this chart shows predictions on the same rows the model was fit on, and on the held-out test set the KNN model has the highest RMSE of the three (0.7402 versus about 0.6390), so the chart alone does not show that KNN predicts ocean temperatures best.

Comparing Models#

Now we will compare the RMSE values and then use matplotlib scatter plots of actual versus predicted temperature to see visually which model is the best fit for predicting underwater ocean temperatures.

# Compare RMSE values first to see which model is more accurate (lower is better)
print("Linear regression RMSE:", rmse_lin)
print("RandomForestRegression RMSE", rmse_rfr)
print("KNN regression RMSE:", rmse_knn)

Linear regression RMSE: 0.6389837292969595
RandomForestRegression RMSE 0.6390578648446843
KNN regression RMSE: 0.7402467899288725
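Since a lower RMSE means the predictions are closer to the actual values, the three scores printed above can be ranked by sorting them. The short sketch below does this with the values exactly as printed; the dictionary and variable names are introduced here for illustration.

```python
# RMSE values copied from the printed output above.
rmse_by_model = {
    "Linear regression": 0.6389837292969595,
    "Random forest regression": 0.6390578648446843,
    "KNN regression": 0.7402467899288725,
}

# Sort from lowest (best) to highest (worst) RMSE.
ranking = sorted(rmse_by_model.items(), key=lambda item: item[1])

for rank, (name, rmse) in enumerate(ranking, start=1):
    print(f"{rank}. {name}: RMSE = {rmse:.4f}")
```

On these numbers, linear regression and random forest are nearly tied and KNN has the largest error.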

# Visualizing actual vs predicted values for linear regression
plt.scatter(y_test, y_lin_pred, color='purple')
plt.plot(y_test, y_test, color='blue')
plt.title('Linear Regression')
plt.xlabel('Actual Temperature')
plt.ylabel('Predicted Temperature')
plt.show()

# Visualizing actual vs predicted values for random forest regression
plt.scatter(y_test, y_rfr_pred, color = 'purple')
plt.plot(y_test, y_test, color = 'blue')
plt.title('Random Forest Regression')
plt.xlabel('Actual Temperature')
plt.ylabel('Predicted Temperature')
plt.show()

# Visualizing the actual vs predicted values for KNN regression
plt.scatter(y_test, y_knn_pred, color='purple')
plt.plot(y_test, y_test, color='blue')
plt.title('KNN Regression')
plt.xlabel('Actual Temperature')
plt.ylabel('Predicted Temperature')
plt.show()

[Figure: Linear Regression, actual vs. predicted temperature]

[Figure: Random Forest Regression, actual vs. predicted temperature]

[Figure: KNN Regression, actual vs. predicted temperature]

Summary#

By plotting the three scatter plots and comparing the RMSE values, we can see how well each model predicts underwater ocean temperature based on location and depth. The Linear and Random Forest Regression models are almost identical: their test RMSE values are 0.6390 and 0.6391, so the Linear Regression model is more accurate only by a very small margin. Although the KNN Regression chart looked the most accurate, its test RMSE of 0.7402 is the highest of the three, and since a lower RMSE means more accurate predictions, KNN was the least accurate model on the test set. I think KNN Regression could still be useful for more specific, non-linear data such as ocean underwater temperature, because it predicts from nearby observations, but these results do not show that here. Overall, predicting ocean temperatures using historical data will be helpful for understanding the changes in ocean temperature and for keeping track of how warm the ocean is becoming before it becomes a disaster for the environment and marine life.

References#

What is the source of your dataset(s)?

My dataset was uploaded to Kaggle:

Ocean temperature - https://www.kaggle.com/datasets/shivamb/underwater-surface-temperature-dataset

List any other references that you found helpful.

Some references that I found most helpful were:
