Sorting the number of planets

Author: Ningyue Xu

Course Project, UC Irvine, Math 10, S22

Introduction


In this project, we analyze several variables in the planets dataset and try to classify the data in various ways. After examining the detection "method", we use several techniques to try to predict "number" from the other variables.

Main portion of the project


First, we import the libraries needed for this project, load the seaborn "planets" dataset, and drop the rows with missing data.

import numpy as np
import pandas as pd
import seaborn as sns
import altair as alt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier,KNeighborsRegressor
from sklearn.metrics import log_loss, mean_squared_error,mean_absolute_error
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
df = sns.load_dataset("planets").dropna()  # load the planets data and drop rows with missing values
df
method number orbital_period mass distance year
0 Radial Velocity 1 269.30000 7.100 77.40 2006
1 Radial Velocity 1 874.77400 2.210 56.95 2008
2 Radial Velocity 1 763.00000 2.600 19.84 2011
3 Radial Velocity 1 326.03000 19.400 110.62 2007
4 Radial Velocity 1 516.22000 10.500 119.47 2009
... ... ... ... ... ... ...
640 Radial Velocity 1 111.70000 2.100 14.90 2009
641 Radial Velocity 1 5.05050 1.068 44.46 2013
642 Radial Velocity 1 311.28800 1.940 17.24 1999
649 Transit 1 2.70339 1.470 178.00 2013
784 Radial Velocity 3 580.00000 0.947 135.00 2012

498 rows × 6 columns
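It is worth noting that dropna() removes a large share of the rows, since mass and distance are often missing. As a quick check (the name raw below is used only for this comparison):

raw = sns.load_dataset("planets")
print(raw.shape)            # the full dataset, including rows with missing values
print(raw.dropna().shape)   # the rows kept after dropping missing values
print(raw.isna().sum())     # how many values are missing in each column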

In this dataset, I want to see whether the other variables are related to "method"; in other words, I want to classify "method".

numcols = [c for c in df.columns if c != "method"]  # every column except the categorical "method"
kmeans = KMeans(n_clusters=2)  # two clusters (the cleaned data contains two detection methods)
kmeans.fit(df[numcols])
df["cluster"] = kmeans.predict(df[numcols])
alt.Chart(df).mark_circle().encode(
    x = "mass",
    y = "distance",
    color = "method:N",
    tooltip = ["cluster","number"]
).interactive()

Because there are so few Transit samples, I initially intended to use "add_selection" to aid the analysis, but the result still did not meet my needs.

brush = alt.selection_interval()
c1=alt.Chart(df).mark_circle().encode(
    x = "mass",
    y = "distance",
    color = "method:N",
    tooltip = ["cluster","number"]
).add_selection(brush)
c2=alt.Chart(df).mark_circle().encode(
    x = "mass",
    y = "distance",
    color = "method:N"
).transform_filter(brush)
c1|c2
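Before interpreting these charts further, a quick count shows how the samples are distributed across "method" and across "number":

print(df["method"].value_counts())  # how many samples each detection method has
print(df["number"].value_counts())  # how many systems have each planet count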

At first, I had intended to use different x and y variables to work out what "method" was classified by. However, after seeing the first chart I abandoned this idea: the "methods" fall into only two categories, the distribution is clearly far too unbalanced, and analyzing "method" is therefore not very meaningful. So I decided to drop the variable "method" and move to a more balanced part of the data. I noticed that "number" naturally falls into six categories, so I wanted to see whether clustering could recover that classification of "number".

kmeans2 = KMeans(n_clusters=6)  # six clusters, matching the six distinct values of "number"
kmeans2.fit(df[numcols])
df["cluster2"] = kmeans2.predict(df[numcols])
chart_list = []

for c in numcols:
    chart = alt.Chart(df).mark_circle().encode(
        x = "mass",
        y = c,
        color = "cluster2:N"
    )
    chart_list.append(chart)
alt.vconcat(*chart_list)

Looking at these charts, we are surprised to find that orbital period and mass are clearly associated with the cluster assignments. In other words, when the data are partitioned, orbital period is an important basis for the split. This also shows that the clustering does not divide "number" into six categories in any intuitive way.
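One way to see which variables drive the clusters is to look at the cluster centers; columns with much larger numeric ranges, such as orbital_period, dominate the Euclidean distances that KMeans uses. This is only a quick check:

pd.DataFrame(kmeans2.cluster_centers_, columns=numcols)  # cluster centers on the original, unscaled columns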

Considering that orbital_period itself takes on very large values, which may distort the classification, let us re-examine the situation with orbital_period on the x-axis.

chart_list2 = []

for c in numcols:
    chart = alt.Chart(df).mark_circle().encode(
        x = "orbital_period",
        y = c,
        color = "cluster2:N"
    )
    chart_list2.append(chart)
alt.vconcat(*chart_list2)

We find that the clusters separate along orbital_period for every variable. At this point, we can see that the sheer size of the orbital_period values has a large impact on the clustering. Therefore, we need to rescale the data.

df2 = df.copy()
# Center each column; with_std=False means the values are only shifted by the mean,
# not divided by the standard deviation
scaler = StandardScaler(with_mean=True, with_std=False)
scaler.fit(df[numcols])
df2[numcols] = scaler.transform(df[numcols])
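Note that with with_std=False the scaler above only subtracts the column means; it does not divide by the standard deviations, so orbital_period still has by far the largest spread. A fully standardized alternative would look like the following sketch (df3 and full_scaler are names introduced only for this comparison; the rest of the notebook continues with the centered df2):

df3 = df.copy()
full_scaler = StandardScaler()          # with_mean=True, with_std=True by default
df3[numcols] = full_scaler.fit_transform(df[numcols])
df3[numcols].std()                      # each column now has roughly unit standard deviation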

Then we repeat the steps above. Given the extreme imbalance of "method", we do not analyze it any further and instead focus on the clustering and on "number".

kmeans3 = KMeans(n_clusters=6)
kmeans3.fit(df2[numcols])
df2["cluster2"] = kmeans3.predict(df2[numcols])
chart_list3 = []

for c in numcols:
    chart = alt.Chart(df2).mark_circle().encode(
        x = "mass",
        y = c,
        color = "cluster2:N"
    )
    chart_list3.append(chart)
alt.vconcat(*chart_list3)
chart_list4 = []

for c in numcols:
    chart = alt.Chart(df2).mark_circle().encode(
        x = "orbital_period",
        y = c,
        color = "cluster2:N"
    )
    chart_list4.append(chart)
alt.vconcat(*chart_list4)

At this point, we can already confirm very clearly that orbital_period drives the clustering. Moreover, it is clear that this clustering has little to do with "number".
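We can make this concrete with a cross-tabulation of the cluster labels against "number"; if the clusters tracked "number", each cluster would concentrate on a single value. This is just a quick check alongside the charts:

pd.crosstab(df2["cluster2"], df["number"])  # rows are clusters, columns are planet counts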

After this, I want to use KNeighborsClassifier to predict the number of planets.

df2
method number orbital_period mass distance year cluster cluster2
0 Radial Velocity -0.73494 -566.478671 4.59068 25.331787 -1.37751 1 0
1 Radial Velocity -0.73494 38.995329 -0.29932 4.881787 0.62249 1 4
2 Radial Velocity -0.73494 -72.778671 0.09068 -32.228213 3.62249 1 4
3 Radial Velocity -0.73494 -509.748671 16.89068 58.551787 -0.37751 1 0
4 Radial Velocity -0.73494 -319.558671 7.99068 67.401787 1.62249 1 0
... ... ... ... ... ... ... ... ...
640 Radial Velocity -0.73494 -724.078671 -0.40932 -37.168213 1.62249 1 0
641 Radial Velocity -0.73494 -830.728171 -1.44132 -7.608213 5.62249 1 0
642 Radial Velocity -0.73494 -524.490671 -0.56932 -34.828213 -8.37751 1 0
649 Transit -0.73494 -833.075280 -1.03932 125.931787 5.62249 1 0
784 Radial Velocity 1.26506 -255.778671 -1.56232 82.931787 4.62249 1 4

498 rows × 8 columns

X = df2.iloc[:,2:]  # every column from orbital_period onward, including the cluster labels
y = df["number"]
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

We split the data into training and test sets, using random_state to fix the random split so it is reproducible.

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, train_size=0.6, random_state=4)
clf = KNeighborsClassifier( )
clf.fit(X_train, y_train)
KNeighborsClassifier()
df2['predict_K'] = clf.predict(X_scaled)
(df["number"] == df2["predict_K"]).value_counts()
True     338
False    160
dtype: int64
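Note that the counts above are computed on the whole dataset, so they mix training rows with test rows. A fairer check is the accuracy on the held-out test set only:

clf.score(X_test, y_test)  # fraction of correct predictions on rows the classifier has not seen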

Here, a fairly large gap appears between the predicted and true values. Therefore, we want to examine overfitting and underfitting using KNeighborsRegressor.

def get_scores(k):
    # Fit a k-nearest-neighbors regressor and return the train and test mean absolute errors
    reg = KNeighborsRegressor(n_neighbors=k)
    reg.fit(X_train, y_train)
    train_error = mean_absolute_error(reg.predict(X_train), y_train)
    test_error = mean_absolute_error(reg.predict(X_test), y_test)
    return (train_error, test_error)
df2_scores = pd.DataFrame({"k":range(1,150),"train_error":np.nan,"test_error":np.nan})
for i in df2_scores.index:
    df2_scores.loc[i,["train_error","test_error"]] = get_scores(df2_scores.loc[i,"k"])
ctrain = alt.Chart(df2_scores).mark_line().encode(
    x = "k",
    y = "train_error"
)
ctest = alt.Chart(df2_scores).mark_line(color="orange").encode(
    x = "k",
    y = "test_error"
)
ctrain+ctest

Here we note that large values of k appear to underfit while small values of k appear to overfit, so a k somewhere between 40 and 80 looks more appropriate. With k set in this range, we check whether KNeighborsClassifier can predict more accurately.
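Rather than reading k off the chart by eye, we can also look up the k with the smallest test error directly (the exact value depends on the random split):

df2_scores.loc[df2_scores["test_error"].idxmin()]  # the row with the lowest test error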

clf2 = KNeighborsClassifier(n_neighbors=45)
clf2.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=45)
df2['predict_K2'] = clf2.predict(X_scaled)
(df["number"] == df2["predict_K2"]).value_counts()
True     313
False    185
dtype: int64

At this point, we find that the prediction is still far off, which means that overfitting and underfitting are not the main causes of the prediction error.
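A useful baseline here is to always predict the most common value of "number"; if the classifier barely beats this, the features carry little information about "number":

(y == y.value_counts().idxmax()).mean()  # accuracy of always guessing the most frequent class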

Going back to the data, I realized that year (year of discovery) and distance (distance from Earth) do not seem to reflect the number of planets in any obvious way, so the problem with the previous analysis might be that some of the features are unrelated to "number". Therefore, we restrict the features (below, only the "mass" column is kept) and re-run the earlier code.

X2 = df2.iloc[:,3:4]  # keep only the "mass" column as the predictor
y = df["number"]
scaler = StandardScaler()
scaler.fit(X2)
X2_scaled = scaler.transform(X2)
X2_train, X2_test, y_train, y_test = train_test_split(X2_scaled, y, train_size=0.6, random_state=4)
clf = KNeighborsClassifier( )
clf.fit(X2_train, y_train)
KNeighborsClassifier()
df2['predict_K'] = clf.predict(X2_scaled)
(df["number"] == df2["predict_K"]).value_counts()
True     314
False    184
dtype: int64

After restricting the features, we found that the classification still did not improve, and we seem to have to accept that "number" does not have much to do with the other variables.

Next, we analyze the correlation between the variables.

corr = df.corr()
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0, len(corr.columns), 1)
ax.set_xticks(ticks)
plt.xticks(rotation=90)
ax.set_yticks(ticks)
ax.set_xticklabels(corr.columns)  # label with corr's own columns ("method" is non-numeric and dropped by corr())
ax.set_yticklabels(corr.columns)
plt.show()
(Figure: heatmap of the correlations between the numeric columns.)

In this plot, "method" does not appear at all (it is categorical, so it is dropped from the numeric correlation matrix), and the variable most strongly correlated with "number" appears to be "year". So we had incorrectly estimated the correlations between the different variables before.
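The same correlations can also be read off numerically, which is easier than judging colors in the heatmap:

df[numcols].corr()["number"].sort_values()  # correlation of each numeric column with "number"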

We can now check whether there is a linear relationship between "year" and "number".

from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(df2[["year"]], df2["number"])
df2["Pred"] = reg.predict(df2[["year"]])
c = alt.Chart(df2).mark_circle().encode(
    x="year",
    y="number"
)
c1 = alt.Chart(df2).mark_line(color="red").encode(
    x="year",
    y="Pred"
)
c+c1

Looking at the fitted line, we found that it was quite different from what we had expected. Therefore, we try polynomial regression.

df2_train, df2_test = train_test_split(df2, train_size=0.4)
cols = []
for deg in range(1,10):
    c = f"d{deg}"
    cols.append(c)
    df2_train[c] = df2_train["year"]**deg  # add a column d1, ..., d9 for each power of year
reg.fit(df2_train[cols], df2_train["number"])
LinearRegression()
df2_train["Pred2"] = reg.predict(df2_train[cols])
c = alt.Chart(df2_train).mark_circle().encode(
    x="year",
    y="number"
)

c9 = alt.Chart(df2_train).mark_line(color="red").encode(
    x="year",
    y="Pred2"
)

c+c9

After this analysis, we found that the polynomial fit matched our expectations better.
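Because a degree-9 polynomial can easily overfit, it is also worth checking the fit on the held-out rows; the sketch below rebuilds the same d1, ..., d9 columns on df2_test and measures the error there:

for deg in range(1,10):
    df2_test[f"d{deg}"] = df2_test["year"]**deg  # same polynomial columns as on the training split
mean_absolute_error(df2_test["number"], reg.predict(df2_test[cols]))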

Summary

It is difficult to directly classify highly unbalanced data ("method" in this dataset). Classifying the other variables also requires first determining whether the different variables are actually correlated. In this dataset, "number" correlates poorly with many of the variables, which makes it difficult to predict accurately.

References

  • What is the source of your dataset(s)? The seaborn "planets" dataset.
