Exoplanet Candidate Analysis

Author: Maya Drusinsky

Email: mdrusins@uci.edu

Course Project, UC Irvine, Math 10, S22

I’ve been coding since around June 2021. For the past ten months or so I had an internship at a research lab, where my job was to use computer vision and some machine learning to analyze images that the researchers produced from their experiments. Most of my Python knowledge involved OpenCV and Matplotlib, but I also had some experience with NumPy, Seaborn, and pandas. I taught myself to code (I had never taken a coding class until this quarter), so I was missing a lot of the basics of Python, but I am fairly well versed in libraries we didn’t use as much in this class, like OpenCV.

Introduction


This dataset contains approximately 10,000 exoplanet candidates discovered by NASA’s Kepler space telescope since its launch. Each object of interest is classified as a false positive, a candidate, or a confirmed exoplanet. Along with its status, each object has columns for numerous descriptive features (such as the source of its signal or its radius). A full description of all the columns and what they contain can be found here. The goal of my project is to use machine learning on these features to try to accurately predict each object’s disposition.

Main portion of the project

import pandas as pd
import altair as alt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import plot_tree
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss
df = pd.read_csv("kepler_exoplanets.csv")
df.head()
rowid kepid kepoi_name kepler_name koi_disposition koi_pdisposition koi_score koi_fpflag_nt koi_fpflag_ss koi_fpflag_co ... koi_steff_err2 koi_slogg koi_slogg_err1 koi_slogg_err2 koi_srad koi_srad_err1 koi_srad_err2 ra dec koi_kepmag
0 1 10797460 K00752.01 Kepler-227 b CONFIRMED CANDIDATE 1.000 0 0 0 ... -81.0 4.467 0.064 -0.096 0.927 0.105 -0.061 291.93423 48.141651 15.347
1 2 10797460 K00752.02 Kepler-227 c CONFIRMED CANDIDATE 0.969 0 0 0 ... -81.0 4.467 0.064 -0.096 0.927 0.105 -0.061 291.93423 48.141651 15.347
2 3 10811496 K00753.01 NaN FALSE POSITIVE FALSE POSITIVE 0.000 0 1 0 ... -176.0 4.544 0.044 -0.176 0.868 0.233 -0.078 297.00482 48.134129 15.436
3 4 10848459 K00754.01 NaN FALSE POSITIVE FALSE POSITIVE 0.000 0 1 0 ... -174.0 4.564 0.053 -0.168 0.791 0.201 -0.067 285.53461 48.285210 15.597
4 5 10854555 K00755.01 Kepler-664 b CONFIRMED CANDIDATE 1.000 0 0 0 ... -211.0 4.438 0.070 -0.210 1.046 0.334 -0.133 288.75488 48.226200 15.509

5 rows × 50 columns

df.shape
(9564, 50)

Plotting the data

To get an idea of what the data looks like, I will plot a variety of Altair charts using different features from the dataframe. The dataframe has a lot of rows, so I am taking a sample of 5,000 points so the charts are a little more readable (this also stays within Altair’s default 5,000-row limit).

sel = alt.selection_single(fields=["koi_disposition"])

# Draw all the charts from one shared 5,000-point sample so they show the same data
df_sample = df.sample(5000, random_state=0)

c1 = alt.Chart(df_sample).mark_bar().encode(
    x = "koi_disposition", 
    y = "count(koi_disposition)", 
    color = "koi_disposition"
).properties(
    title='Exoplanet Disposition')

c2 = alt.Chart(df_sample).mark_circle(clip = True).encode(
    x = alt.X("koi_impact", scale = alt.Scale(domain = [0, 20])), 
    y = alt.Y("koi_period", scale = alt.Scale(domain = [0, 1300])), 
    color = alt.Color("koi_disposition", scale=alt.Scale(scheme="magma")), 
    tooltip=["koi_disposition","koi_impact", "koi_score"]
).add_selection(sel).transform_filter(sel)

c3 = alt.Chart(df_sample).mark_bar().encode(
    x = "koi_disposition", 
    y = "mean(koi_insol)", 
    color = "koi_disposition"
)

c4 = alt.Chart(df_sample).mark_bar().encode(
    x = "koi_disposition", 
    y = "mean(koi_impact)", 
    color = "koi_disposition"
)

c5 = alt.Chart(df_sample).mark_circle(clip = True).encode(
    x = alt.X("koi_impact", scale = alt.Scale(domain = [0, 4])),
    y = "koi_score", 
    color = alt.Color("koi_disposition", scale=alt.Scale(scheme="magma")), 
    tooltip=["koi_disposition","koi_impact", "koi_score"]
).add_selection(sel).transform_filter(sel)
c1

Below are some more charts showing how the different planet dispositions are distributed with respect to different columns from the dataset.

alt.hconcat(c3, c4)
c2
c5

Logistic Regression

I want to make a Logistic Regression model using the columns that (based on their NASA descriptions) seem the most important.

cols = ["koi_disposition", "koi_score", "koi_period", "koi_impact", "koi_duration", "koi_depth", "koi_prad", "koi_insol"]
cols1 = ["koi_score", "koi_period", "koi_impact", "koi_duration", "koi_depth", "koi_prad", "koi_insol"]
df2 = df[cols].copy()
df2.head()
koi_disposition koi_score koi_period koi_impact koi_duration koi_depth koi_prad koi_insol
0 CONFIRMED 1.000 9.488036 0.146 2.95750 615.8 2.26 93.59
1 CONFIRMED 0.969 54.418383 0.586 4.50700 874.8 2.83 9.11
2 FALSE POSITIVE 0.000 19.899140 0.969 1.78220 10829.0 14.60 39.30
3 FALSE POSITIVE 0.000 1.736952 1.276 2.40641 8079.2 33.46 891.96
4 CONFIRMED 1.000 2.525592 0.701 1.65450 603.3 2.75 926.16
# Fill any NaN values in the feature columns with the column median
for col in cols1:
    df2[col] = df2[col].fillna(df2[col].median())

To evaluate this model fairly, I’m going to divide my data into training and test sets.

X_train, X_test, y_train, y_test = train_test_split(df2[cols1], df2["koi_disposition"], test_size=0.2, random_state = 0)
print(f"X_train has {len(X_train)} rows and X_test has {len(X_test)} rows.")
X_train has 7651 rows and X_test has 1913 rows.
# Instantiate
clf = LogisticRegression(max_iter = 5000, random_state = 0)

# Fit
clf.fit(X_train, y_train)
LogisticRegression(max_iter=5000, random_state=0)
# Find the score of the training and test sets to see how accurate my model is

print(f"The training score is {clf.score(X_train, y_train)}")
print(f"The test score is {clf.score(X_test, y_test)}")
The training score is 0.5248987060514966
The test score is 0.5295347621536853
# Add a predicted values column to my dataframe

df2["Pred"] = clf.predict(df2[cols1])
df2
koi_disposition koi_score koi_period koi_impact koi_duration koi_depth koi_prad koi_insol Pred
0 CONFIRMED 1.000 9.488036 0.146 2.95750 615.8 2.26 93.59 FALSE POSITIVE
1 CONFIRMED 0.969 54.418383 0.586 4.50700 874.8 2.83 9.11 FALSE POSITIVE
2 FALSE POSITIVE 0.000 19.899140 0.969 1.78220 10829.0 14.60 39.30 FALSE POSITIVE
3 FALSE POSITIVE 0.000 1.736952 1.276 2.40641 8079.2 33.46 891.96 FALSE POSITIVE
4 CONFIRMED 1.000 2.525592 0.701 1.65450 603.3 2.75 926.16 FALSE POSITIVE
... ... ... ... ... ... ... ... ... ...
9559 FALSE POSITIVE 0.000 8.589871 0.765 4.80600 87.7 1.11 176.40 FALSE POSITIVE
9560 FALSE POSITIVE 0.000 0.527699 1.252 3.22210 1579.2 29.35 4500.53 FALSE POSITIVE
9561 CANDIDATE 0.497 1.739849 0.043 3.11400 48.5 0.72 1585.81 FALSE POSITIVE
9562 FALSE POSITIVE 0.021 0.681402 0.147 0.86500 103.6 1.07 5713.41 FALSE POSITIVE
9563 FALSE POSITIVE 0.000 4.856035 0.134 3.07800 76.7 1.05 607.42 FALSE POSITIVE

9564 rows × 9 columns

Here we can see this model did a pretty bad job of predicting planet disposition: it predicts far too many objects as false positives. I want to see if using Logistic Regression with more columns will improve the score of my model while also preventing overfitting.
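To quantify just how lopsided these predictions are, here is a minimal sketch that cross-tabulates the true dispositions against the predictions (it assumes the df2 and its “Pred” column from the cell above):

# Confusion-matrix-style table: rows are the true dispositions,
# columns are the model's predictions
pd.crosstab(df2["koi_disposition"], df2["Pred"], rownames=["Actual"], colnames=["Predicted"])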

cols2 = ["koi_score", "koi_fpflag_nt", "koi_fpflag_ss", "koi_fpflag_co", "koi_fpflag_ec", "koi_period", "koi_period_err1", "koi_period_err2", "koi_time0bk", "koi_time0bk_err1", "koi_time0bk_err2", "koi_impact", "koi_impact_err1",  "koi_impact_err2", "koi_duration", "koi_duration_err1", "koi_duration_err2", "koi_depth", "koi_depth_err1", "koi_depth_err2", "koi_prad", "koi_prad_err1", "koi_prad_err2", "koi_teq","koi_insol", "koi_insol_err1", "koi_insol_err2", "koi_model_snr", "koi_tce_plnt_num", "koi_steff", "koi_steff_err1", "koi_steff_err2", "koi_slogg", "koi_slogg_err1", "koi_slogg_err2", "koi_srad", "koi_srad_err1", "koi_srad_err2", "ra", "dec", "koi_kepmag"]
print(f"cols2 has {len(cols2)} columns in it.")
cols2 has 41 columns in it.

I am going to make another dataframe for this model to avoid getting my data or predictions mixed up. I’m going to fill in all the missing values in this dataframe as well.

df3 = df.copy()
for col in cols2:
    df3[col] = df3[col].fillna(df3[col].median())

Again, I am going to divide my data into training and test sets. These sets will be different from the last ones because I am using more columns this time.

X_train2, X_test2, y_train2, y_test2 = train_test_split(df3[cols2], df3["koi_disposition"], test_size=0.2, random_state = 0)

Here I am going to make a second Logistic Regression model using my new training and test sets, which include more columns than the first. I hope this model will be more accurate, since the first one was not.

# Instantiate
clf2 = LogisticRegression(max_iter = 5000)

# Fit
clf2.fit(X_train2, y_train2)
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:818: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG,
LogisticRegression(max_iter=5000)
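The warning above suggests scaling the data. I did not do that in my original analysis, but here is a minimal sketch of what it could look like, using a scikit-learn Pipeline with StandardScaler (the clf2_scaled name is mine, for illustration):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical variant of clf2: standardize each feature before the
# logistic regression, which usually helps lbfgs converge
clf2_scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
clf2_scaled.fit(X_train2, y_train2)
print(f"Scaled model test score: {clf2_scaled.score(X_test2, y_test2)}")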
# Find the score of the training and test sets to see how accurate my model is

print(f"The training score is {clf2.score(X_train2, y_train2)}")
print(f"The test score is {clf2.score(X_test2, y_test2)}")
The training score is 0.6559926806953339
The test score is 0.6414009409304757

This model seems a little more accurate, and since the training and test scores are so similar, there does not seem to be overfitting. Now I want to add these predicted values to the new dataframe I made earlier.

# Add a predicted values column to my dataframe

df3["Pred"] = clf2.predict(df3[cols2])

Feature Engineering

I want to display my dataframe, but there are some unnecessary columns. As we can see below, “kepoi_name”, “kepler_name”, “koi_disposition”, “koi_pdisposition”, “koi_tce_delivname”, and “Pred” all have object data types. Of these, only “koi_disposition” and “Pred” are important to us (they hold the true and predicted dispositions), so I want to get rid of the rest. “rowid” and “kepid” are also unnecessary: even though they are integers, they are just ID numbers and provide no useful information, so I will delete them from the dataframe as well. I will also drop the “koi_teq” error columns, since I am not using them.

df3.dtypes
rowid                  int64
kepid                  int64
kepoi_name            object
kepler_name           object
koi_disposition       object
koi_pdisposition      object
koi_score            float64
koi_fpflag_nt          int64
koi_fpflag_ss          int64
koi_fpflag_co          int64
koi_fpflag_ec          int64
koi_period           float64
koi_period_err1      float64
koi_period_err2      float64
koi_time0bk          float64
koi_time0bk_err1     float64
koi_time0bk_err2     float64
koi_impact           float64
koi_impact_err1      float64
koi_impact_err2      float64
koi_duration         float64
koi_duration_err1    float64
koi_duration_err2    float64
koi_depth            float64
koi_depth_err1       float64
koi_depth_err2       float64
koi_prad             float64
koi_prad_err1        float64
koi_prad_err2        float64
koi_teq              float64
koi_teq_err1         float64
koi_teq_err2         float64
koi_insol            float64
koi_insol_err1       float64
koi_insol_err2       float64
koi_model_snr        float64
koi_tce_plnt_num     float64
koi_tce_delivname     object
koi_steff            float64
koi_steff_err1       float64
koi_steff_err2       float64
koi_slogg            float64
koi_slogg_err1       float64
koi_slogg_err2       float64
koi_srad             float64
koi_srad_err1        float64
koi_srad_err2        float64
ra                   float64
dec                  float64
koi_kepmag           float64
Pred                  object
dtype: object
# Remove unnecessary columns from this dataframe before we display it

droplist = ["rowid", "kepid", "kepoi_name", "kepler_name", "koi_pdisposition", "koi_teq_err2", "koi_teq_err1", "koi_tce_delivname"]
print(f"Old shape: {df3.shape}")

df3.drop(columns=droplist, inplace=True)

print(f"New shape: {df3.shape}")
Old shape: (9564, 51)
New shape: (9564, 43)
df3
koi_disposition koi_score koi_fpflag_nt koi_fpflag_ss koi_fpflag_co koi_fpflag_ec koi_period koi_period_err1 koi_period_err2 koi_time0bk ... koi_slogg koi_slogg_err1 koi_slogg_err2 koi_srad koi_srad_err1 koi_srad_err2 ra dec koi_kepmag Pred
0 CONFIRMED 1.000 0 0 0 0 9.488036 2.775000e-05 -2.775000e-05 170.538750 ... 4.467 0.064 -0.096 0.927 0.105 -0.061 291.93423 48.141651 15.347 CONFIRMED
1 CONFIRMED 0.969 0 0 0 0 54.418383 2.479000e-04 -2.479000e-04 162.513840 ... 4.467 0.064 -0.096 0.927 0.105 -0.061 291.93423 48.141651 15.347 CONFIRMED
2 FALSE POSITIVE 0.000 0 1 0 0 19.899140 1.494000e-05 -1.494000e-05 175.850252 ... 4.544 0.044 -0.176 0.868 0.233 -0.078 297.00482 48.134129 15.436 FALSE POSITIVE
3 FALSE POSITIVE 0.000 0 1 0 0 1.736952 2.630000e-07 -2.630000e-07 170.307565 ... 4.564 0.053 -0.168 0.791 0.201 -0.067 285.53461 48.285210 15.597 FALSE POSITIVE
4 CONFIRMED 1.000 0 0 0 0 2.525592 3.761000e-06 -3.761000e-06 171.595550 ... 4.438 0.070 -0.210 1.046 0.334 -0.133 288.75488 48.226200 15.509 FALSE POSITIVE
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9559 FALSE POSITIVE 0.000 0 0 0 1 8.589871 1.846000e-04 -1.846000e-04 132.016100 ... 4.296 0.231 -0.189 1.088 0.313 -0.228 298.74921 46.973351 14.478 FALSE POSITIVE
9560 FALSE POSITIVE 0.000 0 1 1 0 0.527699 1.160000e-07 -1.160000e-07 131.705093 ... 4.529 0.035 -0.196 0.903 0.237 -0.079 297.18875 47.093819 14.082 FALSE POSITIVE
9561 CANDIDATE 0.497 0 0 0 0 1.739849 1.780000e-05 -1.780000e-05 133.001270 ... 4.444 0.056 -0.224 1.031 0.341 -0.114 286.50937 47.163219 14.757 FALSE POSITIVE
9562 FALSE POSITIVE 0.021 0 0 1 0 0.681402 2.434000e-06 -2.434000e-06 132.181750 ... 4.447 0.056 -0.224 1.041 0.341 -0.114 294.16489 47.176281 15.385 FALSE POSITIVE
9563 FALSE POSITIVE 0.000 0 0 1 1 4.856035 6.356000e-05 -6.356000e-05 135.993300 ... 4.385 0.054 -0.216 1.193 0.410 -0.137 297.00977 47.121021 14.826 FALSE POSITIVE

9564 rows × 43 columns

This model did a little better at predicting the planet disposition, but I think it could be better still. Next I’ll try a decision tree model to see if that works better.

Decision Tree

# Create a new DataFrame where the columns in cols2 have no empty values
df4 = df.copy()
for col in cols2:
    df4[col] = df4[col].fillna(df4[col].median())

To avoid confusion between models, I am going to create new test and training sets, and name this classifier clf_tree.

# Instantiate
clf_tree = DecisionTreeClassifier(max_depth = 15, max_leaf_nodes=18)

# Train and Test sets
X_train_tree, X_test_tree, y_train_tree, y_test_tree = train_test_split(df4[cols2], df4["koi_disposition"], test_size=0.2, random_state = 0)

# Fit the DecisionTreeClassifier on the tree-specific training set
clf_tree.fit(X_train_tree, y_train_tree)
DecisionTreeClassifier(max_depth=15, max_leaf_nodes=18)
# Plot the Decision Tree
fig = plt.figure(figsize=(200,100))
_ = plot_tree(
    clf_tree, 
    feature_names = clf_tree.feature_names_in_, 
    class_names = clf_tree.classes_,
    filled = True
)
[Figure: plot of the fitted decision tree]

I’m curious to see which columns are the most important to my decision tree model:

# Create a DataFrame of the columns ("features") and their importance in relation to 
# predictions by the Decision Tree
df_features = pd.DataFrame({"Importance": clf_tree.feature_importances_, "feature": clf_tree.feature_names_in_})
df_features
Importance feature
0 0.693282 koi_score
1 0.084338 koi_fpflag_nt
2 0.014428 koi_fpflag_ss
3 0.019486 koi_fpflag_co
4 0.000000 koi_fpflag_ec
5 0.000000 koi_period
6 0.006403 koi_period_err1
7 0.000000 koi_period_err2
8 0.003890 koi_time0bk
9 0.000000 koi_time0bk_err1
10 0.000000 koi_time0bk_err2
11 0.023268 koi_impact
12 0.000000 koi_impact_err1
13 0.000000 koi_impact_err2
14 0.009453 koi_duration
15 0.000000 koi_duration_err1
16 0.000000 koi_duration_err2
17 0.000000 koi_depth
18 0.000000 koi_depth_err1
19 0.000000 koi_depth_err2
20 0.015040 koi_prad
21 0.000000 koi_prad_err1
22 0.000000 koi_prad_err2
23 0.000000 koi_teq
24 0.000000 koi_insol
25 0.000000 koi_insol_err1
26 0.000000 koi_insol_err2
27 0.127646 koi_model_snr
28 0.000000 koi_tce_plnt_num
29 0.000000 koi_steff
30 0.000000 koi_steff_err1
31 0.000000 koi_steff_err2
32 0.000000 koi_slogg
33 0.000000 koi_slogg_err1
34 0.000000 koi_slogg_err2
35 0.000000 koi_srad
36 0.002767 koi_srad_err1
37 0.000000 koi_srad_err2
38 0.000000 ra
39 0.000000 dec
40 0.000000 koi_kepmag

As we can see in the dataframe above, the koi_score column has by far the most importance when it comes to predicting the disposition of a planet. This makes sense intuitively: according to NASA, candidates and confirmed exoplanets usually have scores closer to 1, while false positives mostly have scores closer to 0.
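A quick way to sanity-check this pattern against the data itself is to average koi_score within each disposition group (a minimal sketch using the full df):

# Mean koi_score per disposition; NaN scores are skipped automatically
df.groupby("koi_disposition")["koi_score"].mean()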

# Find the score of the training and test sets to see how accurate my model is

print(f"The training score is {clf_tree.score(X_train_tree, y_train_tree)}")
print(f"The test score is {clf_tree.score(X_test_tree, y_test_tree)}")
The training score is 0.8919095543066266
The test score is 0.8834291688447464

This model does a much better job of predicting a planet’s disposition. The scores are also very similar between the training and test sets, which suggests there is no overfitting.

Using log loss to determine accuracy of model predictions

Now I want to evaluate my three models (clf, clf2, and clf_tree) with log loss to get a more accurate read on how well each of them works.
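As a reminder of what log loss measures: it is the average negative log of the probability a model assigns to the true class, so lower is better, and confident wrong answers are punished heavily. Here is a minimal hand-rolled sketch (the function name is mine; sklearn’s log_loss additionally clips probabilities away from zero):

import numpy as np

def log_loss_sketch(y_true, proba, classes):
    # proba has one column per class, in the order given by classes
    class_index = {c: i for i, c in enumerate(classes)}
    cols = np.array([class_index[y] for y in y_true])
    rows = np.arange(len(cols))
    # average negative log-probability assigned to the true class
    return -np.mean(np.log(proba[rows, cols]))

# For example: log_loss_sketch(y_test, clf.predict_proba(X_test), clf.classes_)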

clf_train_log_loss = log_loss(y_train, clf.predict_proba(X_train))
clf_test_log_loss = log_loss(y_test, clf.predict_proba(X_test))

print(f"The log loss on the training set for my first Logistic Regression model is {clf_train_log_loss}\nThe log loss on the test set is {clf_test_log_loss}")

clf2_train_log_loss = log_loss(y_train2, clf2.predict_proba(X_train2))
clf2_test_log_loss = log_loss(y_test2, clf2.predict_proba(X_test2))

print(f"\nThe log loss on the training set for my second Logistic Regression model is {clf2_train_log_loss}\nThe log loss on the test set is {clf2_test_log_loss}")

clf_tree_train_log_loss = log_loss(y_train_tree, clf_tree.predict_proba(X_train_tree))
clf_tree_test_log_loss = log_loss(y_test_tree, clf_tree.predict_proba(X_test_tree))

print(f"\nThe log loss on the training set for my Decision Tree model is {clf_tree_train_log_loss}\nThe log loss on the test set is {clf_tree_test_log_loss}")
The log loss on the training set for my first Logistic Regression model is 0.8694039930140798
The log loss on the test set is 0.8694039930140798

The log loss on the training set for my second Logistic Regression model is 0.7641640238247506
The log loss on the test set is 0.7641640238247506

The log loss on the training set for my Decision Tree model is 0.26968882404713607
The log loss on the test set is 0.31605228969315485

From these log loss scores I can tell that the Decision Tree model is clearly the most accurate one. The log losses for both Logistic Regression models were really high (which I expected based on how low their scores were, though I was surprised by just how high they ended up being), while the log loss for the Decision Tree model was much lower. The identical training and test values printed above for the two Logistic Regression models turned out to come from a typo in my print statements (I printed the training value twice), which is corrected in the code above. The tree’s log loss on the test set is definitely higher than on the training set, which is a little interesting to me, so I want to see exactly which max_leaf_nodes and max_depth values work best for the decision tree model, to prevent underfitting and overfitting and get the lowest log loss.

Here I am going to see which max_leaf_nodes number works best:

train_error_dict = {}
test_error_dict = {}

for n in range(2, 41):
    clf_tree2 = DecisionTreeClassifier(max_depth=15, max_leaf_nodes=n)
    clf_tree2.fit(X_train_tree, y_train_tree)
    train_error_dict[n] = log_loss(y_train_tree, clf_tree2.predict_proba(X_train_tree))
    test_error_dict[n] = log_loss(y_test_tree, clf_tree2.predict_proba(X_test_tree))
df_train = pd.DataFrame({"y":train_error_dict, "type": "train"})
df_test = pd.DataFrame({"y":test_error_dict, "type": "test"})

df_small = pd.concat([df_train, df_test]).reset_index()

alt.Chart(df_small).mark_line(clip=True).encode(
    x="index:O",
    y='y',
    color="type"
).properties(
    title='Log Loss for max_leaf_nodes of 2 to 40')
train_error_dict2 = {}
test_error_dict2 = {}

for n in range(2, 41):
    clf_tree3 = DecisionTreeClassifier(max_depth=n, max_leaf_nodes=18)
    clf_tree3.fit(X_train_tree, y_train_tree)
    train_error_dict2[n] = log_loss(y_train_tree, clf_tree3.predict_proba(X_train_tree))
    test_error_dict2[n] = log_loss(y_test_tree, clf_tree3.predict_proba(X_test_tree))
df_train2 = pd.DataFrame({"y":train_error_dict2, "type": "train"})
df_test2 = pd.DataFrame({"y":test_error_dict2, "type": "test"})

df_small2 = pd.concat([df_train2, df_test2]).reset_index()

alt.Chart(df_small2).mark_line(clip=True).encode(
    x="index:O",
    y='y',
    color="type"
).properties(
    title='Log Loss for max_depth of 2 to 40')

Based on both of these train and test error curves, it seems like a wide range of max_leaf_nodes and max_depth values work well for the Decision Tree model. The values I was using originally (max_leaf_nodes=18 and max_depth=15) are well within the range of parameters that, according to these plots, work best and prevent over- and underfitting.
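To read the best values off programmatically rather than by eye, here is a quick sketch using the error dictionaries built above:

# Hyperparameter values with the lowest test log loss
best_leaves = min(test_error_dict, key=test_error_dict.get)
best_depth = min(test_error_dict2, key=test_error_dict2.get)
print(f"Best max_leaf_nodes: {best_leaves}, best max_depth: {best_depth}")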

K-Nearest Neighbors

I am going to try using K-Nearest Neighbors to predict dispositions, using the same train-test split that I used for the Decision Tree.

train_error_dict3 = {}
test_error_dict3 = {}

for k in range(10, 2001, 100):
    clf_k = KNeighborsClassifier(n_neighbors=k)
    clf_k.fit(X_train_tree, y_train_tree)
    train_error_dict3[k] = log_loss(y_train_tree, clf_k.predict_proba(X_train_tree))
    test_error_dict3[k] = log_loss(y_test_tree, clf_k.predict_proba(X_test_tree))
df_train3 = pd.DataFrame({"y":train_error_dict3, "type": "train"})
df_test3 = pd.DataFrame({"y":test_error_dict3, "type": "test"})

# Create a combined dataframe with a 1/k column, so the error curve runs
# from less flexible models (small 1/k) to more flexible ones (large 1/k)
df_small3 = pd.concat([df_train3, df_test3]).reset_index()
df_small3.rename(columns = {"index": "K- neighbors"}, inplace = True)
df_small3["1/k"] = df_small3["K- neighbors"].map(lambda x: 1/x)

alt.Chart(df_small3).mark_line(clip=True).encode(
    x="1/k:Q",
    y='y',
    color="type"
).properties(
    title='Log Loss for k-neighbors of 10 to 2000')

This train and test error curve shows that the best number of neighbors is fairly large, roughly n_neighbors between 100 and 200. Much smaller values of k will most likely overfit, and much larger values will underfit, both of which I want to avoid.
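The same dictionary trick from the decision tree section confirms this programmatically (a sketch):

# k with the lowest test log loss among the values I tried
best_k = min(test_error_dict3, key=test_error_dict3.get)
print(f"Best n_neighbors among those tested: {best_k}")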

clf_k_neighbors = KNeighborsClassifier(n_neighbors=150)
clf_k_neighbors.fit(X_train_tree, y_train_tree)
KNeighborsClassifier(n_neighbors=150)
df5 = df.copy()
for col in cols2:
    df5[col] = df5[col].fillna(df5[col].median())
df5["Pred"] = clf_k_neighbors.predict(df5[cols2])
df5
rowid kepid kepoi_name kepler_name koi_disposition koi_pdisposition koi_score koi_fpflag_nt koi_fpflag_ss koi_fpflag_co ... koi_slogg koi_slogg_err1 koi_slogg_err2 koi_srad koi_srad_err1 koi_srad_err2 ra dec koi_kepmag Pred
0 1 10797460 K00752.01 Kepler-227 b CONFIRMED CANDIDATE 1.000 0 0 0 ... 4.467 0.064 -0.096 0.927 0.105 -0.061 291.93423 48.141651 15.347 CONFIRMED
1 2 10797460 K00752.02 Kepler-227 c CONFIRMED CANDIDATE 0.969 0 0 0 ... 4.467 0.064 -0.096 0.927 0.105 -0.061 291.93423 48.141651 15.347 CONFIRMED
2 3 10811496 K00753.01 NaN FALSE POSITIVE FALSE POSITIVE 0.000 0 1 0 ... 4.544 0.044 -0.176 0.868 0.233 -0.078 297.00482 48.134129 15.436 FALSE POSITIVE
3 4 10848459 K00754.01 NaN FALSE POSITIVE FALSE POSITIVE 0.000 0 1 0 ... 4.564 0.053 -0.168 0.791 0.201 -0.067 285.53461 48.285210 15.597 FALSE POSITIVE
4 5 10854555 K00755.01 Kepler-664 b CONFIRMED CANDIDATE 1.000 0 0 0 ... 4.438 0.070 -0.210 1.046 0.334 -0.133 288.75488 48.226200 15.509 FALSE POSITIVE
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9559 9560 10031643 K07984.01 NaN FALSE POSITIVE FALSE POSITIVE 0.000 0 0 0 ... 4.296 0.231 -0.189 1.088 0.313 -0.228 298.74921 46.973351 14.478 CANDIDATE
9560 9561 10090151 K07985.01 NaN FALSE POSITIVE FALSE POSITIVE 0.000 0 1 1 ... 4.529 0.035 -0.196 0.903 0.237 -0.079 297.18875 47.093819 14.082 FALSE POSITIVE
9561 9562 10128825 K07986.01 NaN CANDIDATE CANDIDATE 0.497 0 0 0 ... 4.444 0.056 -0.224 1.031 0.341 -0.114 286.50937 47.163219 14.757 FALSE POSITIVE
9562 9563 10147276 K07987.01 NaN FALSE POSITIVE FALSE POSITIVE 0.021 0 0 1 ... 4.447 0.056 -0.224 1.041 0.341 -0.114 294.16489 47.176281 15.385 FALSE POSITIVE
9563 9564 10156110 K07989.01 NaN FALSE POSITIVE FALSE POSITIVE 0.000 0 0 1 ... 4.385 0.054 -0.216 1.193 0.410 -0.137 297.00977 47.121021 14.826 CANDIDATE

9564 rows × 51 columns

print(f"The score on the training set is {clf_k_neighbors.score(X_train_tree, y_train_tree)}\nThe score on the test set is {clf_k_neighbors.score(X_test_tree, y_test_tree)}")
The score on the training set is 0.6535093451836361
The score on the test set is 0.6340825927861997

These scores are not very high, but the training and test scores are so similar that there does not seem to be any overfitting.
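One likely reason the scores are low is that K-Nearest Neighbors works on raw Euclidean distances, so columns with large numeric ranges (like koi_depth) can drown out the rest. I did not try this in my analysis, but here is a hedged sketch of the same model with standardized features, reusing the pipeline idea from earlier:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical: scale each feature to mean 0 / variance 1 before KNN
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=150))
knn_scaled.fit(X_train_tree, y_train_tree)
print(f"Scaled KNN test score: {knn_scaled.score(X_test_tree, y_test_tree)}")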

Summary

In this project I’ve used two Logistic Regression models (with different numbers of feature columns), a Decision Tree model, and a K-Nearest Neighbors model to try to predict exoplanet dispositions from the data in the NASA dataset. I fit each model on a training set, evaluated it on a held-out test set, and calculated accuracy scores and log loss to determine the best model. The Decision Tree model seems to be the superior one, with the highest scores and the lowest log loss, even compared to K-Nearest Neighbors. Along the way I also performed some feature engineering and plotted the data with Altair to get a feel for what it looks like.

References

  • What is the source of your dataset(s)?

https://www.kaggle.com/datasets/nasa/kepler-exoplanet-search-results?resource=download

  • Were any portions of the code or ideas taken from another source? List those sources here and say how they were used.

  1. I modified code from here to change the color schemes for some of my altair charts.

  2. I modified code from here to add titles to my charts.

The rest of my code that was modified from elsewhere came from lectures (from this quarter and winter quarter) or posted videos from this quarter.

  • List other references that you found helpful.

  1. https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761

  2. https://monkeylearn.com/blog/classification-algorithms/

  3. https://www.edureka.co/blog/classification-in-machine-learning/

  4. https://altair-viz.github.io/user_guide/generated/core/altair.Scale.html

  5. https://altair-viz.github.io/user_guide/generated/toplevel/altair.HConcatChart.html#altair.HConcatChart

  6. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
