Exoplanet Candidate Analysis¶

Author: Maya Drusinsky

Course Project, UC Irvine, Math 10, S22

I’ve been coding since around June 2021. I had an internship at a research internship for the past 10ish months where my job was to use computer vision and some machine learning to analyze images that the researchers produced from their experiments. Most of my python knowledge involved OpenCV and matplotlib, but I also had some knowledge of Numpy, Seaborn, and Pandas. I taught myself coding (I had never taken a coding class until this quarter) so I was missing a lot of the basic knowledge of Python but I am pretty well versed in libraries we didn’t use as much in this class like OpenCV.

Introduction¶

exoplanet

This dataset contains approximately 10,000 exoplanet candidates discovered by NASA’s Kepler space telescope since it was launched. These are classified as false positives, candidates, or confirmed exoplanets. Along with their status, each object of interest has a column for its numerous descriptive factors (such as the source of its signal or its radius). A full description of all the columns and what they contain can be found here. The goal of my project is to use Machine Learning to use the features of these planets to try to accurately predict whether the planet’s disposition.

Main portion of the project¶

import pandas as pd
import altair as alt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import plot_tree
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss

df = pd.read_csv("kepler_exoplanets.csv")

df.head()

	rowid	kepid	kepoi_name	kepler_name	koi_disposition	koi_pdisposition	koi_score	koi_fpflag_ss	...	koi_steff_err2	koi_slogg	koi_slogg_err1	koi_slogg_err2	koi_srad	koi_srad_err1	koi_srad_err2	ra	dec	koi_kepmag
0	1	10797460	K00752.01	Kepler-227 b	CONFIRMED	CANDIDATE	1.000	0	...	-81.0	4.467	0.064	-0.096	0.927	0.105	-0.061	291.93423	48.141651	15.347
1	2	10797460	K00752.02	Kepler-227 c	CONFIRMED	CANDIDATE	0.969	0	...	-81.0	4.467	0.064	-0.096	0.927	0.105	-0.061	291.93423	48.141651	15.347
2	3	10811496	K00753.01	NaN	FALSE POSITIVE	FALSE POSITIVE	0.000	1	...	-176.0	4.544	0.044	-0.176	0.868	0.233	-0.078	297.00482	48.134129	15.436
3	4	10848459	K00754.01	NaN	FALSE POSITIVE	FALSE POSITIVE	0.000	1	...	-174.0	4.564	0.053	-0.168	0.791	0.201	-0.067	285.53461	48.285210	15.597
4	5	10854555	K00755.01	Kepler-664 b	CONFIRMED	CANDIDATE	1.000	0	...	-211.0	4.438	0.070	-0.210	1.046	0.334	-0.133	288.75488	48.226200	15.509

5 rows × 50 columns

df.shape

(9564, 50)

Plotting the data¶

To get an idea of what the data looks like I will plot a variety of altair charts using different features from the dataframe. The dataframe has a lot of rows and data points in it, so I am taking a sample of 5000 points so the charts are a little more readable.

sel = alt.selection_single(fields=["koi_disposition"])

c1 = alt.Chart(df.sample(5000)).mark_bar().encode(
    x = "koi_disposition", 
    y = "count(koi_disposition)", 
    color = "koi_disposition"
).properties(
    title='Exoplanet Disposition')

c2 = alt.Chart(df.sample(5000)).mark_circle(clip = True).encode(
    x = alt.X("koi_impact", scale = alt.Scale(domain = [0, 20])), 
    y = alt.Y("koi_period", scale = alt.Scale(domain = [0, 1300])), 
    color = alt.Color("koi_disposition", scale=alt.Scale(scheme="magma")), 
    tooltip=["koi_disposition","koi_impact", "koi_score"]
).add_selection(sel).transform_filter(sel)

c3 = alt.Chart(df.sample(5000)).mark_bar().encode(
    x = "koi_disposition", 
    y = "mean(koi_insol)", 
    color = "koi_disposition"
)

c4 = alt.Chart(df.sample(5000)).mark_bar().encode(
    x = "koi_disposition", 
    y = "mean(koi_impact)", 
    color = "koi_disposition"
)

c5 = alt.Chart(df.sample(5000)).mark_circle(clip = True).encode(
    x = alt.X("koi_impact", scale = alt.Scale(domain = [0, 4])),
    y = "koi_score", 
    color = alt.Color("koi_disposition", scale=alt.Scale(scheme="magma")), 
    tooltip=["koi_disposition","koi_impact", "koi_score"]
).add_selection(sel).transform_filter(sel)

c1

Below are just some interesting charts to see how different planet dispositions are positioned based on different columns from the dataset.

alt.hconcat(c3, c4)

c2

c5

Logistic Regression¶

I want to make a Logistic Regression model using the columns that (based off of their NASA description) seem the most important.

cols = ["koi_disposition", "koi_score", "koi_period", "koi_impact", "koi_duration", "koi_depth", "koi_prad", "koi_insol"]
cols1 = ["koi_score", "koi_period", "koi_impact", "koi_duration", "koi_depth", "koi_prad", "koi_insol"]

df2 = df[cols].copy()

df2.head()

	koi_disposition	koi_score	koi_period	koi_impact	koi_duration	koi_depth	koi_prad	koi_insol
0	CONFIRMED	1.000	9.488036	0.146	2.95750	615.8	2.26	93.59
1	CONFIRMED	0.969	54.418383	0.586	4.50700	874.8	2.83	9.11
2	FALSE POSITIVE	0.000	19.899140	0.969	1.78220	10829.0	14.60	39.30
3	FALSE POSITIVE	0.000	1.736952	1.276	2.40641	8079.2	33.46	891.96
4	CONFIRMED	1.000	2.525592	0.701	1.65450	603.3	2.75	926.16

# Feature Engineering to get rid of any NaN values
for col in cols1:
    df2[col] = df2[col].fillna(df2[col].median())

To get the most accurate results from this model, I’m going to divide my data into test and training sets.

X_train, X_test, y_train, y_test = train_test_split(df2[cols1], df2["koi_disposition"], test_size=0.2, random_state = 0)

print(f"X_train has {len(X_train)} rows and X_test has {len(X_test)} rows.")

X_train has 7651 rows and X_test has 1913 rows.

# Instantiate
clf = LogisticRegression(max_iter = 5000, random_state = 0)

# Fit
clf.fit(X_train, y_train)

LogisticRegression(max_iter=5000, random_state=0)

# Find the score of the training and test sets to see how accurate my model is

print(f"The training score is {clf.score(X_train, y_train)}")
print(f"The test score is {clf.score(X_test, y_test)}")

The training score is 0.5248987060514966
The test score is 0.5295347621536853

# Add a predicted values column to my dataframe

df2["Pred"] = clf.predict(df2[cols1])

df2

	koi_disposition	koi_score	koi_period	koi_impact	koi_duration	koi_depth	koi_prad	koi_insol	Pred
0	CONFIRMED	1.000	9.488036	0.146	2.95750	615.8	2.26	93.59	FALSE POSITIVE
1	CONFIRMED	0.969	54.418383	0.586	4.50700	874.8	2.83	9.11	FALSE POSITIVE
2	FALSE POSITIVE	0.000	19.899140	0.969	1.78220	10829.0	14.60	39.30	FALSE POSITIVE
3	FALSE POSITIVE	0.000	1.736952	1.276	2.40641	8079.2	33.46	891.96	FALSE POSITIVE
4	CONFIRMED	1.000	2.525592	0.701	1.65450	603.3	2.75	926.16	FALSE POSITIVE
...	...	...	...	...	...	...	...	...	...
9559	FALSE POSITIVE	0.000	8.589871	0.765	4.80600	87.7	1.11	176.40	FALSE POSITIVE
9560	FALSE POSITIVE	0.000	0.527699	1.252	3.22210	1579.2	29.35	4500.53	FALSE POSITIVE
9561	CANDIDATE	0.497	1.739849	0.043	3.11400	48.5	0.72	1585.81	FALSE POSITIVE
9562	FALSE POSITIVE	0.021	0.681402	0.147	0.86500	103.6	1.07	5713.41	FALSE POSITIVE
9563	FALSE POSITIVE	0.000	4.856035	0.134	3.07800	76.7	1.05	607.42	FALSE POSITIVE

9564 rows × 9 columns

Here we can see this model did a pretty bad job predicting planet disposition- it is predicting way too many planets as False Positives. I want to see if using Logistic Regression with more columns will improve the score of my model while also prevent overfitting.

cols2 = ["koi_score", "koi_fpflag_nt", "koi_fpflag_ss", "koi_fpflag_co", "koi_fpflag_ec", "koi_period", "koi_period_err1", "koi_period_err2", "koi_time0bk", "koi_time0bk_err1", "koi_time0bk_err2", "koi_impact", "koi_impact_err1",  "koi_impact_err2", "koi_duration", "koi_duration_err1", "koi_duration_err2", "koi_depth", "koi_depth_err1", "koi_depth_err2", "koi_prad", "koi_prad_err1", "koi_prad_err2", "koi_teq","koi_insol", "koi_insol_err1", "koi_insol_err2", "koi_model_snr", "koi_tce_plnt_num", "koi_steff", "koi_steff_err1", "koi_steff_err2", "koi_slogg", "koi_slogg_err1", "koi_slogg_err2", "koi_srad", "koi_srad_err1", "koi_srad_err2", "ra", "dec", "koi_kepmag"]
print(f"cols2 has {len(cols2)} columns in it.")

cols2 has 41 columns in it.

I am going to make another dataframe for this model to avoid getting my data or predictions mixed up. I’m going to eliminate all the empty values from this dataset as well.

df3 = df.copy()

for col in cols2:
    df3[col] = df3[col].fillna(df3[col].median())

Again, I am going to divide my data into training and test sets. These sets will be different than the last ones because I am using more columns this time.

X_train2, X_test2, y_train2, y_test2 = train_test_split(df3[cols2], df3["koi_disposition"], test_size=0.2, random_state = 0)

Here I am going to make a secong logistic regression model using my new training and test sets with more columns than the first one, and I hope this model will be more accurate since the first one was not very accurate.

# Instantiate
clf2 = LogisticRegression(max_iter = 5000)

# Fit
clf2.fit(X_train2, y_train2)

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:818: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG,

LogisticRegression(max_iter=5000)

# Find the score of the training and test sets to see how accurate my model is

print(f"The training score is {clf2.score(X_train2, y_train2)}")
print(f"The test score is {clf2.score(X_test2, y_test2)}")

The training score is 0.6559926806953339
The test score is 0.6414009409304757

This model seems a little more accurate and since the training and test scores are so similar there does not seem to be overfitting. Now I want to add these predicted values to my new dataset I made earlier.

# Add a predicted values column to my dataframe

df3["Pred"] = clf2.predict(df3[cols2])

Feature Engineering¶

I want to display my Dataframe but there are some unnecessary columns. As we can see below, “kepoi_name”, “kepler_name”, “koi_disposition”, “koi_pdisposition”, “koi_tce_delivname”, and “Pred” are all columns with object data types. Out of these columns, only “koi_disposition” and “Pred” are important to us (because these are the true and predicted dispositions) so I want to get rid of the rest. “rowid” and “kepid” are also unnecessary because they are just ID numbers, so even though they are integers, they provide no useful information so I will delete those from the dataframe as well.

df3.dtypes

rowid                  int64
kepid                  int64
kepoi_name            object
kepler_name           object
koi_disposition       object
koi_pdisposition      object
koi_score            float64
koi_fpflag_nt          int64
koi_fpflag_ss          int64
koi_fpflag_co          int64
koi_fpflag_ec          int64
koi_period           float64
koi_period_err1      float64
koi_period_err2      float64
koi_time0bk          float64
koi_time0bk_err1     float64
koi_time0bk_err2     float64
koi_impact           float64
koi_impact_err1      float64
koi_impact_err2      float64
koi_duration         float64
koi_duration_err1    float64
koi_duration_err2    float64
koi_depth            float64
koi_depth_err1       float64
koi_depth_err2       float64
koi_prad             float64
koi_prad_err1        float64
koi_prad_err2        float64
koi_teq              float64
koi_teq_err1         float64
koi_teq_err2         float64
koi_insol            float64
koi_insol_err1       float64
koi_insol_err2       float64
koi_model_snr        float64
koi_tce_plnt_num     float64
koi_tce_delivname     object
koi_steff            float64
koi_steff_err1       float64
koi_steff_err2       float64
koi_slogg            float64
koi_slogg_err1       float64
koi_slogg_err2       float64
koi_srad             float64
koi_srad_err1        float64
koi_srad_err2        float64
ra                   float64
dec                  float64
koi_kepmag           float64
Pred                  object
dtype: object

# Feature Engineering to remove unnecessary columns from this dataframe before we display it

droplist = ["rowid", "kepid", "kepoi_name", "kepler_name", "koi_pdisposition", "koi_teq_err2", "koi_teq_err1", "koi_tce_delivname"]
print(f"Old shape: {df3.shape}")

[df3.drop(i, axis = 1, inplace = True) for i in droplist]

print(f"New shape: {df3.shape}")

Old shape: (9564, 51)
New shape: (9564, 43)

df3

	koi_disposition	koi_score	koi_fpflag_nt	koi_fpflag_ss	koi_fpflag_co	koi_fpflag_ec	koi_period	koi_period_err1	koi_period_err2	koi_time0bk	...	koi_slogg	koi_slogg_err1	koi_slogg_err2	koi_srad	koi_srad_err1	koi_srad_err2	ra	dec	koi_kepmag	Pred
0	CONFIRMED	1.000	0	0	0	0	9.488036	2.775000e-05	-2.775000e-05	170.538750	...	4.467	0.064	-0.096	0.927	0.105	-0.061	291.93423	48.141651	15.347	CONFIRMED
1	CONFIRMED	0.969	0	0	0	0	54.418383	2.479000e-04	-2.479000e-04	162.513840	...	4.467	0.064	-0.096	0.927	0.105	-0.061	291.93423	48.141651	15.347	CONFIRMED
2	FALSE POSITIVE	0.000	0	1	0	0	19.899140	1.494000e-05	-1.494000e-05	175.850252	...	4.544	0.044	-0.176	0.868	0.233	-0.078	297.00482	48.134129	15.436	FALSE POSITIVE
3	FALSE POSITIVE	0.000	0	1	0	0	1.736952	2.630000e-07	-2.630000e-07	170.307565	...	4.564	0.053	-0.168	0.791	0.201	-0.067	285.53461	48.285210	15.597	FALSE POSITIVE
4	CONFIRMED	1.000	0	0	0	0	2.525592	3.761000e-06	-3.761000e-06	171.595550	...	4.438	0.070	-0.210	1.046	0.334	-0.133	288.75488	48.226200	15.509	FALSE POSITIVE
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
9559	FALSE POSITIVE	0.000	0	0	0	1	8.589871	1.846000e-04	-1.846000e-04	132.016100	...	4.296	0.231	-0.189	1.088	0.313	-0.228	298.74921	46.973351	14.478	FALSE POSITIVE
9560	FALSE POSITIVE	0.000	0	1	1	0	0.527699	1.160000e-07	-1.160000e-07	131.705093	...	4.529	0.035	-0.196	0.903	0.237	-0.079	297.18875	47.093819	14.082	FALSE POSITIVE
9561	CANDIDATE	0.497	0	0	0	0	1.739849	1.780000e-05	-1.780000e-05	133.001270	...	4.444	0.056	-0.224	1.031	0.341	-0.114	286.50937	47.163219	14.757	FALSE POSITIVE
9562	FALSE POSITIVE	0.021	0	0	1	0	0.681402	2.434000e-06	-2.434000e-06	132.181750	...	4.447	0.056	-0.224	1.041	0.341	-0.114	294.16489	47.176281	15.385	FALSE POSITIVE
9563	FALSE POSITIVE	0.000	0	0	1	1	4.856035	6.356000e-05	-6.356000e-05	135.993300	...	4.385	0.054	-0.216	1.193	0.410	-0.137	297.00977	47.121021	14.826	FALSE POSITIVE

9564 rows × 43 columns

This model did a little better on predicting the planet disposition but I think it could be even better. Now I’ll try it with a decision tree model to see if that works better.

Decision Tree¶

# Create a new DataFrame where the columns in cols2 have no empty values
df4 = df.copy()
for col in cols2:
    df4[col] = df4[col].fillna(df4[col].median())

To avoid confusion between models, I am going to create new test and training sets, and name this classifier clf_tree.

# Instantiate
clf_tree = DecisionTreeClassifier(max_depth = 15, max_leaf_nodes=18)

# Train and Test sets
X_train_tree, X_test_tree, y_train_tree, y_test_tree = train_test_split(df4[cols2], df4["koi_disposition"], test_size=0.2, random_state = 0)

# Fit the DecisionTreeClassifier
clf_tree.fit(X_train2, y_train2)

DecisionTreeClassifier(max_depth=15, max_leaf_nodes=18)

# Plot the Decision Tree
fig = plt.figure(figsize=(200,100))
_ = plot_tree(
    clf_tree, 
    feature_names = clf_tree.feature_names_in_, 
    class_names = clf_tree.classes_,
    filled = True
)

I’m curious to see which columns are the most important to my decision tree model:

# Create a DataFrame of the columns ("features") and their importance in relation to 
# predictions by the Decision Tree
df_features = pd.DataFrame({"Importance": clf_tree.feature_importances_, "feature": clf_tree.feature_names_in_})
df_features

	Importance	feature
0	0.693282	koi_score
1	0.084338	koi_fpflag_nt
2	0.014428	koi_fpflag_ss
3	0.019486	koi_fpflag_co
4	0.000000	koi_fpflag_ec
5	0.000000	koi_period
6	0.006403	koi_period_err1
7	0.000000	koi_period_err2
8	0.003890	koi_time0bk
9	0.000000	koi_time0bk_err1
10	0.000000	koi_time0bk_err2
11	0.023268	koi_impact
12	0.000000	koi_impact_err1
13	0.000000	koi_impact_err2
14	0.009453	koi_duration
15	0.000000	koi_duration_err1
16	0.000000	koi_duration_err2
17	0.000000	koi_depth
18	0.000000	koi_depth_err1
19	0.000000	koi_depth_err2
20	0.015040	koi_prad
21	0.000000	koi_prad_err1
22	0.000000	koi_prad_err2
23	0.000000	koi_teq
24	0.000000	koi_insol
25	0.000000	koi_insol_err1
26	0.000000	koi_insol_err2
27	0.127646	koi_model_snr
28	0.000000	koi_tce_plnt_num
29	0.000000	koi_steff
30	0.000000	koi_steff_err1
31	0.000000	koi_steff_err2
32	0.000000	koi_slogg
33	0.000000	koi_slogg_err1
34	0.000000	koi_slogg_err2
35	0.000000	koi_srad
36	0.002767	koi_srad_err1
37	0.000000	koi_srad_err2
38	0.000000	ra
39	0.000000	dec
40	0.000000	koi_kepmag

As we can see in the above dataframe, the koi_score column has the most importance when it comes to predicting the disposition of a planet. This makes sense intuitively because according to NASA, candidates and confirmed exoplanets will have scores closer to 1 most of the time while false positives will mostly have scores closer to 0.

# Find the score of the training and test sets to see how accurate my model is

print(f"The training score is {clf_tree.score(X_train_tree, y_train_tree)}")
print(f"The test score is {clf_tree.score(X_test_tree, y_test_tree)}")

The training score is 0.8919095543066266
The test score is 0.8834291688447464

This model seems to do a much better job at predicting a planet’s disposition. The scores are also very similar between the training and testing sets which implies that there is not overfitting.

Using log loss to determine accuracy of model predictions¶

Now I want to use my three models (clf, clf2, and clf_tree) in log loss functions to get a more accurate read on how well each of these models work.

clf_train_log_loss = log_loss(y_train, clf.predict_proba(X_train))
clf_test_log_loss = log_loss(y_test, clf.predict_proba(X_test))

print(f"The log loss on the training set for my first Logistic Regression model is {clf_train_log_loss}\nThe log loss on the test set is {clf_train_log_loss}")

clf2_train_log_loss = log_loss(y_train2, clf2.predict_proba(X_train2))
clf2_test_log_loss = log_loss(y_test2, clf2.predict_proba(X_test2))

print(f"\nThe log loss on the training set for my second Logistic Regression model is {clf2_train_log_loss}\nThe log loss on the test set is {clf2_train_log_loss}")

clf_tree_train_log_loss = log_loss(y_train_tree, clf_tree.predict_proba(X_train_tree))
clf_tree_test_log_loss = log_loss(y_test_tree, clf_tree.predict_proba(X_test_tree))

print(f"\nThe log loss on the training set for my Decision Tree model is {clf_tree_train_log_loss}\nThe log loss on the test set is {clf_tree_test_log_loss}")

The log loss on the training set for my first Logistic Regression model is 0.8694039930140798
The log loss on the test set is 0.8694039930140798

The log loss on the training set for my second Logistic Regression model is 0.7641640238247506
The log loss on the test set is 0.7641640238247506

The log loss on the training set for my Decision Tree model is 0.26968882404713607
The log loss on the test set is 0.31605228969315485

From these log loss scores I can tell that the Decision Tree model is clearly the most accurate one. The log loss for both logistic regression models were really high (which I excpected based on how low their scores were but I was surprised by how high their log losses actually ended up being) and the log loss for the Decision Tree model was much lower. The log loss for the test set was definitely higher than for the training set which is a little interesting to me so I want to see if I can see exactly which number of max leaf nodes and max depth works best for the decision tree model to prevent underfitting and overfitting and get the lowest log loss. Additionally, I think the log losses for the Logistic regression models were identical between the training and test sets because my dataset is so large and the training and test sets may look very similar.

Here I am going to see which max_leaf_nodes number works best:

train_error_dict = {}
test_error_dict = {}

for n in range(2, 41):
    clf_tree2 = DecisionTreeClassifier(max_depth=15, max_leaf_nodes=n)
    clf_tree2.fit(X_train_tree, y_train_tree)
    train_error_dict[n] = log_loss(y_train_tree, clf_tree2.predict_proba(X_train_tree))
    test_error_dict[n] = log_loss(y_test_tree, clf_tree2.predict_proba(X_test_tree))

df_train = pd.DataFrame({"y":train_error_dict, "type": "train"})
df_test = pd.DataFrame({"y":test_error_dict, "type": "test"})

df_small = pd.concat([df_train, df_test]).reset_index()

alt.Chart(df_small).mark_line(clip=True).encode(
    x="index:O",
    y='y',
    color="type"
).properties(
    title='Log Loss for max_leaf_nodes of 2 to 40')

train_error_dict2 = {}
test_error_dict2 = {}

for n in range(2, 41):
    clf_tree3 = DecisionTreeClassifier(max_depth=n, max_leaf_nodes=18)
    clf_tree3.fit(X_train_tree, y_train_tree)
    train_error_dict2[n] = log_loss(y_train_tree, clf_tree3.predict_proba(X_train_tree))
    test_error_dict2[n] = log_loss(y_test_tree, clf_tree3.predict_proba(X_test_tree))

df_train2 = pd.DataFrame({"y":train_error_dict2, "type": "train"})
df_test2 = pd.DataFrame({"y":test_error_dict2, "type": "test"})

df_small2 = pd.concat([df_train2, df_test2]).reset_index()

alt.Chart(df_small2).mark_line(clip=True).encode(
    x="index:O",
    y='y',
    color="type"
).properties(
    title='Log Loss for max_depth of 2 to 40')

By both of these train and test error curve plots, it seems like there are a lot of max_leaf_nodes and max_depths that work well for the Decision Tree model. The model I was using originally with max_leaf_nodes 18 and max_depth 15 is definitely in the range of parameters that, according to these plots, work the best and prevent over/underfitting.

K - Nearest Neighbors¶

I am going to try an use K-Nearest Neighbors to try and predict dispositions, using the same train test split that I used for the Decision Tree.

train_error_dict3 = {}
test_error_dict3 = {}

for k in range(10, 2001, 100):
    clf_k = KNeighborsClassifier(n_neighbors=k)
    clf_k.fit(X_train_tree, y_train_tree)
    train_error_dict3[k] = log_loss(y_train_tree, clf_k.predict_proba(X_train_tree))
    test_error_dict3[k] = log_loss(y_test_tree, clf_k.predict_proba(X_test_tree))

df_train3 = pd.DataFrame({"y":train_error_dict3, "type": "train"})
df_test3 = pd.DataFrame({"y":test_error_dict3, "type": "test"})

# Create a combined dataframe, adding a column of 1/k neighbors so the following plot looks more accurate
df_small3 = pd.concat([df_train3, df_test3]).reset_index()
df_small3.rename(columns = {"index": "K- neighbors"}, inplace = True)
df_small3["1/k"] = df_small3["K- neighbors"].map(lambda x: 1/x)

alt.Chart(df_small3).mark_line(clip=True).encode(
    x="1/k:Q",
    y='y',
    color="type"
).properties(
    title='Log Loss for k-neighbors of 10 to 2000')

This train and test error curve plot shows that the best k_neighbors to use is a very high number of k, such as between 100 to 200. Anything over that or under that range will most likely be underfitting or over fitting, which I want to avoid.

clf_k_neighbors = KNeighborsClassifier(n_neighbors=150)
clf_k_neighbors.fit(X_train_tree, y_train_tree)

KNeighborsClassifier(n_neighbors=150)

df5 = df.copy()
for col in cols2:
    df5[col] = df5[col].fillna(df5[col].median())

df5["Pred"] = clf_k_neighbors.predict(df5[cols2])

df5

	rowid	kepid	kepoi_name	kepler_name	koi_disposition	koi_pdisposition	koi_score	koi_fpflag_nt	koi_fpflag_ss	koi_fpflag_co	...	koi_slogg	koi_slogg_err1	koi_slogg_err2	koi_srad	koi_srad_err1	koi_srad_err2	ra	dec	koi_kepmag	Pred
0	1	10797460	K00752.01	Kepler-227 b	CONFIRMED	CANDIDATE	1.000	0	0	0	...	4.467	0.064	-0.096	0.927	0.105	-0.061	291.93423	48.141651	15.347	CONFIRMED
1	2	10797460	K00752.02	Kepler-227 c	CONFIRMED	CANDIDATE	0.969	0	0	0	...	4.467	0.064	-0.096	0.927	0.105	-0.061	291.93423	48.141651	15.347	CONFIRMED
2	3	10811496	K00753.01	NaN	FALSE POSITIVE	FALSE POSITIVE	0.000	0	1	0	...	4.544	0.044	-0.176	0.868	0.233	-0.078	297.00482	48.134129	15.436	FALSE POSITIVE
3	4	10848459	K00754.01	NaN	FALSE POSITIVE	FALSE POSITIVE	0.000	0	1	0	...	4.564	0.053	-0.168	0.791	0.201	-0.067	285.53461	48.285210	15.597	FALSE POSITIVE
4	5	10854555	K00755.01	Kepler-664 b	CONFIRMED	CANDIDATE	1.000	0	0	0	...	4.438	0.070	-0.210	1.046	0.334	-0.133	288.75488	48.226200	15.509	FALSE POSITIVE
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
9559	9560	10031643	K07984.01	NaN	FALSE POSITIVE	FALSE POSITIVE	0.000	0	0	0	...	4.296	0.231	-0.189	1.088	0.313	-0.228	298.74921	46.973351	14.478	CANDIDATE
9560	9561	10090151	K07985.01	NaN	FALSE POSITIVE	FALSE POSITIVE	0.000	0	1	1	...	4.529	0.035	-0.196	0.903	0.237	-0.079	297.18875	47.093819	14.082	FALSE POSITIVE
9561	9562	10128825	K07986.01	NaN	CANDIDATE	CANDIDATE	0.497	0	0	0	...	4.444	0.056	-0.224	1.031	0.341	-0.114	286.50937	47.163219	14.757	FALSE POSITIVE
9562	9563	10147276	K07987.01	NaN	FALSE POSITIVE	FALSE POSITIVE	0.021	0	0	1	...	4.447	0.056	-0.224	1.041	0.341	-0.114	294.16489	47.176281	15.385	FALSE POSITIVE
9563	9564	10156110	K07989.01	NaN	FALSE POSITIVE	FALSE POSITIVE	0.000	0	0	1	...	4.385	0.054	-0.216	1.193	0.410	-0.137	297.00977	47.121021	14.826	CANDIDATE

9564 rows × 51 columns

print(f"The score on the training set is {clf_k_neighbors.score(X_train_tree, y_train_tree)}\nThe score on the test set is {clf_k_neighbors.score(X_test_tree, y_test_tree)}")

The score on the training set is 0.6535093451836361
The score on the test set is 0.6340825927861997

These scores are not very high, but they suggest that there isn’t any overfitting because the training and test scores are so similar.

Summary¶

In this project I’ve used two logistic regression models (using different sizes of data from my dataset), a Decision Tree model, and a K- Nearest neighbors model to try and predict exoplanet dispositions from the data in the NASA dataset. I fit each model with training and test sets from the dataset, and calculated scores and log loss to determine the best model. To me, it seems like the Decision Tree model is the superior one because it has the highest scores and lowest log loss, even compared to K- nearest neighbors. I also performed some feature engineering and plotting using altair to get an idea of which models are best and what the data looks like.

References¶

What is the source of your dataset(s)?

https://www.kaggle.com/datasets/nasa/kepler-exoplanet-search-results?resource=download

Were any portions of the code or ideas taken from another source? List those sources here and say how they were used.

I modified code from here to change the color schemes for some of my altair charts.
I modified code from here to add titles to my charts.

The rest of my code that was modified from elsewhere came from lectures (from this quarter and winter quarter) or posted videos from this quarter.

List other references that you found helpful.

Created in Deepnote

UC Irvine Math 10 S22

Exoplanet Candidate Analysis

Contents