Star Type

Author: Yufei Ren

Course Project, UC Irvine, Math 10, W22

Introduction

The dataset “Star dataset to predict star types” consists of several features of stars in 6 categories: Brown Dwarf, Red Dwarf, White Dwarf, Main Sequence, Supergiant, and Hypergiant, which are respectively assigned the numbers 0, 1, 2, 3, 4, 5.

In this project, the temperature, radius, absolute magnitude (Mv), and luminosity are first used to predict the star type. After that, scikit-learn is used to find the relationship between temperature, radius, and luminosity.

Main portion of the project


import numpy as np
import pandas as pd
import seaborn as sns
import altair as alt
df = pd.read_csv("/work/6 class csv.csv")
df = df.dropna(axis=1)  # drop columns that contain missing values
df.head()
Temperature (K) Luminosity(L/Lo) Radius(R/Ro) Absolute magnitude(Mv) Star type Star color Spectral Class
0 3068 0.002400 0.1700 16.12 0 Red M
1 3042 0.000500 0.1542 16.60 0 Red M
2 2600 0.000300 0.1020 18.70 0 Red M
3 2800 0.000200 0.1600 16.65 0 Red M
4 1939 0.000138 0.1030 20.06 0 Red M
df.describe()
Temperature (K) Luminosity(L/Lo) Radius(R/Ro) Absolute magnitude(Mv) Star type
count 240.000000 240.000000 240.000000 240.000000 240.000000
mean 10497.462500 107188.361635 237.157781 4.382396 2.500000
std 9552.425037 179432.244940 517.155763 10.532512 1.711394
min 1939.000000 0.000080 0.008400 -11.920000 0.000000
25% 3344.250000 0.000865 0.102750 -6.232500 1.000000
50% 5776.000000 0.070500 0.762500 8.313000 2.500000
75% 15055.500000 198050.000000 42.750000 13.697500 4.000000
max 40000.000000 849420.000000 1948.500000 20.060000 5.000000
df.columns
Index(['Temperature (K)', 'Luminosity(L/Lo)', 'Radius(R/Ro)',
       'Absolute magnitude(Mv)', 'Star type', 'Star color', 'Spectral Class'],
      dtype='object')

Altair charts are used to visualize the dataset before making predictions.

brush = alt.selection_interval()  # interactive brush linking the two charts
c1 = alt.Chart(df).mark_point().encode(
    x='Absolute magnitude(Mv)',
    y='Radius(R/Ro):Q',
    color='Star type:N'
).add_selection(brush)

c2 = alt.Chart(df).mark_bar().encode(
    x='Star type:N',
    y='Absolute magnitude(Mv)'
).transform_filter(brush)  # c2 shows only the points selected in c1

c1|c2

Predict the Star type

First, a KNeighborsClassifier is used to predict the star type.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
X = df.iloc[:, :4]  # Temperature, Luminosity, Radius, Absolute magnitude
y = df["Star type"]

Before fitting the K-Nearest Neighbors classifier, a StandardScaler is used to put the features on a common scale; otherwise the distance computation would be dominated by the features with the largest numeric ranges, such as temperature and luminosity.

scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
loss_train = log_loss(y_train, clf.predict_proba(X_train))
loss_test = log_loss(y_test, clf.predict_proba(X_test))
print(f"The log_loss of X_train and y_train is {loss_train:.2f}")
print(f"The log_loss of X_test and y_test is {loss_test:.2f}")
The log_loss of X_train and y_train is 0.03
The log_loss of X_test and y_test is 0.02
df['predict_K'] = clf.predict(X_scaled)

The log loss on the test data is small and close to the training loss, so there is no sign of overfitting.

(df["Star type"] == df["predict_K"]).value_counts()
True     237
False      3
dtype: int64

Here we can see that the predictions match the true labels for 237 of the 240 stars, again showing no sign of over-fitting.
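As an extra sanity check (a minimal sketch, not part of the original analysis; it reuses the split and classifier fitted above), the train and test accuracies can be computed directly:

from sklearn.metrics import accuracy_score

# Accuracy on both splits; values near 1.0 on each split are consistent
# with the small log losses reported above.
print(f"train accuracy: {accuracy_score(y_train, clf.predict(X_train)):.3f}")
print(f"test accuracy: {accuracy_score(y_test, clf.predict(X_test)):.3f}")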

Predict the Luminosity

After using K-Nearest Neighbors to predict the type of a star, I am interested in finding how the radius and temperature are related to the luminosity of a star.

I first try LinearRegression.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
X2 = df[['Radius(R/Ro)','Temperature (K)']]
y2 = df['Luminosity(L/Lo)']
reg1 = LinearRegression()
reg1.fit(X2,y2)
MSE1 = mean_squared_error(y2,reg1.predict(X2))
MAE1 = mean_absolute_error(y2,reg1.predict(X2))
print(f"the coefficients of reg are {reg1.coef_}")
print(f"the intersept of reg is {reg1.intercept_}.")
print(f'The Mean square error is {MSE1:.3f}')
print(f'The Mean absolute error is {MAE1:.3f}')
the coefficients of reg are [174.63473048   6.78255167]
the intercept of reg is -5427.205389764014.
The Mean square error is 19010660326.728
The Mean absolute error is 87486.680

The MSE is far too high in this case, so I try a KNeighborsRegressor instead; again, the inputs should be scaled first because the features are not in the same units.

from sklearn.neighbors import KNeighborsRegressor
scaler = StandardScaler()
scaler.fit(X2)
X2_scaled = scaler.transform(X2)
reg2 = KNeighborsRegressor(n_neighbors=4)
reg2.fit(X2_scaled, y2)
df['predict_l'] = reg2.predict(X2_scaled)
MSE2 = mean_squared_error(y2, reg2.predict(X2_scaled))
MAE2 = mean_absolute_error(y2, reg2.predict(X2_scaled))
print(f'The Mean square error is {MSE2:.3f}')
print(f'The Mean absolute error is {MAE2:.3f}')
The Mean square error is 9605051475.136
The Mean absolute error is 47796.629

The error is still large, but smaller than the error of the linear regression. The reason might be that the relationship is not linear but polynomial; physically, the Stefan–Boltzmann law L = 4πR²σT⁴ suggests that luminosity grows polynomially in radius and temperature.
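One quick way to check this (a sketch under the assumption that a power-law relation roughly holds in this dataset; it is not part of the original analysis) is a linear regression on log-transformed features, since L ∝ R²T⁴ becomes log L ≈ 2 log R + 4 log T + const:

# Fit log L against log R and log T; fitted exponents near 2 and 4 would
# support a polynomial (power-law) relationship between the variables.
X_log = np.log(df[['Radius(R/Ro)', 'Temperature (K)']])
y_log = np.log(df['Luminosity(L/Lo)'])
reg_log = LinearRegression().fit(X_log, y_log)
print(f"fitted exponents (radius, temperature): {reg_log.coef_}")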

To check whether the relationship is polynomial, PolynomialFeatures is used.

df3 = df.iloc[:,:3]
df3.columns
Index(['Temperature (K)', 'Luminosity(L/Lo)', 'Radius(R/Ro)'], dtype='object')
y_ply = df['Luminosity(L/Lo)']
X_ply = df[['Temperature (K)', 'Radius(R/Ro)']]
from sklearn.preprocessing import PolynomialFeatures

Here I first create a dataframe that contains all monomial combinations of temperature and radius up to degree 9.

poly = PolynomialFeatures(degree=9)
df_ply = pd.DataFrame(poly.fit_transform(X_ply))
df_ply.columns = poly.get_feature_names_out()
df_ply
       1  Temperature (K)  Radius(R/Ro)  ...  Temperature (K) Radius(R/Ro)^8  Radius(R/Ro)^9
0    1.0           3068.0        0.1700  ...                    2.140162e-03    1.185879e-07
1    1.0           3042.0        0.1542  ...                    9.723759e-04    4.929006e-08
2    1.0           2600.0        0.1020  ...                    3.046314e-05    1.195093e-09
3    1.0           2800.0        0.1600  ...                    1.202591e-03    6.871948e-08
4    1.0           1939.0        0.1030  ...                    2.456267e-05    1.304773e-09
..   ...              ...           ...  ...                             ...             ...
235  1.0          38940.0     1356.0000  ...                    4.451163e+29    1.550020e+28
236  1.0          30839.0     1194.0000  ...                    1.273899e+29    4.932180e+27
237  1.0           8829.0     1423.0000  ...                    1.484399e+29    2.392457e+28
238  1.0           9235.0     1112.0000  ...                    2.159112e+28    2.599819e+27
239  1.0          37882.0     1783.0000  ...                    3.869400e+30    1.821219e+29

240 rows × 55 columns

Then I apply a linear regression of luminosity on each polynomial feature separately and calculate the error. In the end, I print out the smallest error and the feature that produces it.

error_dict = {}
for column in df_ply:
    reg = LinearRegression()
    reg.fit(df_ply[[column]], y_ply)
    error = mean_squared_error(reg.predict(df_ply[[column]]), y_ply)
    error_dict[error] = column  # map each feature's MSE to the feature name
print("the smallest mean squared error is", min(error_dict), 'from column', error_dict[min(error_dict)])
the smallest mean squared error is 23173652154.675903 from column Radius(R/Ro)

Here we can see that the lowest mean squared error is around 2.3 × 10^10, and the best single feature is Radius^1 · Temperature^0, i.e. the radius alone.

The error is still very large; a possible reason is that all star types are evaluated together while their luminosity ranges are on very different scales. As a result, the star types are evaluated separately below.

alt.Chart(df).mark_boxplot(extent='min-max').encode(
    x='Star type:N',
    y='Luminosity(L/Lo):Q'
)

In the boxplot above, it is apparent that the luminosity ranges of the different star types are on very different scales.
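To see all six types on a comparable footing, the same boxplot can be drawn with a logarithmic y-axis (a sketch, not part of the original analysis; alt.Scale(type='log') is standard Altair):

# A log-scale y-axis compresses the huge spread between dwarf and giant
# luminosities so every star type's box is visible at once.
alt.Chart(df).mark_boxplot(extent='min-max').encode(
    x='Star type:N',
    y=alt.Y('Luminosity(L/Lo):Q', scale=alt.Scale(type='log'))
)

With this scale difference in mind, the per-type polynomial search is wrapped in a function below.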

def find_combination(star_type):
    df_star = df[df['Star type'] == star_type].iloc[:,:3]
    X = df_star[['Temperature (K)', 'Radius(R/Ro)']]
    y = df_star['Luminosity(L/Lo)']
    poly = PolynomialFeatures(degree=9)
    df_ply = pd.DataFrame(poly.fit_transform(X))
    df_ply.columns = poly.get_feature_names_out()
    error_dict = {}
    for column in df_ply:
        reg = LinearRegression()
        reg.fit(df_ply[[column]], y)
        error = mean_squared_error(reg.predict(df_ply[[column]]), y)
        error_dict[error] = column
    print(f"For the star type {star_type}, the smallest error is {min(error_dict)}, which is generagted form {error_dict[min(error_dict)]}")
for i in range(5):
    find_combination(i)
For the star type 0, the smallest error is 7.157957086249859e-07, which is generated from Temperature (K)^2
For the star type 1, the smallest error is 5.09406889887359e-05, which is generated from Temperature (K)^4 Radius(R/Ro)^5
For the star type 2, the smallest error is 3.540347414239562e-05, which is generated from Temperature (K)^9
For the star type 3, the smallest error is 714292055.875457, which is generated from Temperature (K)^4
For the star type 4, the smallest error is 24439177145.3056, which is generated from Temperature (K)^9

After applying PolynomialFeatures to each star type separately, the mean squared error drops considerably. However, each star type achieves its lowest error with a different polynomial feature. As a result, it is not safe to claim that any single polynomial combination of temperature and radius is the best predictor of luminosity.
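A more principled follow-up (a sketch under assumed defaults, not a result from this project) would fit all polynomial features jointly for each star type and compare degrees with cross-validation, instead of picking the single best feature on the training data:

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# For one star type, compare polynomial degrees by cross-validated R^2;
# scaling before the polynomial expansion keeps the high-degree terms
# from blowing up numerically.
def compare_degrees(star_type, degrees=(1, 2, 3, 4)):
    sub = df[df['Star type'] == star_type]
    X = sub[['Temperature (K)', 'Radius(R/Ro)']]
    y = sub['Luminosity(L/Lo)']
    for d in degrees:
        model = make_pipeline(StandardScaler(),
                              PolynomialFeatures(degree=d),
                              LinearRegression())
        scores = cross_val_score(model, X, y, cv=5, scoring='r2')
        print(f"type {star_type}, degree {d}: mean R^2 = {scores.mean():.3f}")

compare_degrees(3)  # e.g. Main Sequence stars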

Summary

In this project, I was able to predict a star's type using KNeighborsClassifier with comparatively high accuracy. However, a single best polynomial combination of temperature and radius for predicting the luminosity was not found, because the best feature differs between star types. As a result, a larger dataset is needed to get a more accurate result.

References

The dataset “6 class csv.csv” was adapted from Star dataset to predict star types

The methods and application of PolynomialFeatures were adapted from sklearn.preprocessing.PolynomialFeatures

The idea of polynomial features was adapted from Introduction to Polynomial Regression (with Python Implementation)

The code for drawing the Altair histogram was adapted from Simple Histogram

The code for drawing the boxplot was adapted from Boxplot with Min/Max Whiskers
