Star Type¶
Author: Yufei Ren
Course Project, UC Irvine, Math 10, W22
Introduction¶
The dataset “Star dataset to predict star types” consists of several features of stars in six categories: Brown Dwarf, Red Dwarf, White Dwarf, Main Sequence, Supergiant, and Hypergiant, which are assigned the numbers 0, 1, 2, 3, 4, and 5, respectively.
In this project, the temperature, radius, absolute magnitude (Mv), and luminosity are first used to predict the star type. After that, sklearn is used to find the relationship between temperature, radius, and luminosity.
Main portion of the project¶
import numpy as np
import pandas as pd
import seaborn as sns
import altair as alt
df = pd.read_csv("/work/6 class csv.csv")
df = df.dropna(axis=1)  # drop any columns that contain missing values
df.head()
|  | Temperature (K) | Luminosity(L/Lo) | Radius(R/Ro) | Absolute magnitude(Mv) | Star type | Star color | Spectral Class |
|---|---|---|---|---|---|---|---|
| 0 | 3068 | 0.002400 | 0.1700 | 16.12 | 0 | Red | M |
| 1 | 3042 | 0.000500 | 0.1542 | 16.60 | 0 | Red | M |
| 2 | 2600 | 0.000300 | 0.1020 | 18.70 | 0 | Red | M |
| 3 | 2800 | 0.000200 | 0.1600 | 16.65 | 0 | Red | M |
| 4 | 1939 | 0.000138 | 0.1030 | 20.06 | 0 | Red | M |
df.describe()
|  | Temperature (K) | Luminosity(L/Lo) | Radius(R/Ro) | Absolute magnitude(Mv) | Star type |
|---|---|---|---|---|---|
| count | 240.000000 | 240.000000 | 240.000000 | 240.000000 | 240.000000 |
| mean | 10497.462500 | 107188.361635 | 237.157781 | 4.382396 | 2.500000 |
| std | 9552.425037 | 179432.244940 | 517.155763 | 10.532512 | 1.711394 |
| min | 1939.000000 | 0.000080 | 0.008400 | -11.920000 | 0.000000 |
| 25% | 3344.250000 | 0.000865 | 0.102750 | -6.232500 | 1.000000 |
| 50% | 5776.000000 | 0.070500 | 0.762500 | 8.313000 | 2.500000 |
| 75% | 15055.500000 | 198050.000000 | 42.750000 | 13.697500 | 4.000000 |
| max | 40000.000000 | 849420.000000 | 1948.500000 | 20.060000 | 5.000000 |
df.columns
Index(['Temperature (K)', 'Luminosity(L/Lo)', 'Radius(R/Ro)',
'Absolute magnitude(Mv)', 'Star type', 'Star color', 'Spectral Class'],
dtype='object')
Altair charts are used to visualize the dataset before making predictions.
brush = alt.selection_interval()
c1 = alt.Chart(df).mark_point().encode(
    x='Absolute magnitude(Mv)',
    y='Radius(R/Ro):Q',
    color='Star type:N'
).add_selection(brush)
c2 = alt.Chart(df).mark_bar().encode(
    x='Star type:N',
    y='Absolute magnitude(Mv)'
).transform_filter(brush)
c1 | c2
Predict the Star type¶
First, KNeighborsClassifier is used to predict the star type.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
X = df.iloc[:, :4]  # temperature, luminosity, radius, absolute magnitude
y = df["Star type"]
Before using the K-Nearest Neighbors classifier, a scaler is applied to the input data, since KNN relies on distances and features with large ranges would otherwise dominate.
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
loss_train = log_loss(y_train, clf.predict_proba(X_train))
loss_test = log_loss(y_test, clf.predict_proba(X_test))
print(f"The log_loss of X_train and y_train is {loss_train:.2f}")
print(f"The log_loss of X_test and y_test is {loss_test:.2f}")
The log_loss of X_train and y_train is 0.03
The log_loss of X_test and y_test is 0.02
df['predict_K'] = clf.predict(X_scaled)
The log loss on the testing data is small and close to the training loss, so there is no sign of overfitting.
(df["Star type"] == df["predict_K"]).value_counts()
True 237
False 3
dtype: int64
Here we can see that the predictions match the true star types for 237 of the 240 stars, and again there is no sign of overfitting.
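As one more check, the accuracy on the held-out test set alone can be computed, since the comparison above mixes rows the model was trained on with rows it never saw. A minimal sketch, reusing the clf, X_test, and y_test defined above:

from sklearn.metrics import accuracy_score

# score only the held-out rows; accuracy over the full dataset
# includes training rows and can therefore look optimistic
test_acc = accuracy_score(y_test, clf.predict(X_test))
print(f"Test-set accuracy: {test_acc:.3f}")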
Predict the Luminosity¶
After using K-Nearest Neighbors to predict the type of a star, I am interested in finding how the radius and temperature of a star are related to its luminosity.
I first try LinearRegression.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
X2 = df[['Radius(R/Ro)','Temperature (K)']]
y2 = df['Luminosity(L/Lo)']
reg1 = LinearRegression()
reg1.fit(X2,y2)
MSE1 = mean_squared_error(y2,reg1.predict(X2))
MAE1 = mean_absolute_error(y2,reg1.predict(X2))
print(f"the coefficients of reg are {reg1.coef_}")
print(f"the intersept of reg is {reg1.intercept_}.")
print(f'The Mean square error is {MSE1:.3f}')
print(f'The Mean absolute error is {MAE1:.3f}')
the coefficients of reg are [174.63473048 6.78255167]
the intercept of reg is -5427.205389764014.
The mean squared error is 19010660326.728
The mean absolute error is 87486.680
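For a sense of how large this error is, a simple baseline is the MSE of always predicting the mean luminosity, which equals the variance of the target. A minimal sketch, reusing y2 from above:

# baseline: always predict the mean luminosity; its MSE equals the
# variance of the target, so a useful model should beat this number
baseline_mse = mean_squared_error(y2, np.full(len(y2), y2.mean()))
print(f"baseline (predict-the-mean) MSE: {baseline_mse:.3f}")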
The MSE is far too high in this case, so I next try KNeighborsRegressor; again, the input should be scaled first because the features are not in the same units.
from sklearn.neighbors import KNeighborsRegressor
scaler = StandardScaler()
scaler.fit(X2)
X2_scaled = scaler.transform(X2)
reg2 = KNeighborsRegressor(n_neighbors=4)
reg2.fit(X2_scaled, y2)
df['predict_l'] = reg2.predict(X2_scaled)
MSE2 = mean_squared_error(reg2.predict(X2_scaled),y2)
MAE2 = mean_absolute_error(reg2.predict(X2_scaled),y2)
print(f'The mean squared error is {MSE2:.3f}')
print(f'The mean absolute error is {MAE2:.3f}')
The mean squared error is 9605051475.136
The mean absolute error is 47796.629
The error is still large, but smaller than the error from linear regression. A possible reason is that the relationship between these variables is not linear but polynomial.
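One quick way to probe for such a relationship is to fit a linear model to the logarithms of the variables, since log L = a·log R + b·log T + c corresponds to a power law L ∝ R^a T^b. A minimal sketch under that assumption, reusing X2 and y2 from above (every value in these columns is strictly positive, so the logarithm is defined):

# fit log(L) against log(R) and log(T); the fitted coefficients are the
# exponents of a power law L ≈ C * R^a * T^b in the original units
reg_log = LinearRegression()
reg_log.fit(np.log(X2), np.log(y2))
print(f"power-law exponents for (Radius, Temperature): {reg_log.coef_}")
print(f"log-space intercept: {reg_log.intercept_}")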
To check whether the relationship is polynomial, PolynomialFeatures is used.
df3 = df.iloc[:,:3]
df3.columns
Index(['Temperature (K)', 'Luminosity(L/Lo)', 'Radius(R/Ro)'], dtype='object')
y_ply = df['Luminosity(L/Lo)']
X_ply = df[['Temperature (K)', 'Radius(R/Ro)']]
from sklearn.preprocessing import PolynomialFeatures
Here I first create a dataframe that contains all polynomial combinations of temperature and radius up to degree 9.
poly = PolynomialFeatures(degree=9)
df_ply = pd.DataFrame(poly.fit_transform(X_ply))
df_ply.columns = poly.get_feature_names_out()
df_ply
|  | 1 | Temperature (K) | Radius(R/Ro) | Temperature (K)^2 | Temperature (K) Radius(R/Ro) | Radius(R/Ro)^2 | ... | Radius(R/Ro)^9 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 3068.0 | 0.1700 | 9.412624e+06 | 5.215600e+02 | 2.890000e-02 | ... | 1.185879e-07 |
| 1 | 1.0 | 3042.0 | 0.1542 | 9.253764e+06 | 4.690764e+02 | 2.377764e-02 | ... | 4.929006e-08 |
| 2 | 1.0 | 2600.0 | 0.1020 | 6.760000e+06 | 2.652000e+02 | 1.040400e-02 | ... | 1.195093e-09 |
| 3 | 1.0 | 2800.0 | 0.1600 | 7.840000e+06 | 4.480000e+02 | 2.560000e-02 | ... | 6.871948e-08 |
| 4 | 1.0 | 1939.0 | 0.1030 | 3.759721e+06 | 1.997170e+02 | 1.060900e-02 | ... | 1.304773e-09 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 235 | 1.0 | 38940.0 | 1356.0000 | 1.516324e+09 | 5.280264e+07 | 1.838736e+06 | ... | 1.550020e+28 |
| 236 | 1.0 | 30839.0 | 1194.0000 | 9.510439e+08 | 3.682177e+07 | 1.425636e+06 | ... | 4.932180e+27 |
| 237 | 1.0 | 8829.0 | 1423.0000 | 7.795124e+07 | 1.256367e+07 | 2.024929e+06 | ... | 2.392457e+28 |
| 238 | 1.0 | 9235.0 | 1112.0000 | 8.528522e+07 | 1.026932e+07 | 1.236544e+06 | ... | 2.599819e+27 |
| 239 | 1.0 | 37882.0 | 1783.0000 | 1.435046e+09 | 6.754361e+07 | 3.179089e+06 | ... | 1.821219e+29 |

240 rows × 55 columns
Then I fit a linear regression of luminosity on each polynomial combination and calculate the error. At the end, I print the smallest error and the combination that produces it.
error_dict = {}
for column in df_ply:
    # fit luminosity against each single polynomial feature and record its error
    reg = LinearRegression()
    reg.fit(df_ply[[column]], y_ply)
    error = mean_squared_error(reg.predict(df_ply[[column]]), y_ply)
    error_dict[error] = column  # keyed by error so min() finds the best column
print("the smallest mean squared error is", min(error_dict), 'from column', error_dict[min(error_dict)])
the smallest mean squared error is 23173652154.675903 from column Radius(R/Ro)
Here we can see the lowest mean squared error is around 2.3 × 10^10, and the best single feature is Radius^1 · Temperature^0, i.e. the radius alone.
The error is very large, and a possible reason is that all star types are evaluated together even though their luminosity ranges are on very different scales. As a result, the star types are evaluated separately below.
alt.Chart(df).mark_boxplot(extent='min-max').encode(
    x='Star type:N',
    y='Luminosity(L/Lo):Q'
)
In the box plot above, it is apparent that the luminosity ranges of the different star types are on very different scales.
def find_combination(star_type):
    # restrict to one star type, then search its polynomial features
    df_star = df[df['Star type'] == star_type].iloc[:, :3]
    X = df_star[['Temperature (K)', 'Radius(R/Ro)']]
    y = df_star['Luminosity(L/Lo)']
    poly = PolynomialFeatures(degree=9)
    df_ply = pd.DataFrame(poly.fit_transform(X))
    df_ply.columns = poly.get_feature_names_out()
    error_dict = {}
    for column in df_ply:
        reg = LinearRegression()
        reg.fit(df_ply[[column]], y)
        error = mean_squared_error(reg.predict(df_ply[[column]]), y)
        error_dict[error] = column
    print(f"For the star type {star_type}, the smallest error is {min(error_dict)}, which is generated from {error_dict[min(error_dict)]}")
# note: this loop covers star types 0-4; type 5 (Hypergiant) is not evaluated
for i in range(5):
    find_combination(i)
For the star type 0, the smallest error is 7.157957086249859e-07, which is generated from Temperature (K)^2
For the star type 1, the smallest error is 5.09406889887359e-05, which is generated from Temperature (K)^4 Radius(R/Ro)^5
For the star type 2, the smallest error is 3.540347414239562e-05, which is generated from Temperature (K)^9
For the star type 3, the smallest error is 714292055.875457, which is generated from Temperature (K)^4
For the star type 4, the smallest error is 24439177145.3056, which is generated from Temperature (K)^9
After applying PolynomialFeatures to each star type separately, the mean squared error decreases substantially. However, each star type attains its lowest error with a different polynomial combination, so it is not safe to claim that any single polynomial combination of temperature and radius is best for predicting the luminosity.
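Note also that the errors above are training errors: each candidate column is scored on the same rows it was fitted on, which can favor high-degree features. A more cautious refinement would score each candidate with cross-validation instead. A minimal sketch, reusing df_ply and y_ply from the full-dataset search above (the helper cv_error is just for this sketch, not part of the original analysis):

from sklearn.model_selection import cross_val_score

def cv_error(features, target):
    # mean cross-validated MSE of a one-feature linear fit
    scores = cross_val_score(LinearRegression(), features, target,
                             scoring="neg_mean_squared_error", cv=5)
    return -scores.mean()

# score each polynomial column on held-out folds rather than training rows
cv_errors = {column: cv_error(df_ply[[column]], y_ply) for column in df_ply}
best = min(cv_errors, key=cv_errors.get)
print(f"smallest cross-validated MSE is {cv_errors[best]:.3f} from column {best}")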
Summary¶
In this project, I was able to predict a star's type using KNeighborsClassifier with comparatively high accuracy. However, a single best polynomial combination of temperature and radius for predicting the luminosity was not found, because the best structure differs between star types. As a result, a larger dataset would be needed to get a more accurate result.
References¶
The dataset “6 class csv.csv” was adapted from Star dataset to predict star types.
The methods and application of PolynomialFeatures were adapted from sklearn.preprocessing.PolynomialFeatures.
The idea of polynomial features was adapted from Introduction to Polynomial Regression (with Python Implementation).
The code for drawing the Altair charts was adapted from Simple Histogram.
The code for drawing the box plot was adapted from Boxplot with Min/Max Whiskers.