Star Type¶
Author: Yufei Ren
Course Project, UC Irvine, Math 10, W22
Introduction¶
The dataset “Star dataset to predict star types” consists of several features of stars in six categories: Brown Dwarf, Red Dwarf, White Dwarf, Main Sequence, Supergiant, and Hypergiant, which are assigned the numbers 0, 1, 2, 3, 4, and 5, respectively.
In this project, the temperature, radius, absolute magnitude (Mv), and luminosity are first used to predict the star type. After that, sklearn is used to find the relationship between temperature, radius, and luminosity.
Main portion of the project¶
import numpy as np
import pandas as pd
import seaborn as sns
import altair as alt
df = pd.read_csv("/work/6 class csv.csv")
df = df.dropna(axis=1)  # drop any columns that contain missing values
df.head()
|  | Temperature (K) | Luminosity(L/Lo) | Radius(R/Ro) | Absolute magnitude(Mv) | Star type | Star color | Spectral Class |
|---|---|---|---|---|---|---|---|
| 0 | 3068 | 0.002400 | 0.1700 | 16.12 | 0 | Red | M |
| 1 | 3042 | 0.000500 | 0.1542 | 16.60 | 0 | Red | M |
| 2 | 2600 | 0.000300 | 0.1020 | 18.70 | 0 | Red | M |
| 3 | 2800 | 0.000200 | 0.1600 | 16.65 | 0 | Red | M |
| 4 | 1939 | 0.000138 | 0.1030 | 20.06 | 0 | Red | M |
df.describe()
|  | Temperature (K) | Luminosity(L/Lo) | Radius(R/Ro) | Absolute magnitude(Mv) | Star type |
|---|---|---|---|---|---|
| count | 240.000000 | 240.000000 | 240.000000 | 240.000000 | 240.000000 |
| mean | 10497.462500 | 107188.361635 | 237.157781 | 4.382396 | 2.500000 |
| std | 9552.425037 | 179432.244940 | 517.155763 | 10.532512 | 1.711394 |
| min | 1939.000000 | 0.000080 | 0.008400 | -11.920000 | 0.000000 |
| 25% | 3344.250000 | 0.000865 | 0.102750 | -6.232500 | 1.000000 |
| 50% | 5776.000000 | 0.070500 | 0.762500 | 8.313000 | 2.500000 |
| 75% | 15055.500000 | 198050.000000 | 42.750000 | 13.697500 | 4.000000 |
| max | 40000.000000 | 849420.000000 | 1948.500000 | 20.060000 | 5.000000 |
df.columns
Index(['Temperature (K)', 'Luminosity(L/Lo)', 'Radius(R/Ro)',
'Absolute magnitude(Mv)', 'Star type', 'Star color', 'Spectral Class'],
dtype='object')
Altair charts are used to visualize the dataset before making predictions.
brush = alt.selection_interval()
c1 = alt.Chart(df).mark_point().encode(
    x='Absolute magnitude(Mv)',
    y='Radius(R/Ro):Q',
    color='Star type:N'
).add_selection(brush)
c2 = alt.Chart(df).mark_bar().encode(
    x='Star type:N',
    y='Absolute magnitude(Mv)'
).transform_filter(brush)
c1 | c2
Predict the Star type¶
First, KNeighborsClassifier is used to predict the star type.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
X = df.iloc[:, :4]  # temperature, luminosity, radius, absolute magnitude
y = df["Star type"]
Before using the K-Nearest Neighbors classifier, a scaler is applied to the input data, since KNN relies on distances and features with large ranges would otherwise dominate.
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
loss_train = log_loss(y_train, clf.predict_proba(X_train))
loss_test = log_loss(y_test, clf.predict_proba(X_test))
print(f"The log_loss of X_train and y_train is {loss_train:.2f}")
print(f"The log_loss of X_test and y_test is {loss_test:.2f}")
The log_loss of X_train and y_train is 0.03
The log_loss of X_test and y_test is 0.02
df['predict_K'] = clf.predict(X_scaled)
The log loss on the testing data is small and close to the training loss, so there is no sign of overfitting.
(df["Star type"] == df["predict_K"]).value_counts()
True 237
False 3
dtype: int64
Here we can see that the predictions match the true star types for 237 of the 240 stars, and again there is no sign of overfitting.
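As one more check, the accuracy on the held-out test set alone can be computed, since the comparison above mixes rows the model was trained on with rows it never saw. A minimal sketch, reusing the clf, X_test, and y_test defined above:

from sklearn.metrics import accuracy_score

# score only the held-out rows; accuracy over the full dataset
# includes training rows and can therefore look optimistic
test_acc = accuracy_score(y_test, clf.predict(X_test))
print(f"Test-set accuracy: {test_acc:.3f}")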
Predict the Luminosity¶
After using K-Nearest Neighbors to predict the type of a star, I am interested in finding how the radius and temperature of a star are related to its luminosity.
I first try LinearRegression.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
X2 = df[['Radius(R/Ro)','Temperature (K)']]
y2 = df['Luminosity(L/Lo)']
reg1 = LinearRegression()
reg1.fit(X2,y2)
MSE1 = mean_squared_error(y2,reg1.predict(X2))
MAE1 = mean_absolute_error(y2,reg1.predict(X2))
print(f"the coefficients of reg are {reg1.coef_}")
print(f"the intersept of reg is {reg1.intercept_}.")
print(f'The Mean square error is {MSE1:.3f}')
print(f'The Mean absolute error is {MAE1:.3f}')
the coefficients of reg are [174.63473048 6.78255167]
the intercept of reg is -5427.205389764014.
The mean squared error is 19010660326.728
The mean absolute error is 87486.680
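For a sense of how large this error is, a simple baseline is the MSE of always predicting the mean luminosity, which equals the variance of the target. A minimal sketch, reusing y2 from above:

# baseline: always predict the mean luminosity; its MSE equals the
# variance of the target, so a useful model should beat this number
baseline_mse = mean_squared_error(y2, np.full(len(y2), y2.mean()))
print(f"baseline (predict-the-mean) MSE: {baseline_mse:.3f}")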
The MSE is far too high in this case, so I next try KNeighborsRegressor; again, the input should be scaled first because the features are not in the same units.
from sklearn.neighbors import KNeighborsRegressor
scaler = StandardScaler()
scaler.fit(X2)
X2_scaled = scaler.transform(X2)
reg2 = KNeighborsRegressor(n_neighbors=4)
reg2.fit(X2_scaled, y2)
df['predict_l'] = reg2.predict(X2_scaled)
MSE2 = mean_squared_error(reg2.predict(X2_scaled),y2)
MAE2 = mean_absolute_error(reg2.predict(X2_scaled),y2)
print(f'The mean squared error is {MSE2:.3f}')
print(f'The mean absolute error is {MAE2:.3f}')
The mean squared error is 9605051475.136
The mean absolute error is 47796.629
The error is still large, but smaller than the error from linear regression. A possible reason is that the relationship between these variables is not linear but polynomial.
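One quick way to probe for such a relationship is to fit a linear model to the logarithms of the variables, since log L = a·log R + b·log T + c corresponds to a power law L ∝ R^a T^b. A minimal sketch under that assumption, reusing X2 and y2 from above (every value in these columns is strictly positive, so the logarithm is defined):

# fit log(L) against log(R) and log(T); the fitted coefficients are the
# exponents of a power law L ≈ C * R^a * T^b in the original units
reg_log = LinearRegression()
reg_log.fit(np.log(X2), np.log(y2))
print(f"power-law exponents for (Radius, Temperature): {reg_log.coef_}")
print(f"log-space intercept: {reg_log.intercept_}")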
To check whether the relationship is polynomial, PolynomialFeatures is used.
df3 = df.iloc[:,:3]
df3.columns
Index(['Temperature (K)', 'Luminosity(L/Lo)', 'Radius(R/Ro)'], dtype='object')
y_ply = df['Luminosity(L/Lo)']
X_ply = df[['Temperature (K)', 'Radius(R/Ro)']]
from sklearn.preprocessing import PolynomialFeatures
Here I first create a dataframe that contains all polynomial combinations of temperature and radius up to degree 9.
poly = PolynomialFeatures(degree=9)
df_ply = pd.DataFrame(poly.fit_transform(X_ply))
df_ply.columns = poly.get_feature_names_out()
df_ply
|  | 1 | Temperature (K) | Radius(R/Ro) | Temperature (K)^2 | Temperature (K) Radius(R/Ro) | Radius(R/Ro)^2 | ... | Radius(R/Ro)^9 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 3068.0 | 0.1700 | 9.412624e+06 | 5.215600e+02 | 2.890000e-02 | ... | 1.185879e-07 |
| 1 | 1.0 | 3042.0 | 0.1542 | 9.253764e+06 | 4.690764e+02 | 2.377764e-02 | ... | 4.929006e-08 |
| 2 | 1.0 | 2600.0 | 0.1020 | 6.760000e+06 | 2.652000e+02 | 1.040400e-02 | ... | 1.195093e-09 |
| 3 | 1.0 | 2800.0 | 0.1600 | 7.840000e+06 | 4.480000e+02 | 2.560000e-02 | ... | 6.871948e-08 |
| 4 | 1.0 | 1939.0 | 0.1030 | 3.759721e+06 | 1.997170e+02 | 1.060900e-02 | ... | 1.304773e-09 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 235 | 1.0 | 38940.0 | 1356.0000 | 1.516324e+09 | 5.280264e+07 | 1.838736e+06 | ... | 1.550020e+28 |
| 236 | 1.0 | 30839.0 | 1194.0000 | 9.510439e+08 | 3.682177e+07 | 1.425636e+06 | ... | 4.932180e+27 |
| 237 | 1.0 | 8829.0 | 1423.0000 | 7.795124e+07 | 1.256367e+07 | 2.024929e+06 | ... | 2.392457e+28 |
| 238 | 1.0 | 9235.0 | 1112.0000 | 8.528522e+07 | 1.026932e+07 | 1.236544e+06 | ... | 2.599819e+27 |
| 239 | 1.0 | 37882.0 | 1783.0000 | 1.435046e+09 | 6.754361e+07 | 3.179089e+06 | ... | 1.821219e+29 |

240 rows × 55 columns
Then I fit a linear regression of luminosity on each polynomial combination and calculate the error. At the end, I print the smallest error and the combination that produces it.
error_dict = {}
for column in df_ply:
    # fit luminosity against each single polynomial feature and record its error
    reg = LinearRegression()
    reg.fit(df_ply[[column]], y_ply)
    error = mean_squared_error(reg.predict(df_ply[[column]]), y_ply)
    error_dict[error] = column  # keyed by error so min() finds the best column
print("the smallest mean squared error is", min(error_dict), 'from column', error_dict[min(error_dict)])
the smallest mean squared error is 23173652154.675903 from column Radius(R/Ro)
Here we can see the lowest mean squared error is around 2.3 × 10^10, and the best single feature is Radius^1 · Temperature^0, i.e. the radius alone.
The error is very large, and a possible reason is that all star types are evaluated together even though their luminosity ranges are on very different scales. As a result, the star types are evaluated separately below.
alt.Chart(df).mark_boxplot(extent='min-max').encode(
    x='Star type:N',
    y='Luminosity(L/Lo):Q'
)
In the box plot above, it is apparent that the luminosity ranges of the different star types are on very different scales.
def find_combination(star_type):
    # restrict to one star type, then search its polynomial features
    df_star = df[df['Star type'] == star_type].iloc[:, :3]
    X = df_star[['Temperature (K)', 'Radius(R/Ro)']]
    y = df_star['Luminosity(L/Lo)']
    poly = PolynomialFeatures(degree=9)
    df_ply = pd.DataFrame(poly.fit_transform(X))
    df_ply.columns = poly.get_feature_names_out()
    error_dict = {}
    for column in df_ply:
        reg = LinearRegression()
        reg.fit(df_ply[[column]], y)
        error = mean_squared_error(reg.predict(df_ply[[column]]), y)
        error_dict[error] = column
    print(f"For the star type {star_type}, the smallest error is {min(error_dict)}, which is generated from {error_dict[min(error_dict)]}")
# note: this loop covers star types 0-4; type 5 (Hypergiant) is not evaluated
for i in range(5):
    find_combination(i)
For the star type 0, the smallest error is 7.157957086249859e-07, which is generated from Temperature (K)^2
For the star type 1, the smallest error is 5.09406889887359e-05, which is generated from Temperature (K)^4 Radius(R/Ro)^5
For the star type 2, the smallest error is 3.540347414239562e-05, which is generated from Temperature (K)^9
For the star type 3, the smallest error is 714292055.875457, which is generated from Temperature (K)^4
For the star type 4, the smallest error is 24439177145.3056, which is generated from Temperature (K)^9
After applying PolynomialFeatures to each star type separately, the mean squared error decreases substantially. However, each star type attains its lowest error with a different polynomial combination, so it is not safe to claim that any single polynomial combination of temperature and radius is best for predicting the luminosity.
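Note also that the errors above are training errors: each candidate column is scored on the same rows it was fitted on, which can favor high-degree features. A more cautious refinement would score each candidate with cross-validation instead. A minimal sketch, reusing df_ply and y_ply from the full-dataset search above (the helper cv_error is just for this sketch, not part of the original analysis):

from sklearn.model_selection import cross_val_score

def cv_error(features, target):
    # mean cross-validated MSE of a one-feature linear fit
    scores = cross_val_score(LinearRegression(), features, target,
                             scoring="neg_mean_squared_error", cv=5)
    return -scores.mean()

# score each polynomial column on held-out folds rather than training rows
cv_errors = {column: cv_error(df_ply[[column]], y_ply) for column in df_ply}
best = min(cv_errors, key=cv_errors.get)
print(f"smallest cross-validated MSE is {cv_errors[best]:.3f} from column {best}")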
Summary¶
In this project, I was able to predict a star's type using KNeighborsClassifier with comparatively high accuracy. However, a single best polynomial combination of temperature and radius for predicting the luminosity was not found, because the best structure differs between star types. As a result, a larger dataset would be needed to get a more accurate result.
References¶
The dataset “6 class csv.csv” was adapted from Star dataset to predict star types.
The methods and application of PolynomialFeatures were adapted from sklearn.preprocessing.PolynomialFeatures.
The idea of polynomial features was adapted from Introduction to Polynomial Regression (with Python Implementation).
The code for drawing the Altair charts was adapted from Simple Histogram.
The code for drawing the box plot was adapted from Boxplot with Min/Max Whiskers.