Students Performance#
Author: Qi Qin
Course Project, UC Irvine, Math 10, S23
Introduction#
My topic is students’ performance. The dataset contains anonymized student information such as sex, parental level of education, lunch type, test preparation status, math, reading, and writing scores, and an unnamed race/ethnicity group (group A to E). My goal is to relate one of a student’s scores to the other two, and to predict the race group from the three scores, to see whether race is related to students’ abilities in some way.
Main part#
To import libraries:
import pandas as pd
import altair as alt
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
from sklearn.linear_model import LinearRegression
from pandas.api.types import is_numeric_dtype
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
To set up the dataset: we can see that the data contains 1000 rows and 9 columns.
df = pd.read_csv("Student Performance.csv")
df = df.dropna(axis = 0)  # dropna returns a new DataFrame, so assign it back
df.shape
(1000, 9)
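As a quick sanity check (a small extra sketch, not part of the original run), we can confirm that the dataset has no missing values, which is why dropna removes nothing:
# Count missing values per column; all zeros means dropna removed no rows.
df.isna().sum()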
To display the data:
df.head()
|   | Unnamed: 0 | race/ethnicity | parental level of education | lunch | test preparation course | math percentage | reading score percentage | writing score percentage | sex |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | group B | bachelor's degree | standard | none | 0.72 | 0.72 | 0.74 | F |
| 1 | 1 | group C | some college | standard | completed | 0.69 | 0.90 | 0.88 | F |
| 2 | 2 | group B | master's degree | standard | none | 0.90 | 0.95 | 0.93 | F |
| 3 | 3 | group A | associate's degree | free/reduced | none | 0.47 | 0.57 | 0.44 | M |
| 4 | 4 | group C | some college | standard | none | 0.76 | 0.78 | 0.75 | M |
To make the column names clearer and more straightforward:
df.rename(columns = {
    "Unnamed: 0": "id",
    "race/ethnicity": "race",
    "math percentage": "math score",
    "reading score percentage": "reading score",
    "writing score percentage": "writing score",
}, inplace = True)
To learn the number of distinct values in each column of the data (this line was suggested by ChatGPT):
#suggested by chatgpt
df.nunique()  # equivalent to df[df.columns].nunique()
id 1000
race 5
parental level of education 6
lunch 2
test preparation course 2
math score 81
reading score 72
writing score 77
sex 2
dtype: int64
df.dtypes
id int64
race object
parental level of education object
lunch object
test preparation course object
math score float64
reading score float64
writing score float64
sex object
dtype: object
df
|   | id | race | parental level of education | lunch | test preparation course | math score | reading score | writing score | sex |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | group B | bachelor's degree | standard | none | 0.72 | 0.72 | 0.74 | F |
| 1 | 1 | group C | some college | standard | completed | 0.69 | 0.90 | 0.88 | F |
| 2 | 2 | group B | master's degree | standard | none | 0.90 | 0.95 | 0.93 | F |
| 3 | 3 | group A | associate's degree | free/reduced | none | 0.47 | 0.57 | 0.44 | M |
| 4 | 4 | group C | some college | standard | none | 0.76 | 0.78 | 0.75 | M |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 995 | group E | master's degree | standard | completed | 0.88 | 0.99 | 0.95 | F |
| 996 | 996 | group C | high school | free/reduced | none | 0.62 | 0.55 | 0.55 | M |
| 997 | 997 | group C | high school | free/reduced | completed | 0.59 | 0.71 | 0.65 | F |
| 998 | 998 | group D | some college | standard | completed | 0.68 | 0.78 | 0.77 | F |
| 999 | 999 | group D | some college | free/reduced | none | 0.77 | 0.86 | 0.86 | F |

1000 rows × 9 columns
To find some correlations:
Each number in the matrix shows the correlation between its row and column variables. Correlations range from -1 to 1: values with large magnitude indicate a strong relationship, while values near 0 (like 0.05) indicate a very weak relationship or none at all.
From this matrix, we can see that id has almost no relationship with any of the three scores, since those correlations are around 0.05, which is quite small.
Only numeric columns can be analyzed with corr, which is why the matrix is 4 by 4; however, I also want to analyze columns like sex and race, so I will convert those columns to numeric types to find something meaningful to analyze.
*corr is the new tool I used here. From the matrix, I can find variables that help determine which race group a student is in, so this function is a good starting point for further analysis.
relation = df.corr(numeric_only = True)  # numeric_only: newer pandas raises on object columns otherwise
relation
|   | id | math score | reading score | writing score |
|---|---|---|---|---|
| id | 1.000000 | 0.036827 | 0.048390 | 0.057848 |
| math score | 0.036827 | 1.000000 | 0.817580 | 0.802642 |
| reading score | 0.048390 | 0.817580 | 1.000000 | 0.954598 |
| writing score | 0.057848 | 0.802642 | 0.954598 | 1.000000 |
To convert the other columns into something suitable for the corr function:
From the nunique output, the “race” and “parental level of education” columns have more than 2 unique values, while the “lunch”, “test preparation course”, and “sex” columns have only 2 distinct values.
So I will add numeric copies of those columns, mapping the “race” and “parental level of education” values to integers, and converting the “lunch”, “prep course”, and “sex” values directly to True and False.
Finally, I use is_numeric_dtype to check that the result is suitable (pandas treats boolean columns as numeric, so the True/False columns qualify).
df["parental level of education"].unique()
array(["bachelor's degree", 'some college', "master's degree",
"associate's degree", 'high school', 'some high school'],
dtype=object)
df["n_race"] = df["race"].replace(
["group A", "group B","group C","group D","group E"],
[1,2,3,4,5])
df["n_ploe"] = df["parental level of education"].replace(
["some high school", "high school", "some college", "associate's degree", "bachelor's degree", "master's degree"],
[1,2,3,4,5,6])
df.rename(columns = {"lunch":"standardMeal"}, inplace = True)
df["standardMeal"] = df["standardMeal"].replace(
["standard", "free/reduced"],[True, False])
df.rename(columns = {"test preparation course": "donePrep"}, inplace = True)
df["donePrep"] = df["donePrep"].replace(
["completed", "none"], [True, False])
df.rename(columns = {"sex": "isFemale"}, inplace = True)
df["isFemale"] = df["isFemale"]. replace(["F", "M"], [True, False])
df.head()
|   | id | race | parental level of education | standardMeal | donePrep | math score | reading score | writing score | isFemale | n_race | n_ploe |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | group B | bachelor's degree | True | False | 0.72 | 0.72 | 0.74 | True | 2 | 5 |
| 1 | 1 | group C | some college | True | True | 0.69 | 0.90 | 0.88 | True | 3 | 3 |
| 2 | 2 | group B | master's degree | True | False | 0.90 | 0.95 | 0.93 | True | 2 | 6 |
| 3 | 3 | group A | associate's degree | False | False | 0.47 | 0.57 | 0.44 | False | 1 | 4 |
| 4 | 4 | group C | some college | True | False | 0.76 | 0.78 | 0.75 | False | 3 | 3 |
df.columns
Index(['id', 'race', 'parental level of education', 'standardMeal', 'donePrep',
'math score', 'reading score', 'writing score', 'isFemale', 'n_race',
'n_ploe'],
dtype='object')
features = [cols for cols in df.columns if is_numeric_dtype(df[cols])]
features
['id',
'standardMeal',
'donePrep',
'math score',
'reading score',
'writing score',
'isFemale',
'n_race',
'n_ploe']
To find the full correlation matrix: from the matrix we can see that “writing score” is correlated with every column other than “id”, which makes sense, since a student id should not determine academic performance. The strongest correlation is between “reading score” and “writing score” (about 0.95), and “math score” is also strongly correlated with “writing score” (about 0.80). This makes sense, since writing draws on both reading ability and logical reasoning. So I could build linear regression models to estimate students’ writing score.
Moreover, the matrix shows that, among the new columns, “race” is most related to “math score”, “reading score”, and “writing score”. I would like to build a model that predicts the race group from these three scores.
full_corr = df.corr(numeric_only = True)  # renamed from "re" to avoid shadowing the standard-library re module
full_corr
|   | id | standardMeal | donePrep | math score | reading score | writing score | isFemale | n_race | n_ploe |
|---|---|---|---|---|---|---|---|---|---|
| id | 1.000000 | -0.033348 | 0.028029 | 0.036827 | 0.048390 | 0.057848 | 0.043038 | 0.026590 | -0.044672 |
| standardMeal | -0.033348 | 1.000000 | -0.017044 | 0.350877 | 0.229560 | 0.245769 | -0.021372 | 0.046563 | -0.023259 |
| donePrep | 0.028029 | -0.017044 | 1.000000 | 0.177702 | 0.241780 | 0.312946 | -0.006028 | 0.017508 | -0.007143 |
| math score | 0.036827 | 0.350877 | 0.177702 | 1.000000 | 0.817580 | 0.802642 | -0.167982 | 0.216415 | 0.159432 |
| reading score | 0.048390 | 0.229560 | 0.241780 | 0.817580 | 1.000000 | 0.954598 | 0.244313 | 0.145253 | 0.190908 |
| writing score | 0.057848 | 0.245769 | 0.312946 | 0.802642 | 0.954598 | 1.000000 | 0.301225 | 0.165691 | 0.236715 |
| isFemale | 0.043038 | -0.021372 | -0.006028 | -0.167982 | 0.244313 | 0.301225 | 1.000000 | 0.001502 | 0.043934 |
| n_race | 0.026590 | 0.046563 | 0.017508 | 0.216415 | 0.145253 | 0.165691 | 0.001502 | 1.000000 | 0.095906 |
| n_ploe | -0.044672 | -0.023259 | -0.007143 | 0.159432 | 0.190908 | 0.236715 | 0.043934 | 0.095906 | 1.000000 |
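To surface the strongest pairs programmatically instead of scanning by eye, one can mask the diagonal and sort; a small sketch (numpy is an extra import here):
import numpy as np
# Blank out the diagonal self-correlations, then rank the remaining entries
# by absolute value; each pair appears twice since the matrix is symmetric.
mask = ~np.eye(len(full_corr), dtype = bool)
full_corr.where(mask).stack().abs().sort_values(ascending = False).head(6)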
Linear regression for estimating writing scores: I would like to know whether math score or reading score is the better predictor.
From the chart, the blue dots look more concentrated. To verify the intuition that “reading score” may predict well, I built two linear regression models and compared their mean absolute errors and mean squared errors.
c1 = alt.Chart(df).mark_circle(color = "red").encode(
x = "math score",
y="writing score"
)
c2 = alt.Chart(df).mark_circle(color = "blue").encode(
x = "reading score",
y = "writing score"
)
c1+c2
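Since the two point clouds overlap heavily, an optional tweak (not in the original charts) is to lower the mark opacity, which makes it easier to judge which cloud is more concentrated:
# Optional: semi-transparent marks make the overlapping clouds easier to compare.
c1.mark_circle(color = "red", opacity = 0.3) + c2.mark_circle(color = "blue", opacity = 0.3)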
regm = LinearRegression()
regr = LinearRegression()
regm.fit(df[["math score"]], df["writing score"])
regr.fit(df[["reading score"]], df["writing score"])
df["predm"] = regm.predict(df[["math score"]])
df["predr"] = regr.predict(df[["reading score"]])
m_mae = mean_absolute_error(df["writing score"], df["predm"])
r_mae = mean_absolute_error(df["writing score"], df["predr"])
m_mae
0.07535246410892701
r_mae
0.03614822225736518
m_mse = mean_squared_error(df["writing score"], df["predm"])
r_mse = mean_squared_error(df["writing score"], df["predr"])
m_mse
0.008206700489132093
r_mse
0.0020470863689389346
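For reference, the two metrics being compared are
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,$$
where $y_i$ is the true writing score and $\hat{y}_i$ the model’s prediction; the manual check below computes exactly these formulas.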
To check whether the mae and mse functions work well: although we got False, the functions worked correctly, since 0.03614822225736518 (r_mae) and 0.036148222257365116 (ra) differ only around the 16th decimal place, which is just floating-point rounding error. We can treat them as equal for practical purposes, and the same holds for rs and r_mse.
ra = sum(abs(df["writing score"]-df["predr"]))/len(df)
ra
0.036148222257365116
rs = sum((df["writing score"]-df["predr"])**2)/len(df)
rs
0.002047086368938936
ra == r_mae
False
rs == r_mse
False
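A tolerance-based comparison is the standard way to handle this floating-point issue; a small sketch using NumPy (an extra import here):
import numpy as np
# Exact == fails because the two computations accumulate rounding error
# differently; comparing within a tolerance succeeds.
np.isclose(ra, r_mae), np.isclose(rs, r_mse)
(True, True)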
So we conclude that “reading score” is a better predictor of students’ “writing score”:
r_mae <= m_mae
True
r_mse <= m_mse
True
To visualize the findings:
c3 = alt.Chart(df).mark_line(color = "red").encode(
x = "math score",
y= "predm"
)
c4 = alt.Chart(df).mark_line(color = "blue").encode(
x = "reading score",
y = "predr"
)
c3+c4
c1+c2+c3+c4
To visualize in a new way, with “plotly.express”:
p1 = px.scatter(df, x = "math score", y = "writing score", color_discrete_sequence=["red"])
p2 = px.scatter(df, x = "reading score", y = "writing score", color_discrete_sequence=["blue"])
p3 = px.line(df,x = "math score", y = "predm", color_discrete_sequence= ["red"])
p4 = px.line(df, x = "reading score", y = "predr", color_discrete_sequence = ["blue"])
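These px figures are separate objects; to overlay them like the Altair version, one can combine their traces into a single figure (a sketch using the p1 to p4 variables above):
import plotly.graph_objects as go
# Each px figure stores its traces in .data; concatenating the tuples
# overlays all four layers in one figure.
go.Figure(data = p1.data + p2.data + p3.data + p4.data)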
To visualize in yet another way, with “seaborn”:
s1 = sns.scatterplot(data = df, x = "math score", y = "writing score", color = "red")
s2 = sns.scatterplot(data = df, x = "reading score", y = "writing score", color = "blue")
s3 = sns.lineplot(data = df, x = "math score", y = "predm", color = "red")
s4 = sns.lineplot(data = df, x = "reading score", y = "predr", color = "blue")
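Note that all four seaborn calls draw on the same matplotlib axes, so they overlay into a single figure automatically; a minimal sketch to label and display it:
# The four plots above share the current axes; label and show the result.
plt.xlabel("score")
plt.ylabel("writing score")
plt.show()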
To predict the race group from students’ academic performance:
col = ["math score", "reading score", "writing score"]
X_train, X_test, y_train, y_test = train_test_split(
df[col],
df["race"],
train_size = 0.3
)
Fit a DecisionTreeClassifier and draw the tree diagram: the three scores are somewhat useful but far from perfect for predicting the race group. When I set max_leaf_nodes = 500, the accuracy on the training set was 0.99 but only 0.24 on the test set. The 0.99 indicates overfitting: the model cannot generalize to new values, because instead of learning the structure of the data it uses too many nodes to memorize the existing data.
To make the tree diagram convenient to analyze, I set max_leaf_nodes back to 7, a relatively small number that is easy to interpret.
clf = DecisionTreeClassifier(max_leaf_nodes = 7)
clf.fit(X_train, y_train)
a = clf.score(X_train, y_train)
b = clf.score(X_test, y_test)
a
0.9966666666666667
b
0.24285714285714285
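To see how the train/test gap grows with tree size, one could sweep max_leaf_nodes (a sketch using the split above; exact numbers vary because the split is random):
# Sweep the number of leaf nodes; a widening gap between train and
# test accuracy is the overfitting described above.
for n in [2, 7, 20, 100, 500]:
    tree = DecisionTreeClassifier(max_leaf_nodes = n)
    tree.fit(X_train, y_train)
    print(n, tree.score(X_train, y_train), tree.score(X_test, y_test))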
fig = plt.figure(figsize=(20,10))
_ = plot_tree(clf, feature_names=clf.feature_names_in_, filled=True)
Pick two rows of data at random to check whether the tree diagram tells the truth:
df.sample(2, random_state = 3)
|   | id | race | parental level of education | standardMeal | donePrep | math score | reading score | writing score | isFemale | n_race | n_ploe | predm | predr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 642 | 642 | group B | some high school | False | False | 0.72 | 0.81 | 0.79 | True | 2 | 1 | 0.728086 | 0.798085 |
| 762 | 762 | group D | some high school | True | True | 0.78 | 0.81 | 0.86 | False | 4 | 1 | 0.776348 | 0.798085 |
First sample, No. 642:
- math 0.72 < 0.725 → go left
- reading 0.81 > 0.725 → go right
- reading 0.81 > 0.755 → go right
- Leaf reached: gini = 0.692, samples = 33, value = [1, 10, 13, 8, 1]

Conclusion: the student is most likely to be in group C, also fairly likely to be in group B, but unlikely to be in groups A and E. Real case: No. 642 is in group B.
We can get a similar result in code: 0.393939 is the largest probability, which says the student is most likely to be in group C.
Pred642 = clf.predict_proba(pd.DataFrame({
"math score": [0.72],
"reading score": [0.81],
"writing score": [0.79]
}))
Pred642
array([[0.03030303, 0.3030303 , 0.39393939, 0.24242424, 0.03030303]])
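The columns of predict_proba follow clf.classes_ (alphabetical order for string labels), which confirms that the third entry corresponds to group C:
# predict_proba columns are ordered by the classifier's class labels.
clf.classes_
array(['group A', 'group B', 'group C', 'group D', 'group E'], dtype=object)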
Second sample, No. 762:
- math 0.78 > 0.725 → go right
- Leaf reached: gini = 0.734, samples = 97, value = [2, 11, 32, 27, 25]

Conclusion: the student is most likely to be in group C, also likely to be in group D, but unlikely to be in groups A and B. Real case: No. 762 is in group D.
We can get a similar result in code: 0.32989691 is the largest probability, which says the student is most likely to be in group C and second most likely to be in group D.
Pred762 = clf.predict_proba(pd.DataFrame({
"math score": [0.78],
"reading score": [0.81],
"writing score": [0.86]
}))
Pred762
array([[0.02061856, 0.11340206, 0.32989691, 0.27835052, 0.25773196]])
Result Analysis: We predicted wrong in both cases. One reason could be that the tree has only 7 leaf nodes, so its predictions can deviate somewhat from the actual situation. Still, the predictions make some sense: in both cases the true race group was the model’s second most likely category, which is encouraging.
Moreover, this suggests that the three scores are not sufficient for predicting a student’s race. Race appears to affect students’ academic performance to some degree, but it does not explain every case and is not the decisive element.
Summary#
Structure summary
The first part explores the pairwise correlations between the columns.
The second part uses math score and reading score to predict writing score. I built two models, compared their MAE and MSE to find the better one, and visualized them in three different ways.
The third part is the classification task: a decision tree that tries to determine students’ race group from the three scores.
Analysis summary
From the correlation matrix, I found two things to dig into: how to predict writing score and how to predict a student’s race group. Although writing may draw on both mathematical logic and reading ability, reading scores predict writing scores better than math scores do. Race has some influence on math, reading, and writing scores, so the three scores can predict the general trend of which race group a student is likely, or unlikely, to be in, but the prediction is neither decisive nor perfect.
References#
What is the source of your dataset(s)? From Kaggle: https://www.kaggle.com/datasets/spscientist/students-performance-in-exams
List any other references that you found helpful. I asked ChatGPT for some help and credited it where it is quoted above. I found corr() useful in this Kaggle notebook: https://www.kaggle.com/code/abhimanyudasarwar/used-cars-price-prediction. I learned the seaborn scatter-plot techniques from https://python-charts.com/correlation/scatter-plot-seaborn/ and learned plotly graphing with discrete colors from https://plotly.com/python/discrete-color/.