Students Performance#

Author: Qi Qin

Course Project, UC Irvine, Math 10, S23

Introduction#

My topic is students’ performance. My dataset includes anonymous student information such as sex, parental level of education, math scores, reading scores, writing scores, and unnamed race/ethnicity groups (group A to E). My goal is to relate one of a student’s scores to the other two scores, and to predict the race group from the three scores, in order to see whether race is in some way associated with students’ abilities.

Main part#

To import libraries:

import pandas as pd
import altair as alt
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
from sklearn.linear_model import LinearRegression
from pandas.api.types import is_numeric_dtype
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

To set up the dataset: We can see that the data contains 1000 rows and 9 columns.

df = pd.read_csv("Student Performance.csv")
df = df.dropna(axis=0)  # drop any rows with missing values (none are dropped here)
df.shape
(1000, 9)

To display the data:

df.head()
Unnamed: 0 race/ethnicity parental level of education lunch test preparation course math percentage reading score percentage writing score percentage sex
0 0 group B bachelor's degree standard none 0.72 0.72 0.74 F
1 1 group C some college standard completed 0.69 0.90 0.88 F
2 2 group B master's degree standard none 0.90 0.95 0.93 F
3 3 group A associate's degree free/reduced none 0.47 0.57 0.44 M
4 4 group C some college standard none 0.76 0.78 0.75 M

To make the column names clearer and more straightforward:

df.rename(columns = {"Unnamed: 0": "id"}, inplace = True)
df.rename(columns = {"race/ethnicity": "race"}, inplace = True)
df.rename(columns = {"math percentage": "math score"}, inplace = True)
df.rename(columns = {"reading score percentage": "reading score"}, inplace = True)
df.rename(columns = {"writing score percentage": "writing score"}, inplace = True)

To look at some attributes of the data and count the number of distinct values in each column (this line was suggested by ChatGPT):

#suggested by chatgpt
df[df.columns].nunique()
id                             1000
race                              5
parental level of education       6
lunch                             2
test preparation course           2
math score                       81
reading score                    72
writing score                    77
sex                               2
dtype: int64
df.dtypes
id                               int64
race                            object
parental level of education     object
lunch                           object
test preparation course         object
math score                     float64
reading score                  float64
writing score                  float64
sex                             object
dtype: object
df
id race parental level of education lunch test preparation course math score reading score writing score sex
0 0 group B bachelor's degree standard none 0.72 0.72 0.74 F
1 1 group C some college standard completed 0.69 0.90 0.88 F
2 2 group B master's degree standard none 0.90 0.95 0.93 F
3 3 group A associate's degree free/reduced none 0.47 0.57 0.44 M
4 4 group C some college standard none 0.76 0.78 0.75 M
... ... ... ... ... ... ... ... ... ...
995 995 group E master's degree standard completed 0.88 0.99 0.95 F
996 996 group C high school free/reduced none 0.62 0.55 0.55 M
997 997 group C high school free/reduced completed 0.59 0.71 0.65 F
998 998 group D some college standard completed 0.68 0.78 0.77 F
999 999 group D some college free/reduced none 0.77 0.86 0.86 F

1000 rows × 9 columns

To find some correlations:

Each number in the matrix is the correlation between the column named above it and the row named to its left. Correlations range from -1 to 1: values close to 1 (or -1) indicate a strong relationship, while values close to 0, such as 0.05, indicate a very weak relationship or essentially no correlation.

From this first matrix, we can see that id has almost no relationship with any of the three scores, since those correlations are around 0.05, which is quite small.

Only numeric columns are used by corr, so the matrix is 4 by 4. However, I also want to analyze columns like sex and race, so I will convert those columns to numeric types to get something meaningful to analyze.

corr is the new tool I used here; from the resulting matrix I can find which variables might help determine a student’s race group, so this function is a good starting point for further analysis.

relation = df.corr()
relation
id math score reading score writing score
id 1.000000 0.036827 0.048390 0.057848
math score 0.036827 1.000000 0.817580 0.802642
reading score 0.048390 0.817580 1.000000 0.954598
writing score 0.057848 0.802642 0.954598 1.000000
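To make the matrix easier to read, here is a small optional sketch that draws it as an annotated heatmap, using the seaborn and matplotlib imports from above and the relation DataFrame just computed (the color choices are mine, not part of the original analysis):

plt.figure(figsize=(6, 4))
sns.heatmap(relation, annot=True, fmt=".2f", cmap="Blues", vmin=-1, vmax=1)  # annotated correlation heatmap
plt.title("Correlation between id and the three scores")
plt.show()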

To change other columns to something suitable for the “corr” function:

From the nunique output, I can see that the “race” and “parental level of education” columns have more than 2 unique values, while the “lunch”, “test preparation course”, and “sex” columns each have only 2 distinct values.

So I will add numeric copies of these columns: the “race” and “parental level of education” values are mapped to integers, and the “lunch”, “test preparation course”, and “sex” values are converted directly to True and False.

Finally, I use is_numeric_dtype to check that the result is as expected.

df["parental level of education"].unique()
array(["bachelor's degree", 'some college', "master's degree",
       "associate's degree", 'high school', 'some high school'],
      dtype=object)
df["n_race"] = df["race"].replace(
    ["group A", "group B","group C","group D","group E"],
    [1,2,3,4,5])

df["n_ploe"] = df["parental level of education"].replace(
    ["some high school", "high school", "some college", "associate's degree", "bachelor's degree", "master's degree"], 
    [1,2,3,4,5,6])

df.rename(columns = {"lunch":"standardMeal"}, inplace = True)
df["standardMeal"] = df["standardMeal"].replace(
    ["standard", "free/reduced"],[True, False])

df.rename(columns = {"test preparation course": "donePrep"}, inplace = True)
df["donePrep"] = df["donePrep"].replace(
    ["completed", "none"], [True, False])
    
df.rename(columns = {"sex": "isFemale"}, inplace = True)
df["isFemale"] = df["isFemale"]. replace(["F", "M"], [True, False])
df.head()
id race parental level of education standardMeal donePrep math score reading score writing score isFemale n_race n_ploe
0 0 group B bachelor's degree True False 0.72 0.72 0.74 True 2 5
1 1 group C some college True True 0.69 0.90 0.88 True 3 3
2 2 group B master's degree True False 0.90 0.95 0.93 True 2 6
3 3 group A associate's degree False False 0.47 0.57 0.44 False 1 4
4 4 group C some college True False 0.76 0.78 0.75 False 3 3
df.columns
Index(['id', 'race', 'parental level of education', 'standardMeal', 'donePrep',
       'math score', 'reading score', 'writing score', 'isFemale', 'n_race',
       'n_ploe'],
      dtype='object')
features = [cols for cols in df.columns if is_numeric_dtype(df[cols])]
features
['id',
 'standardMeal',
 'donePrep',
 'math score',
 'reading score',
 'writing score',
 'isFemale',
 'n_race',
 'n_ploe']

To find the full correlation matrix: From this matrix, we can see that “writing score” is related to every column other than “id”, which makes sense since a student’s id should not determine academic performance. The two most strongly related columns are “reading score” and “writing score” (correlation about 0.95), and “math score” is also strongly related to “writing score”. This makes sense since writing draws on both reading ability and logical thinking. So I will build a linear regression model to estimate students’ writing score.

Moreover, the matrix shows that “race” is more related to “math score”, “reading score”, and “writing score” than to the other variables, so I would like to build a model that predicts race groups from these three scores.

re = df.corr()
re
id standardMeal donePrep math score reading score writing score isFemale n_race n_ploe
id 1.000000 -0.033348 0.028029 0.036827 0.048390 0.057848 0.043038 0.026590 -0.044672
standardMeal -0.033348 1.000000 -0.017044 0.350877 0.229560 0.245769 -0.021372 0.046563 -0.023259
donePrep 0.028029 -0.017044 1.000000 0.177702 0.241780 0.312946 -0.006028 0.017508 -0.007143
math score 0.036827 0.350877 0.177702 1.000000 0.817580 0.802642 -0.167982 0.216415 0.159432
reading score 0.048390 0.229560 0.241780 0.817580 1.000000 0.954598 0.244313 0.145253 0.190908
writing score 0.057848 0.245769 0.312946 0.802642 0.954598 1.000000 0.301225 0.165691 0.236715
isFemale 0.043038 -0.021372 -0.006028 -0.167982 0.244313 0.301225 1.000000 0.001502 0.043934
n_race 0.026590 0.046563 0.017508 0.216415 0.145253 0.165691 0.001502 1.000000 0.095906
n_ploe -0.044672 -0.023259 -0.007143 0.159432 0.190908 0.236715 0.043934 0.095906 1.000000

Linear regression for estimating writing scores: I would like to know whether math score or reading score is the better predictor.

From the chart, the blue dots look more concentrated around a line. To verify the intuition that “reading score” may do well, I built two linear regression models and compared their mean absolute errors and mean squared errors.

c1 = alt.Chart(df).mark_circle(color = "red").encode(
    x = "math score",
    y="writing score"
)
c2 = alt.Chart(df).mark_circle(color = "blue").encode(
    x = "reading score",
    y = "writing score"
)
c1+c2
regm = LinearRegression()
regr = LinearRegression()

regm.fit(df[["math score"]], df["writing score"])
regr.fit(df[["reading score"]], df["writing score"])

df["predm"] = regm.predict(df[["math score"]])
df["predr"] = regr.predict(df[["reading score"]])
m_mae = mean_absolute_error(df["writing score"], df["predm"])
r_mae = mean_absolute_error(df["writing score"], df["predr"])
m_mae
0.07535246410892701
r_mae
0.03614822225736518
m_mse = mean_squared_error(df["writing score"], df["predm"])
r_mse = mean_squared_error(df["writing score"], df["predr"])
m_mse
0.008206700489132093
r_mse
0.0020470863689389346

To check whether the mean_absolute_error and mean_squared_error functions work as expected: Although the comparisons return False, the functions do work, since 0.03614822225736518 (r_mae) and 0.036148222257365116 (ra) differ only around the 16th decimal place, which comes from floating-point rounding. We can treat them as equal for practical purposes, and the same holds for rs and r_mse.

ra = sum(abs(df["writing score"]-df["predr"]))/len(df)
ra
0.036148222257365116
rs = sum((df["writing score"]-df["predr"])**2)/len(df)
rs
0.002047086368938936
ra == r_mae
False
rs == r_mse
False
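To make the floating-point argument explicit, a small check with numpy (numpy is not among the original imports) compares the values up to rounding error instead of exactly; both comparisons should return True:

import numpy as np

print(np.isclose(ra, r_mae))  # equal up to floating-point rounding
print(np.isclose(rs, r_mse))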

So we conclude that “reading score” is the better predictor of students’ “writing score”:

r_mae <= m_mae
True
r_mse <= m_mse
True
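As one extra check that is not in the original analysis, the R² of each fitted model can be compared with LinearRegression’s score method; under this measure the reading-score model should also come out ahead:

# R^2 on the full data for each single-feature model (higher is better)
r2_math = regm.score(df[["math score"]], df["writing score"])
r2_read = regr.score(df[["reading score"]], df["writing score"])
print(r2_math, r2_read, r2_read > r2_math)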

To visualize the findings:

c3 = alt.Chart(df).mark_line(color = "red").encode(
    x = "math score",
    y= "predm"
)
c4 = alt.Chart(df).mark_line(color = "blue").encode(
    x = "reading score",
    y = "predr"
)
c3+c4
c1+c2+c3+c4

To visualize in a new way with “plotly.express”:

p1 = px.scatter(df, x = "math score", y = "writing score", color_discrete_sequence=["red"])
p2 = px.scatter(df, x = "reading score", y = "writing score", color_discrete_sequence=["blue"])
p3 = px.line(df,x = "math score", y = "predm", color_discrete_sequence= ["red"])
p4 = px.line(df,x = "math score", y = "predm", color_discrete_sequence= ["blue"])

To visualize in another way with “seaborn”:

s1 = sns.scatterplot(data = df, x = "math score", y = "writing score", color = "red")
s2 = sns.scatterplot(data = df, x = "reading score", y = "writing score", color = "blue")
s3 = sns.lineplot(data = df, x = "math score", y = "predm", color = "red")
s4 = sns.lineplot(data = df, x = "reading score", y = "predr", color = "blue")
[Figure: scatter plots of math score and reading score against writing score, with the two fitted regression lines.]

To predict the race group from the three academic scores:

col = ["math score", "reading score", "writing score"]
X_train, X_test, y_train, y_test = train_test_split(
    df[col],
    df["race"],
    train_size = 0.3  # 30% of the rows for training, the remaining 70% held out for testing
)

Fit a DecisionTreeClassifier and draw the tree diagram: The three scores I use to predict the race group are somewhat useful but far from perfect. When I set max_leaf_nodes = 500, the accuracy on the training set is 0.99 but only 0.24 on the test set. The 0.99 indicates overfitting: instead of learning the structure of the data, the model uses its many leaf nodes to memorize the existing data, so it cannot predict new values well.

To make the tree diagram easier to analyze, I changed max_leaf_nodes back to 7, a relatively small number that is quick to run and easy to read.

clf = DecisionTreeClassifier(max_leaf_nodes = 7)
clf.fit(X_train, y_train)
a = clf.score(X_train, y_train)
b = clf.score(X_test, y_test)
a
0.9966666666666667
b
0.24285714285714285
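To see the overfitting pattern described above directly, here is a small sketch that re-fits the classifier with several values of max_leaf_nodes and prints the training and test accuracy (the exact numbers depend on the random train/test split):

for n in [7, 50, 200, 500]:
    tree = DecisionTreeClassifier(max_leaf_nodes=n)
    tree.fit(X_train, y_train)
    # training accuracy keeps rising with more leaves while test accuracy stays low
    print(n, round(tree.score(X_train, y_train), 3), round(tree.score(X_test, y_test), 3))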
fig = plt.figure(figsize=(20,10))
_ = plot_tree(clf, feature_names=clf.feature_names_in_, filled=True)
[Figure: decision tree diagram produced by plot_tree for the fitted classifier.]

Pick two rows of the data at random to check whether the tree diagram tells the truth:

df.sample(2, random_state = 3)
id race parental level of education standardMeal donePrep math score reading score writing score isFemale n_race n_ploe predm predr
642 642 group B some high school False False 0.72 0.81 0.79 True 2 1 0.728086 0.798085
762 762 group D some high school True True 0.78 0.81 0.86 False 4 1 0.776348 0.798085

First sample, No. 642: math 0.72 < 0.725, go left; reading 0.81 > 0.725, go right; reading 0.81 > 0.755, go right. We reach a leaf with gini = 0.692, samples = 33, value = [1, 10, 13, 8, 1]. Conclusion: the student is most likely in group C, also fairly likely in group B, but unlikely to be in group A or E. Real case: No. 642 is in group B.

We get a similar result from code: 0.393939 is the largest probability, indicating that the student is most likely to be in group C.

Pred642 = clf.predict_proba(pd.DataFrame({
    "math score": [0.72],
    "reading score": [0.81],
    "writing score": [0.79]
}))
Pred642
array([[0.03030303, 0.3030303 , 0.39393939, 0.24242424, 0.03030303]])

Second sample, No. 762: math 0.78 > 0.725, go right. We reach gini = 0.734, samples = 97, value = [2, 11, 32, 27, 25]. Conclusion: the student is most likely in group C, also likely in group D, but unlikely to be in group A or B. Real case: No. 762 is in group D.

We get a similar result from code: 0.32989691 is the largest probability, so the student is most likely to be in group C and second most likely to be in group D.

Pred762 = clf.predict_proba(pd.DataFrame({
    "math score": [0.78],
    "reading score": [0.81],
    "writing score": [0.86]
}))
Pred762
array([[0.02061856, 0.11340206, 0.32989691, 0.27835052, 0.25773196]])
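Instead of tracing the tree by hand, the path a sample takes can also be printed programmatically. The following sketch, which assumes the fitted clf from above, uses decision_path, apply, and the tree_ attribute for the second sample (the node ids and counts depend on the random train/test split):

sample = pd.DataFrame({"math score": [0.78], "reading score": [0.81], "writing score": [0.86]})
leaf_id = clf.apply(sample)[0]                      # index of the leaf this sample ends in
for node_id in clf.decision_path(sample).indices:   # nodes visited from root to leaf
    if node_id == leaf_id:
        print("leaf", node_id, "class distribution:", clf.tree_.value[node_id][0])
    else:
        feat = clf.tree_.feature[node_id]
        thr = clf.tree_.threshold[node_id]
        side = "<=" if sample.iloc[0, feat] <= thr else ">"
        print(clf.feature_names_in_[feat], sample.iloc[0, feat], side, round(thr, 3))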

Result analysis: We predicted incorrectly in both cases. One reason could be that the tree has only 7 leaf nodes, so its predictions can deviate somewhat from the actual situation. Still, the predictions make some sense, since in both cases the true race group was the second most likely category.

Moreover, this suggests that the three scores are not sufficient for accurately predicting students’ race. So race is associated with students’ academic performance to some extent, but it does not explain every case and is not a decisive factor.
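The observation that the true group tends to sit among the most probable classes can be quantified. A rough sketch (assuming clf, X_test, and y_test from above, and importing numpy) computes how often the true race group falls within the model’s two most probable classes on the test set:

import numpy as np

proba = clf.predict_proba(X_test)
top2 = clf.classes_[np.argsort(proba, axis=1)[:, -2:]]    # two most likely groups per row
in_top2 = [truth in row for truth, row in zip(y_test, top2)]
print(sum(in_top2) / len(in_top2))                        # top-2 "accuracy"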

Summary#

  • Structure summary

The first part explores the possible correlations between each pair of columns.

The second part uses math score and reading score to predict writing score. I built two models, compared their MAE and MSE to find the better one, and visualized them in three different ways.

The third part is the classification: a decision tree diagram showing how well (or not) the three scores determine students’ race groups.

  • Analysis summary

From the correlation matrix, I found two questions to dig into: how to predict writing score, and how to predict a student’s race group. Although writing may draw on both mathematical reasoning and reading ability, reading score predicts writing score better than math score does. Race has some influence on math, reading, and writing scores, so the three scores can predict the general trend of which race groups a student is likely or unlikely to belong to, but the relationship is neither decisive nor precise.
