Salary Classification By Using Decision Tree#

Author: Kevin Xu

Course Project, UC Irvine, Math 10, F22

Introduction#

I chose a dataset that records each person's age, education level, household situation, type of employer, and other attributes, together with whether their salary is over 50k. The purpose of this project is to classify whether a person's salary is higher than 50k from their age, education level, family situation, and so on. I think this can help employees plan their career paths and help companies set salaries. The project has four steps. First, I use pandas to clean the data. Second, I plot several charts to visualize the data and help readers understand it. Third, I build a decision tree. Last, I use train_test_split to check the accuracy. This is a supervised machine learning project, specifically a classification problem.

Introduce Data#

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
import altair as alt
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import plot_tree
df = pd.read_csv("salary.csv")
df
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
32556 27 Private 257302 Assoc-acdm 12 Married-civ-spouse Tech-support Wife White Female 0 0 38 United-States <=50K
32557 40 Private 154374 HS-grad 9 Married-civ-spouse Machine-op-inspct Husband White Male 0 0 40 United-States >50K
32558 58 Private 151910 HS-grad 9 Widowed Adm-clerical Unmarried White Female 0 0 40 United-States <=50K
32559 22 Private 201490 HS-grad 9 Never-married Adm-clerical Own-child White Male 0 0 20 United-States <=50K
32560 52 Self-emp-inc 287927 HS-grad 9 Married-civ-spouse Exec-managerial Wife White Female 15024 0 40 United-States >50K

32561 rows × 15 columns

Here I import the necessary tools and load the data. Each row records whether a person's salary is higher than 50K, along with attributes such as age, workclass, education level, and marital status. The columns and their possible values are:

  • age: continuous.
  • workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  • fnlwgt: continuous.
  • education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  • education-num: continuous.
  • marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  • occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  • relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  • race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
  • sex: Female, Male.
  • capital-gain: continuous.
  • capital-loss: continuous.
  • hours-per-week: continuous.
  • native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
  • salary: <=50K or >50K.

Use Pandas Series to clean Data and then classify Data#

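# dropna returns a new DataFrame, so assign the result back
df = df.dropna()
# Count fully duplicated rows, then keep only the first copy of each
df.duplicated().sum()
df.drop_duplicates(keep = 'first' , inplace=True)
# Print the value counts of every column
for col in df.columns:
    print(df[col].value_counts())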
36    898
31    888
34    886
23    876
35    875
     ... 
83      6
85      3
88      3
86      1
87      1
Name: age, Length: 73, dtype: int64
 Private             22673
 Self-emp-not-inc     2540
 Local-gov            2093
 ?                    1836
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name: workclass, dtype: int64
164190    13
123011    13
203488    13
121124    12
148995    12
          ..
209392     1
218551     1
201204     1
362999     1
145522     1
Name: fnlwgt, Length: 21648, dtype: int64
 HS-grad         10494
 Some-college     7282
 Bachelors        5353
 Masters          1722
 Assoc-voc        1382
 11th             1175
 Assoc-acdm       1067
 10th              933
 7th-8th           645
 Prof-school       576
 9th               514
 12th              433
 Doctorate         413
 5th-6th           332
 1st-4th           166
 Preschool          50
Name: education, dtype: int64
9     10494
10     7282
13     5353
14     1722
11     1382
7      1175
12     1067
6       933
4       645
15      576
5       514
8       433
16      413
3       332
2       166
1        50
Name: education-num, dtype: int64
 Married-civ-spouse       14970
 Never-married            10667
 Divorced                  4441
 Separated                 1025
 Widowed                    993
 Married-spouse-absent      418
 Married-AF-spouse           23
Name: marital-status, dtype: int64
 Prof-specialty       4136
 Craft-repair         4094
 Exec-managerial      4065
 Adm-clerical         3768
 Sales                3650
 Other-service        3291
 Machine-op-inspct    2000
 ?                    1843
 Transport-moving     1597
 Handlers-cleaners    1369
 Farming-fishing       992
 Tech-support          927
 Protective-serv       649
 Priv-house-serv       147
 Armed-Forces            9
Name: occupation, dtype: int64
 Husband           13187
 Not-in-family      8292
 Own-child          5064
 Unmarried          3445
 Wife               1568
 Other-relative      981
Name: relationship, dtype: int64
 White                 27795
 Black                  3122
 Asian-Pac-Islander     1038
 Amer-Indian-Eskimo      311
 Other                   271
Name: race, dtype: int64
 Male      21775
 Female    10762
Name: sex, dtype: int64
0        29825
15024      347
7688       284
7298       246
99999      159
         ...  
1639         1
5060         1
6097         1
1455         1
7978         1
Name: capital-gain, Length: 119, dtype: int64
0       31018
1902      202
1977      168
1887      159
1485       51
        ...  
2467        1
1539        1
155         1
2282        1
1411        1
Name: capital-loss, Length: 92, dtype: int64
40    15204
50     2817
45     1823
60     1475
35     1296
      ...  
92        1
74        1
94        1
82        1
87        1
Name: hours-per-week, Length: 94, dtype: int64
 United-States                 29153
 Mexico                          639
 ?                               582
 Philippines                     198
 Germany                         137
 Canada                          121
 Puerto-Rico                     114
 El-Salvador                     106
 India                           100
 Cuba                             95
 England                          90
 Jamaica                          81
 South                            80
 China                            75
 Italy                            73
 Dominican-Republic               70
 Vietnam                          67
 Japan                            62
 Guatemala                        62
 Poland                           60
 Columbia                         59
 Taiwan                           51
 Haiti                            44
 Iran                             43
 Portugal                         37
 Nicaragua                        34
 Peru                             31
 France                           29
 Greece                           29
 Ecuador                          28
 Ireland                          24
 Hong                             20
 Cambodia                         19
 Trinadad&Tobago                  19
 Laos                             18
 Thailand                         18
 Yugoslavia                       16
 Outlying-US(Guam-USVI-etc)       14
 Honduras                         13
 Hungary                          13
 Scotland                         12
 Holand-Netherlands                1
Name: native-country, dtype: int64
 <=50K    24698
 >50K      7839
Name: salary, dtype: int64

First I drop rows with missing values, then I drop duplicate rows and keep the first occurrence of each, which reduces the data from 32561 to 32537 rows. I also use value_counts() to count the values in each column. The counts show many question marks in the columns "workclass", "occupation", and "native-country", so I replace those question marks with the most frequent value in each column.

# regex=False treats '?' as a literal character and avoids the pandas FutureWarning
df['workclass'] = df['workclass'].str.replace('?', 'Private', regex=False)
df['occupation'] = df['occupation'].str.replace('?', 'Prof-specialty', regex=False)
df['native-country'] = df['native-country'].str.replace('?', 'United-States', regex=False)
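The replacement values above are typed in by hand after reading the value counts. As a rough sketch of the same idea, the most frequent value can also be looked up with mode(). The toy DataFrame and its values are made up for illustration, and the whitespace stripping is an extra step that the project code does not perform.

```python
import pandas as pd

# Toy frame standing in for the salary data (values are illustrative only).
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " Private", " State-gov", " ?"],
    "occupation": [" Sales", " Sales", " ?", " Tech-support", " Sales"],
})

for col in ["workclass", "occupation"]:
    # Strip the leading space so "?" can be matched exactly.
    cleaned = toy[col].str.strip()
    # Most frequent value among the known (non-"?") entries.
    most_common = cleaned[cleaned != "?"].mode()[0]
    toy[col] = cleaned.replace("?", most_common)

print(toy)
```

Looking the value up this way keeps the code correct if the data changes, at the cost of being less explicit about which value is filled in.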

After replacing the question marks, I found that many columns still had a large number of distinct values, which makes the charts hard to read, so I merged some values into broader groups. For example, I merge workclass values into groups such as self-emp, gov, and unemp, and I reduce education to five categories. I only regroup three columns (workclass, education, and marital-status) because I need the other columns unchanged for the charts.

# The raw values carry a leading space (e.g. ' Private'), so regex=True is used to match the category name inside each value
df["workclass"].replace(["Self-emp-not-inc","Self-emp-inc"],"self-emp",inplace = True,regex = True)
df["workclass"].replace(["Federal-gov","Local-gov"],"gov",inplace = True,regex = True)
df["workclass"].replace(["Without-pay","Never-worked"],"unemp",inplace=True,regex=True)
df['education'].replace(['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th','10th', '11th', '12th','HS-grad'], 'lower' ,inplace = True , regex = True)
df['education'].replace(['Assoc-voc', 'Assoc-acdm', 'Prof-school', 'Some-college'], 'medium' , inplace = True , regex = True)
df['marital-status'].replace(['Married-civ-spouse', 'Married-AF-spouse'], 'married' , inplace=True , regex = True)
df['marital-status'].replace(['Divorced', 'Separated','Widowed', 'Married-spouse-absent' , 'Never-married'] ,"single",inplace = True,regex=True)
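The grouping above can also be written with a single dictionary that maps each original value to its group. This is only a sketch on a made-up Series. The names `groups` and `grouped` are hypothetical, and stripping the whitespace first is an assumption that the project code does not make.

```python
import pandas as pd

# Toy workclass values with the same leading space as the raw data.
workclass = pd.Series(
    [" Private", " Self-emp-inc", " Local-gov", " Without-pay", " Self-emp-not-inc"]
)

# One dictionary describes the whole regrouping.
groups = {
    "Self-emp-not-inc": "self-emp",
    "Self-emp-inc": "self-emp",
    "Federal-gov": "gov",
    "Local-gov": "gov",
    "Without-pay": "unemp",
    "Never-worked": "unemp",
}

# Strip whitespace so the keys match exactly; unmapped values stay unchanged.
grouped = workclass.str.strip().replace(groups)
print(grouped.tolist())
```

With exact matching, a value that is missing from the dictionary is simply left as it is, which makes forgotten categories easy to spot in value_counts().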

Data visualization#

To understand the data better, I make several charts.

alt.Chart(df.sample(5000)).mark_bar().encode(
    x = alt.X("education",scale = alt.Scale(zero=False)),
    y=alt.Y("fnlwgt",scale=alt.Scale(zero=False)),
    color = "salary"
).facet(
    "sex"
)
alt.Chart(df.sample(5000)).mark_bar().encode(
    x=alt.X('count()', stack="normalize"),
    y='education-num',
    color='salary'
).facet(
    "sex"
)

Here I make two charts. Altair only allows 5000 rows by default, so I plot a random sample of 5000 rows from df. In both charts, I use salary for the color and facet to split the chart into male and female. As the second chart shows, a higher education level means a person is more likely to earn more than 50k, and almost everyone with a doctorate earns more than 50k. The share of women earning more than 50k is generally lower than the share of men.
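The normalized bar chart shows shares rather than counts. As a sketch, the same shares can be computed as numbers with pd.crosstab and normalize="index". The toy data below is made up and is not taken from the salary dataset.

```python
import pandas as pd

# Toy data: education-num and salary for six made-up people.
toy = pd.DataFrame({
    "education-num": [9, 9, 13, 13, 13, 16],
    "salary": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"],
})

# normalize="index" makes every row sum to 1, like stack="normalize" in the chart.
shares = pd.crosstab(toy["education-num"], toy["salary"], normalize="index")
print(shares)
```

Each row of `shares` gives the fraction of people at that education level in each salary group, which is the quantity the normalized bars display.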

sns.displot(data=df, x="age", hue="salary", kind="kde", height=6, aspect=1)
[Figure: kernel density estimate of age, colored by salary]
alt.Chart(df.sample(5000)).mark_point().encode(
    x="age",
    y='fnlwgt',
    color='salary',
    tooltip='education-num'
).facet(
    "sex"
).interactive()

Here I make two charts, one with seaborn and one with Altair; the Altair chart is interactive. As the charts show, a large number of the people who earn more than 50k are between 20 and 40 years old, and the number begins to decline after the age of 40.

plt.figure(figsize=(12,5))
sns.countplot(data =df , x = 'workclass', hue = 'salary')
plt.show()
[Figure: count of people in each workclass, split by salary]

According to the bar chart, most people work in private companies. Judging by the proportions within each workclass, however, I think self-employed people are more likely to earn more than 50k.
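A count plot makes proportions hard to read when the groups differ a lot in size. One way to check a proportion claim like the one above is to compute the share of high earners per group directly. The sketch below uses made-up rows, so its numbers say nothing about the real data.

```python
import pandas as pd

# Toy rows: four private employees and two self-employed people.
toy = pd.DataFrame({
    "workclass": ["Private", "Private", "Private", "Private", "self-emp", "self-emp"],
    "salary": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K"],
})

# The mean of a boolean column is the fraction of True values in each group.
high_share = (toy["salary"] == ">50K").groupby(toy["workclass"]).mean()
print(high_share)
```

In this toy example the larger group has the lower share, which is the kind of pattern that counts alone can hide.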

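# Relabel every country other than the United States as 'Others' in one vectorized step
# (the raw values carry a leading space, hence ' United-States')
df.loc[df['native-country'] != ' United-States', 'native-country'] = 'Others'
# value_counts sorts by frequency, so the United States slice comes first
plt.pie(df['native-country'].value_counts() , labels = ['US' ,'Others'] , autopct = '%0.2f')
plt.show()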
[Figure: pie chart of native-country, United States versus others]

Here I collapse the native-country column into two values to make a clearer pie chart. The pie chart shows that most of the data comes from the United States, so the other countries are not well represented. For now I ignore the effect of region and treat all of the data as if it came from the United States.

encoder = LabelEncoder()
df["salary_num"] = df["salary"]
df['salary_num'] = encoder.fit_transform(df['salary_num'])
sns.histplot(df[df['salary_num'] ==0]['marital-status'])
sns.histplot(df[df['salary_num'] ==1]['marital-status'] , color='red')
[Figure: histogram of marital-status for salary_num 0 and for salary_num 1 (red)]

I use LabelEncoder to encode salary as 0 and 1, which makes it easier to draw the chart. I store the result in a new column, salary_num, because I need to drop this column again before I use train_test_split. The chart shows that most of the people who earn more than 50k are married. I use seaborn for this chart because I think it makes the comparison clearer.

sns.kdeplot(data=df, x='hours-per-week', hue='salary')
[Figure: kernel density estimate of hours-per-week, colored by salary]

Because of labor laws, most people work 40 hours a week, so the peak at 40 hours is expected.

plt.figure(figsize=(11,7))
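# Correlation is only defined for numeric columns, so select them explicitly
sns.heatmap(df.select_dtypes(include="number").corr(),annot=True)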
plt.show()
[Figure: correlation heat map of the numeric columns]

The heat map shows the pairwise correlations between the numeric columns, including salary_num.
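The numbers inside a correlation heat map come from a correlation matrix. The sketch below builds one on a made-up DataFrame to show that only numeric columns take part. The column names and values are illustrative assumptions.

```python
import pandas as pd

# Toy frame with two numeric columns and one string column.
toy = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "hours": [10, 20, 30, 40],
    "sex": ["M", "F", "M", "F"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = toy.select_dtypes(include="number").corr()
print(corr)
```

Here `age` and `hours` rise together in a straight line, so their correlation is 1, and the string column `sex` does not appear in the matrix at all.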

Decision Tree Classifier#

encoder = LabelEncoder()
df = df.drop(["salary_num"],axis =1)
df['sex'] = encoder.fit_transform(df['sex'])
df['workclass'] = encoder.fit_transform(df["workclass"])
df["marital-status"] = encoder.fit_transform(df["marital-status"])
df["race"] = encoder.fit_transform(df["race"])
df["education"] = encoder.fit_transform(df["education"])
df["occupation"] = encoder.fit_transform(df["occupation"])
df["native-country"] = encoder.fit_transform(df["native-country"])
df["relationship"] = encoder.fit_transform(df["relationship"])

To split the data and build the tree, I need to convert the string values to integers. Here I use LabelEncoder to convert those columns. The link to its documentation is in the references.
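LabelEncoder assigns integers to the sorted distinct values of a column. The sketch below encodes two made-up columns in a loop and keeps one encoder per column in a hypothetical `encoders` dictionary, so the codes can be turned back into the original strings. This is a variation on the code above, which reuses a single encoder.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy columns with the same leading space as the raw data.
toy = pd.DataFrame({
    "sex": [" Male", " Female", " Male"],
    "race": [" White", " Black", " White"],
})

encoders = {}
for col in ["sex", "race"]:
    # One encoder per column, so each mapping is remembered.
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Recover the original strings from the integer codes.
original_sex = encoders["sex"].inverse_transform(toy["sex"])
print(toy)
print(list(original_sex))
```

The codes follow alphabetical order, so they carry no real ranking. A decision tree can still split on them, but the order should not be read as meaningful.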

input_cols = [c for c in df.columns if c != "salary"]
X = df[input_cols]
y = df["salary"]
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.9,random_state=0)

This step splits the data. I set the "salary" column as y and all other columns as X, and I use 90% of the rows as the training set.
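As a small sketch of what the split does, the example below applies train_test_split to 100 made-up rows. The stratify argument is an optional extra that the project code does not use. It asks for roughly the same class balance in both parts.

```python
from sklearn.model_selection import train_test_split

# 100 made-up samples: 75 of class 0 and 25 of class 1.
X = [[i] for i in range(100)]
y = [0] * 75 + [1] * 25

# train_size=0.9 puts 90 rows in the training set and 10 in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.9, random_state=0, stratify=y
)
print(len(X_train), len(X_test))
```

Fixing random_state makes the split reproducible, so the reported accuracy does not change from run to run because of the split.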

clf = DecisionTreeClassifier(max_leaf_nodes=30, max_depth= 20)
clf.fit(X_train,y_train)
fig = plt.figure(figsize=(20,10))
_=plot_tree(clf,
feature_names=clf.feature_names_in_,
class_names=clf.classes_,
filled=True)
[Figure: the fitted decision tree]

This step builds the decision tree. For a new person, we can follow the tree from the root according to their attributes and reach a leaf that classifies their salary as above or below 50k.
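Classifying a new person is done with predict. The sketch below fits a small tree on six made-up people and classifies one new row. The column `married` and all values are assumptions for illustration, not the project's data.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Six made-up people; salary is separable by education-num alone.
X = pd.DataFrame({
    "education-num": [9, 9, 13, 16, 14, 10],
    "married": [0, 1, 1, 1, 0, 0],
})
y = ["<=50K", "<=50K", ">50K", ">50K", ">50K", "<=50K"]

clf = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0)
clf.fit(X, y)

# A new person must have the same columns, in the same order, as X.
new_person = pd.DataFrame({"education-num": [15], "married": [1]})
pred = clf.predict(new_person)[0]
print(pred)
```

For the real model, a new person would first need the same cleaning and encoding steps as the training data before being passed to predict.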

pd.Series(clf.feature_importances_,index=clf.feature_names_in_).sort_values(ascending = True)
workclass         0.000000
relationship      0.000000
race              0.000000
sex               0.000000
native-country    0.000000
fnlwgt            0.003259
occupation        0.004738
education         0.013617
hours-per-week    0.027949
age               0.035311
capital-loss      0.056571
capital-gain      0.205799
education-num     0.221222
marital-status    0.431533
dtype: float64

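This step ranks the features by importance. The tree relies mostly on marital-status, education-num, and capital-gain, while workclass, relationship, race, sex, and native-country have zero importance, meaning they contribute nothing to the splits.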
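Feature importances are normalized so that they add up to 1, and a feature that the tree never splits on gets 0. The sketch below shows both properties on a made-up dataset in which one column, named `noise` here, is constant and therefore cannot be used for a split.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data: education-num separates the classes, noise is constant.
X = pd.DataFrame({
    "education-num": [9, 9, 13, 16, 14, 10],
    "noise": [1, 1, 1, 1, 1, 1],
})
y = [0, 0, 1, 1, 1, 0]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=True))
```

The same check applies to the table above, where the listed importances add up to approximately 1.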

print(clf.score(X_train,y_train))
print(clf.score(X_test,y_test))
accuracy = clf.score(X_test,y_test)*100
f"Accuracy on Test Data : {accuracy} %."
0.8570501656251067
0.8607867240319607
'Accuracy on Test Data : 86.07867240319607 %.'

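Here I print the training and test accuracy, using an f-string for the test accuracy. The training accuracy is about 85.7% and the test accuracy is about 86.1%. The two values are close, so the tree does not appear to overfit.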
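One way to put the accuracy in context is to compare it with a baseline that always predicts the more common class. The sketch below uses the salary counts printed earlier for the full cleaned dataset. The test set is a random 10% sample, so its own baseline may differ slightly.

```python
# Salary counts from the value_counts output of the cleaned data.
counts = {"<=50K": 24698, ">50K": 7839}

# Accuracy of always predicting the most common class.
baseline = max(counts.values()) / sum(counts.values())
print(f"Majority-class baseline: {baseline:.3f}")
```

The baseline is about 0.759, so a test accuracy of about 0.861 is roughly ten percentage points better than always guessing "<=50K".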

Summary#


In this project, I first cleaned the data and then visualized it. Next I built a decision tree and tested its accuracy. With this decision tree, we can take a new person, follow the splits according to their attributes, and predict whether their salary is higher than 50k.

References#


  • What is the source of your dataset(s)?

Dataset source: https://www.kaggle.com/datasets/ayessa/salary-prediction-classification

  • List any other references that you found helpful.

LabelEncoder: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
heatmap: https://www.geeksforgeeks.org/display-the-pandas-dataframe-in-heatmap-style/
seaborn: https://www.section.io/engineering-education/seaborn-tutorial/

Submission#


https://deepnote.com/workspace/personal-workspace-1d4e-9689-6b3764b5-4ff4-4c89-a8e5-aadbe2a1fd0c/project/Project-Template-Duplicate-e5710c39-a200-48fe-b5a8-872f4ca47b22//CourseProject.ipynb
