Data science occupation prediction#
Author: Anjun Hou
Course Project, UC Irvine, Math 10, S23
Introduction#
This project analyzes the income of various samples in an attempt to predict individuals' job titles from features such as company location, company size, and salary.
Visualizing the data#
import pandas as pd
import altair as alt
df = pd.read_csv('ds_salaries.csv')
df
work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2023 | SE | FT | Principal Data Scientist | 80000 | EUR | 85847 | ES | 100 | ES | L |
1 | 2023 | MI | CT | ML Engineer | 30000 | USD | 30000 | US | 100 | US | S |
2 | 2023 | MI | CT | ML Engineer | 25500 | USD | 25500 | US | 100 | US | S |
3 | 2023 | SE | FT | Data Scientist | 175000 | USD | 175000 | CA | 100 | CA | M |
4 | 2023 | SE | FT | Data Scientist | 120000 | USD | 120000 | CA | 100 | CA | M |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3750 | 2020 | SE | FT | Data Scientist | 412000 | USD | 412000 | US | 100 | US | L |
3751 | 2021 | MI | FT | Principal Data Scientist | 151000 | USD | 151000 | US | 100 | US | L |
3752 | 2020 | EN | FT | Data Scientist | 105000 | USD | 105000 | US | 100 | US | S |
3753 | 2020 | EN | CT | Business Data Analyst | 100000 | USD | 100000 | US | 100 | US | L |
3754 | 2021 | SE | FT | Data Science Manager | 7000000 | INR | 94665 | IN | 50 | IN | L |
3755 rows Ă— 11 columns
Since there are too many occupations (93 in total), we will look only at the top 6, as those are the only occupations with over 100 entries each.
df = df[df['job_title'].isin(df['job_title'].value_counts()[:6].keys())].copy() # keep only the titles with over 100 people; .copy() avoids SettingWithCopyWarning below
df
work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size | |
---|---|---|---|---|---|---|---|---|---|---|---|
3 | 2023 | SE | FT | Data Scientist | 175000 | USD | 175000 | CA | 100 | CA | M |
4 | 2023 | SE | FT | Data Scientist | 120000 | USD | 120000 | CA | 100 | CA | M |
7 | 2023 | SE | FT | Data Scientist | 219000 | USD | 219000 | CA | 0 | CA | M |
8 | 2023 | SE | FT | Data Scientist | 141000 | USD | 141000 | CA | 0 | CA | M |
9 | 2023 | SE | FT | Data Scientist | 147100 | USD | 147100 | US | 0 | US | M |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3744 | 2020 | SE | FT | Machine Learning Engineer | 40000 | EUR | 45618 | HR | 100 | HR | S |
3746 | 2021 | MI | FT | Data Scientist | 160000 | SGD | 119059 | SG | 100 | IL | M |
3748 | 2021 | MI | FT | Data Engineer | 24000 | EUR | 28369 | MT | 50 | MT | L |
3750 | 2020 | SE | FT | Data Scientist | 412000 | USD | 412000 | US | 100 | US | L |
3752 | 2020 | EN | FT | Data Scientist | 105000 | USD | 105000 | US | 100 | US | S |
2985 rows Ă— 11 columns
df['isFulltime_bool'] = (df['employment_type'] == 'FT')
Data visualization#
Below are some bar charts demonstrating the distribution of the occupations with respect to different columns.
alt.Chart(df).mark_bar().encode(
x = 'job_title',
y = 'salary_in_usd',
tooltip = 'salary_in_usd'
).properties(
width=550,
height=250
)
Using data scientists as an example, it can be seen that the largest share of data scientists are in the US, although this could also reflect collection bias toward US data points. Company location is nonetheless an important feature to consider with respect to salary and salary in USD.
alt.Chart(df[df['job_title'] == 'Data Scientist']).mark_bar().encode(
x = 'company_location',
y = 'count()',
tooltip = 'count()'
).properties(
width=550,
height=250,
title = 'Data scientist location distribution'
)
Model training#
In order to predict the occupation, we will use a random forest classifier (RFC) and a k-nearest neighbors (KNN) model.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
In order to use the features that are not numerical, one-hot encoding can be applied so that each distinct category becomes its own binary variable; below is an example.
le = LabelEncoder()
oh = OneHotEncoder()
let = le.fit_transform(df['company_size'])
oht = oh.fit_transform(let.reshape(-1,1)).toarray()
df_temp = pd.DataFrame(oht, columns=le.inverse_transform(range(3)))
The df_temp DataFrame has converted the series of L, M, S company sizes into 3 columns of 0/1 indicators, one per company size, which the machine learning models can take as input features.
df_temp
L | M | S | |
---|---|---|---|
0 | 0.0 | 1.0 | 0.0 |
1 | 0.0 | 1.0 | 0.0 |
2 | 0.0 | 1.0 | 0.0 |
3 | 0.0 | 1.0 | 0.0 |
4 | 0.0 | 1.0 | 0.0 |
... | ... | ... | ... |
2980 | 0.0 | 0.0 | 1.0 |
2981 | 0.0 | 1.0 | 0.0 |
2982 | 1.0 | 0.0 | 0.0 |
2983 | 1.0 | 0.0 | 0.0 |
2984 | 0.0 | 0.0 | 1.0 |
2985 rows Ă— 3 columns
def make_onehot(c_name):
"""
Create onehot matrix based on given column name, refer to source 2 for how this works
Input: column name
Output: onehot matrix of corresponding column
"""
le = LabelEncoder()
oh = OneHotEncoder()
let = le.fit_transform(df[c_name])
values = len(df[c_name].unique())
oht = oh.fit_transform(let.reshape(-1,1)).toarray()
df_ = pd.DataFrame(oht, columns=le.inverse_transform(range(values)))
# refer to source 3 for the labeling
return df_
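As an aside, pandas can produce the same indicator columns in a single call. The sketch below is an equivalent alternative (column ordering may differ slightly), not what the rest of this project uses.

# Hedged sketch: pd.get_dummies replaces the LabelEncoder/OneHotEncoder
# combination above with one call (dtype=float matches the 0.0/1.0 output).
def make_onehot_simple(c_name):
    return pd.get_dummies(df[c_name], dtype=float)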
one_hots = ['experience_level', 'company_size', 'company_location']
features = ['salary_in_usd', 'salary', 'isFulltime_bool']
df_feats = df[features]
for col in one_hots:
df_feats = pd.concat([df_feats.reset_index(drop=True), make_onehot(col).reset_index(drop=True)], axis=1)
# refer to source 1 for reset_index
With all the different one-hot matrices added, the features DataFrame currently looks like this:
df_feats
salary_in_usd | salary | isFulltime_bool | EN | EX | MI | SE | L | M | S | ... | PR | PT | RO | SG | SI | TH | TR | UA | US | VN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 175000 | 175000 | True | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 120000 | 120000 | True | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 219000 | 219000 | True | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 141000 | 141000 | True | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 147100 | 147100 | True | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2980 | 45618 | 40000 | True | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2981 | 119059 | 160000 | True | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2982 | 28369 | 24000 | True | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2983 | 412000 | 412000 | True | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
2984 | 105000 | 105000 | True | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
2985 rows Ă— 62 columns
Model making#
Below is how the models and the training data are made. The RFC classifies a given input by aggregating the votes of a collection of decision trees. KNN classifies a given input by the majority label among its k nearest neighbors in feature space.
rfc = RandomForestClassifier(n_estimators=140, max_leaf_nodes=180) # instantiate
X_train, X_test, y_train, y_test = train_test_split(df_feats, df['job_title']) # split
rfc.fit(X_train, y_train) # fit
RandomForestClassifier(max_leaf_nodes=180, n_estimators=140)
Below are the scoring results for train and test sets.
rfc.score(X_train, y_train)
0.7010723860589813
rfc.score(X_test, y_test)
# about 51% isn't bad
0.5060240963855421
The gap between the training and test scores is about 20 percentage points, which is understandable; a gap of this size suggests only mild overfitting rather than a severely overfit model.
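A single train/test split can overstate or understate that gap. As a quick sanity check (a minimal sketch using scikit-learn's cross_val_score, not part of the original analysis), cross-validation averages the test score over several splits:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# Hedged sketch: 5-fold cross-validation gives a more stable estimate
# of test accuracy than a single train/test split.
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=140, max_leaf_nodes=180),
    df_feats, df['job_title'], cv=5
)
cv_scores.mean()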
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
KNeighborsClassifier()
knn.score(X_test, y_test)
# not as good
0.4404283801874163
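One likely reason KNN falls behind is that it is distance-based: the raw salary columns (tens or hundreds of thousands) dwarf the 0/1 one-hot columns in the distance computation. The sketch below is an added experiment, not part of the original project; it standardizes the features before fitting KNN, which often helps distance-based models.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Hedged sketch: scale every feature to zero mean / unit variance so
# salary does not dominate the distances, then refit KNN.
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier())
knn_scaled.fit(X_train, y_train)
knn_scaled.score(X_test, y_test)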
Summary#
From the analysis of the KNN and RFC models, it can be said that neither produces definitively good predictions. Although roughly 51% from the RFC isn't bad, it is not incredibly accurate either. The KNN falls a bit behind at about 44%, but it serves to demonstrate the advantages of an RFC. Below are confusion matrices analysing the most common mistakes of the models.
df['rpred'] = rfc.predict(df_feats)
df['kpred'] = knn.predict(df_feats)
alt.data_transformers.enable('default', max_rows=15000)
c = alt.Chart(df).mark_rect().encode(
x="job_title:N",
y="rpred:N",
color=alt.Color("count()", scale=alt.Scale(scheme="tableau20"))
)
c_text = alt.Chart(df).mark_text(color="white").encode(
x="job_title:N",
y="rpred:N",
text="count()"
)
(c+c_text).properties(
height=400,
width=400,
title = 'RFC prediction confusion matrix'
)
alt.data_transformers.enable('default', max_rows=15000)
c = alt.Chart(df).mark_rect().encode(
x="job_title:N",
y="kpred:N",
color=alt.Color("count()", scale=alt.Scale(scheme="tableau20"))
)
c_text = alt.Chart(df).mark_text(color="white").encode(
x="job_title:N",
y="kpred:N",
text="count()"
)
(c+c_text).properties(
height=400,
width=400,
title = 'KNN prediction confusion matrix'
)
It can also be seen that a lot of data scientists were erroneously predicted as data engineers, although with so many of the occupations being "data" related, in retrospect the model has done quite well. Possible alterations include grouping different data occupations together (a sketch follows below). Another possible change is to use only the US entries, since the distribution of countries seems rather skewed.
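For the first suggested alteration, grouping could be done with a replacement mapping before training. The sketch below is illustrative only; the specific group assignments are assumptions, not part of the original project.

# Hedged sketch: collapse related job titles into broader groups
# (the group assignments here are illustrative assumptions).
title_groups = {
    'Data Scientist': 'Science',
    'Data Analyst': 'Analysis',
    'Data Engineer': 'Engineering',
    'Machine Learning Engineer': 'Engineering',
}
df['job_group'] = df['job_title'].map(title_groups).fillna('Other')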
Extra test: Only US Data#
In an attempt to improve the accuracy, here are the same models trained only on data from U.S. companies.
df = pd.read_csv('ds_salaries.csv')
df_us = df[df['company_location'] == 'US'].copy() # .copy() avoids SettingWithCopyWarning below
df_us['isFulltime_bool'] = (df_us['employment_type'] == 'FT')
def make_onehot_us(c_name):
"""
redefined for df_us
Input: column name
Output: onehot matrix of corresponding column
"""
le = LabelEncoder()
oh = OneHotEncoder()
let = le.fit_transform(df_us[c_name])
values = len(df_us[c_name].unique())
oht = oh.fit_transform(let.reshape(-1,1)).toarray()
df_ = pd.DataFrame(oht, columns=le.inverse_transform(range(values)))
# refer to source 3 for the labeling
return df_
features = ['salary', 'isFulltime_bool']
one_hots = ['experience_level', 'company_size']
df_feats = df_us[features] # redefining df_feats for df_us
for col in one_hots:
df_feats = pd.concat([df_feats.reset_index(drop=True), make_onehot_us(col).reset_index(drop=True)], axis=1)
rfc = RandomForestClassifier(n_estimators=140, max_leaf_nodes=180)
X_train, X_test, y_train, y_test = train_test_split(df_feats, df_us['job_title'])
rfc.fit(X_train, y_train)
rfc.score(X_test, y_test)
0.41842105263157897
It would appear that the hypothesis that the skew toward U.S. data caused the lower accuracy was false: restricting to U.S. companies dropped the test accuracy to about 42%, so company location was indeed useful for predicting job titles.
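One hedged way to probe this conclusion is to inspect the forest's feature importances. The sketch below assumes it is run against the full-data model from the earlier section (i.e., before df_feats and rfc were redefined for the US-only subset).

# Hedged sketch: rank the features the fitted forest relied on most;
# run with the model trained on all company locations.
import pandas as pd
pd.Series(rfc.feature_importances_, index=X_train.columns).sort_values(ascending=False).head(10)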
References#
Dataset: https://www.kaggle.com/datasets/arnabchaki/data-science-salaries-2023
Source 1, pandas concat without increasing the number of rows: https://stackoverflow.com/questions/50368145/pandas-concat-increases-number-of-rows
Source 2, one-hot matrix mapping: https://stackoverflow.com/questions/38978853/onehotencoding-mapping
Source 3, inverting the labels of a one-hot matrix: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html