Data science occupation prediction#

Author: Anjun Hou

Course Project, UC Irvine, Math 10, S23

Introduction#

This project analyzes data science salaries and attempts to predict individuals' job titles using features such as company location, company size, and salary.

Visualizing the data#

import pandas as pd
import altair as alt
df = pd.read_csv('ds_salaries.csv')
df
work_year experience_level employment_type job_title salary salary_currency salary_in_usd employee_residence remote_ratio company_location company_size
0 2023 SE FT Principal Data Scientist 80000 EUR 85847 ES 100 ES L
1 2023 MI CT ML Engineer 30000 USD 30000 US 100 US S
2 2023 MI CT ML Engineer 25500 USD 25500 US 100 US S
3 2023 SE FT Data Scientist 175000 USD 175000 CA 100 CA M
4 2023 SE FT Data Scientist 120000 USD 120000 CA 100 CA M
... ... ... ... ... ... ... ... ... ... ... ...
3750 2020 SE FT Data Scientist 412000 USD 412000 US 100 US L
3751 2021 MI FT Principal Data Scientist 151000 USD 151000 US 100 US L
3752 2020 EN FT Data Scientist 105000 USD 105000 US 100 US S
3753 2020 EN CT Business Data Analyst 100000 USD 100000 US 100 US L
3754 2021 SE FT Data Science Manager 7000000 INR 94665 IN 50 IN L

3755 rows × 11 columns

Since there are too many occupations (93 total), we restrict the data to the six most common job titles, which are the only ones with over 100 entries each.

df = df[df['job_title'].isin(df['job_title'].value_counts()[:6].index)].copy()  # keep the titles with over 100 entries; .copy() avoids SettingWithCopyWarning later
df
work_year experience_level employment_type job_title salary salary_currency salary_in_usd employee_residence remote_ratio company_location company_size
3 2023 SE FT Data Scientist 175000 USD 175000 CA 100 CA M
4 2023 SE FT Data Scientist 120000 USD 120000 CA 100 CA M
7 2023 SE FT Data Scientist 219000 USD 219000 CA 0 CA M
8 2023 SE FT Data Scientist 141000 USD 141000 CA 0 CA M
9 2023 SE FT Data Scientist 147100 USD 147100 US 0 US M
... ... ... ... ... ... ... ... ... ... ... ...
3744 2020 SE FT Machine Learning Engineer 40000 EUR 45618 HR 100 HR S
3746 2021 MI FT Data Scientist 160000 SGD 119059 SG 100 IL M
3748 2021 MI FT Data Engineer 24000 EUR 28369 MT 50 MT L
3750 2020 SE FT Data Scientist 412000 USD 412000 US 100 US L
3752 2020 EN FT Data Scientist 105000 USD 105000 US 100 US S

2985 rows × 11 columns
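As a quick check (not shown in the original output), the surviving titles and their counts can be listed directly:

df['job_title'].value_counts()  # the six remaining titles, each with over 100 rows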

df['isFulltime_bool'] = (df['employment_type'] == 'FT')  # boolean flag: True for full-time employment

Data visualization#

Below are some bar charts showing the distribution of the occupations with respect to different columns.

alt.Chart(df).mark_bar().encode(
    x = 'job_title',
    y = 'mean(salary_in_usd)',  # aggregate, so there is one bar per title rather than thousands of overlapping bars
    tooltip = 'mean(salary_in_usd)'
).properties(
    width=550,
    height=250
)

Using data scientists as an example, it can be seen that the largest share of data scientists are in the US, although this could partly be due to collection bias toward US datapoints. Location is nonetheless an important feature to consider alongside salary and salary in USD.

alt.Chart(df[df['job_title'] == 'Data Scientist']).mark_bar().encode(
    x = 'company_location',
    y = 'count()',
    tooltip = 'count()'
).properties(
    width=550,
    height=250,
    title = 'Data scientist location distribution'
)

Model training#

In order to predict the occupation, we will use a random forest classifier (RFC) and a k-nearest neighbors (KNN) model.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

In order to use the features that are not numerical, one-hot encoding can be applied so that each distinct category becomes its own binary variable; below is an example.

le = LabelEncoder()
oh = OneHotEncoder()
let = le.fit_transform(df['company_size'])  # encode 'L'/'M'/'S' as integers 0-2
oht = oh.fit_transform(let.reshape(-1,1)).toarray()  # expand each integer into a one-hot row
df_temp = pd.DataFrame(oht, columns=le.inverse_transform(range(3)))  # restore 'L'/'M'/'S' as column names

The df_temp dataframe has converted the series of L, M, S company sizes into three boolean columns, one per company size, which the machine learning models can take as input features.

df_temp
L M S
0 0.0 1.0 0.0
1 0.0 1.0 0.0
2 0.0 1.0 0.0
3 0.0 1.0 0.0
4 0.0 1.0 0.0
... ... ... ...
2980 0.0 0.0 1.0
2981 0.0 1.0 0.0
2982 1.0 0.0 0.0
2983 1.0 0.0 0.0
2984 0.0 0.0 1.0

2985 rows × 3 columns
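As an aside, pandas can produce the same matrix in a single call with pd.get_dummies; this is an equivalent sketch, not what the project uses:

df_temp_alt = pd.get_dummies(df['company_size'])  # columns 'L', 'M', 'S' with 0/1 indicators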

def make_onehot(c_name):
    """
    Create onehot matrix based on given column name, refer to source 2 for how this works
    Input: column name
    Output: onehot matrix of corresponding column
    """
    le = LabelEncoder()
    oh = OneHotEncoder()
    let = le.fit_transform(df[c_name])
    values = len(df[c_name].unique())
    oht = oh.fit_transform(let.reshape(-1,1)).toarray()
    df_ = pd.DataFrame(oht, columns=le.inverse_transform(range(values)))
    # refer to source 3 for the labeling
    return df_
one_hots = ['experience_level', 'company_size', 'company_location']
features = ['salary_in_usd', 'salary', 'isFulltime_bool']
df_feats = df[features]
for col in one_hots:
    df_feats = pd.concat([df_feats.reset_index(drop=True), make_onehot(col).reset_index(drop=True)], axis=1)
    # refer to source 1 for reset_index

With all of the one-hot matrices concatenated, the feature dataframe looks like this:

df_feats
salary_in_usd salary isFulltime_bool EN EX MI SE L M S ... PR PT RO SG SI TH TR UA US VN
0 175000 175000 True 0.0 0.0 0.0 1.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 120000 120000 True 0.0 0.0 0.0 1.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 219000 219000 True 0.0 0.0 0.0 1.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 141000 141000 True 0.0 0.0 0.0 1.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 147100 147100 True 0.0 0.0 0.0 1.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2980 45618 40000 True 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2981 119059 160000 True 0.0 0.0 1.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2982 28369 24000 True 0.0 0.0 1.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2983 412000 412000 True 0.0 0.0 0.0 1.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2984 105000 105000 True 1.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0

2985 rows × 62 columns

Model making#

Below is how the models and the training data are made. The RFC classifies an input by aggregating the votes of a collection of decision trees. The KNN classifier assigns an input to the class that is most common among its k nearest training points.

rfc = RandomForestClassifier(n_estimators=140, max_leaf_nodes=180)  # instantiate
X_train, X_test, y_train, y_test = train_test_split(df_feats, df['job_title'])  # split
rfc.fit(X_train, y_train)  # fit
RandomForestClassifier(max_leaf_nodes=180, n_estimators=140)

Below are the scoring results for train and test sets.

rfc.score(X_train, y_train)
0.7010723860589813
rfc.score(X_test, y_test)
# about 51% isn't bad
0.5060240963855421

The difference between the train and test scores is about 20 percentage points, which indicates some mild overfitting, though the max_leaf_nodes cap limits how far the trees can specialize to the training set.
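One way to sanity-check this gap (an addition, not in the original notebook) is cross-validation, which averages the test score over several different train/test splits:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation; each fold is held out once as a test set (sketch)
scores = cross_val_score(rfc, df_feats, df['job_title'], cv=5)
print(scores.mean(), scores.std())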

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
KNeighborsClassifier()
knn.score(X_test, y_test)
# not as good
0.4404283801874163
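Two caveats worth noting (not explored in the original notebook): KNN is distance-based, so the unscaled salary columns dominate the 0/1 one-hot columns, and the default of five neighbors is not necessarily optimal. A hypothetical sketch addressing both:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scale features so salary does not swamp the one-hot columns, then sweep k (sketch)
for k in [3, 5, 10, 20, 50]:
    knn_k = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    knn_k.fit(X_train, y_train)
    print(k, knn_k.score(X_test, y_test))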

Summary#

From the analysis of the KNN and RFC models, neither produces definitively good predictions. Roughly 51% from the RFC isn't bad, but it is not incredibly accurate either. The KNN falls a bit behind at roughly 44%, but it serves to demonstrate the advantage of the RFC. Below are confusion matrices showing the models' most common mistakes; note that they are computed over the full dataset, including the training rows, so they look somewhat more favorable than the test scores alone would suggest.
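For a numeric complement to the visual matrices (an addition, not in the original notebook), scikit-learn can report per-class precision and recall on the held-out test split only:

from sklearn.metrics import classification_report, confusion_matrix

# per-class precision/recall on the test split (sketch)
print(classification_report(y_test, rfc.predict(X_test)))
print(confusion_matrix(y_test, rfc.predict(X_test)))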

df['rpred'] = rfc.predict(df_feats)  # random forest predictions for every row
df['kpred'] = knn.predict(df_feats)  # KNN predictions for every row
alt.data_transformers.enable('default', max_rows=15000)

c = alt.Chart(df).mark_rect().encode(
    x="job_title:N",
    y="rpred:N",
    color=alt.Color("count()", scale=alt.Scale(scheme="tableau20"))
)

c_text = alt.Chart(df).mark_text(color="white").encode(
    x="job_title:N",
    y="rpred:N",
    text="count()"
)

(c+c_text).properties(
    height=400,
    width=400,
    title = 'RFC prediction confusion matrix'
)
alt.data_transformers.enable('default', max_rows=15000)

c = alt.Chart(df).mark_rect().encode(
    x="job_title:N",
    y="kpred:N",
    color=alt.Color("count()", scale=alt.Scale(scheme="tableau20"))
)

c_text = alt.Chart(df).mark_text(color="white").encode(
    x="job_title:N",
    y="kpred:N",
    text="count()"
)

(c+c_text).properties(
    height=400,
    width=400,
    title = 'KNN prediction confusion matrix'
)

It can also be seen that many data scientists were erroneously predicted as data engineers; with so many of the occupations being "data" related, in retrospect the model has done quite well. Possible alterations might include grouping related data occupations together, as sketched below. Another possible change is to use only the US entries, since the distribution of nations seems rather skewed.
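A hypothetical sketch of the grouping idea, collapsing titles into coarser buckets before training (the particular mapping here is invented for illustration):

# merge similar titles into broader groups (sketch)
title_groups = {
    'Data Scientist': 'Science',
    'Principal Data Scientist': 'Science',
    'Data Engineer': 'Engineering',
    'Machine Learning Engineer': 'Engineering',
    'ML Engineer': 'Engineering',
    'Business Data Analyst': 'Analysis',
}
df['job_group'] = df['job_title'].map(title_groups).fillna('Other')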

Extra test: Only US Data#

In an attempt to improve the accuracy, here are the same models trained only on U.S. companies.

df = pd.read_csv('ds_salaries.csv')
df_us = df[df['company_location'] == 'US'].copy()  # .copy() avoids SettingWithCopyWarning below
df_us['isFulltime_bool'] = (df_us['employment_type'] == 'FT')
def make_onehot_us(c_name):
    """
    redefined for df_us
    Input: column name
    Output: onehot matrix of corresponding column
    """
    le = LabelEncoder()
    oh = OneHotEncoder()
    let = le.fit_transform(df_us[c_name])
    values = len(df_us[c_name].unique())
    oht = oh.fit_transform(let.reshape(-1,1)).toarray()
    df_ = pd.DataFrame(oht, columns=le.inverse_transform(range(values)))
    # refer to source 3 for the labeling
    return df_
features = ['salary', 'isFulltime_bool']
one_hots = ['experience_level', 'company_size']
df_feats = df_us[features]  # redefining df_feats for df_us
for col in one_hots:
    df_feats = pd.concat([df_feats.reset_index(drop=True), make_onehot_us(col).reset_index(drop=True)], axis=1)
rfc = RandomForestClassifier(n_estimators=140, max_leaf_nodes=180)
X_train, X_test, y_train, y_test = train_test_split(df_feats, df_us['job_title'])
rfc.fit(X_train, y_train)
rfc.score(X_test, y_test)
0.41842105263157897

It would appear that the hypothesis that the skew toward U.S. data was lowering the accuracy is false, and that company location was indeed useful for predicting job titles. Note, though, that this run also drops salary_in_usd and the location one-hots and trains on fewer rows, so the comparison is not perfectly controlled.

References#


Dataset: https://www.kaggle.com/datasets/arnabchaki/data-science-salaries-2023

  1. Pandas concat without increasing the number of rows: https://stackoverflow.com/questions/50368145/pandas-concat-increases-number-of-rows

  2. One-hot matrix mapping: https://stackoverflow.com/questions/38978853/onehotencoding-mapping

  3. Inverting the labels of a one-hot matrix: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

