Data science occupation prediction#

Author: Anjun Hou

Course Project, UC Irvine, Math 10, S23

Introduction#

This project analyzes data science salaries and attempts to predict individuals' job titles using features such as company location, company size, and salary.

Visualizing the data#

import pandas as pd
import altair as alt
df = pd.read_csv('ds_salaries.csv')
df
work_year experience_level employment_type job_title salary salary_currency salary_in_usd employee_residence remote_ratio company_location company_size
0 2023 SE FT Principal Data Scientist 80000 EUR 85847 ES 100 ES L
1 2023 MI CT ML Engineer 30000 USD 30000 US 100 US S
2 2023 MI CT ML Engineer 25500 USD 25500 US 100 US S
3 2023 SE FT Data Scientist 175000 USD 175000 CA 100 CA M
4 2023 SE FT Data Scientist 120000 USD 120000 CA 100 CA M
... ... ... ... ... ... ... ... ... ... ... ...
3750 2020 SE FT Data Scientist 412000 USD 412000 US 100 US L
3751 2021 MI FT Principal Data Scientist 151000 USD 151000 US 100 US L
3752 2020 EN FT Data Scientist 105000 USD 105000 US 100 US S
3753 2020 EN CT Business Data Analyst 100000 USD 100000 US 100 US L
3754 2021 SE FT Data Science Manager 7000000 INR 94665 IN 50 IN L

3755 rows × 11 columns

Since there are too many occupations (93 total), we restrict the data to the six most common job titles, which are the only ones with over 100 entries each.

df = df[df['job_title'].isin(df['job_title'].value_counts()[:6].index)].copy()  # keep the titles with over 100 entries; .copy() avoids SettingWithCopyWarning later
df
work_year experience_level employment_type job_title salary salary_currency salary_in_usd employee_residence remote_ratio company_location company_size
3 2023 SE FT Data Scientist 175000 USD 175000 CA 100 CA M
4 2023 SE FT Data Scientist 120000 USD 120000 CA 100 CA M
7 2023 SE FT Data Scientist 219000 USD 219000 CA 0 CA M
8 2023 SE FT Data Scientist 141000 USD 141000 CA 0 CA M
9 2023 SE FT Data Scientist 147100 USD 147100 US 0 US M
... ... ... ... ... ... ... ... ... ... ... ...
3744 2020 SE FT Machine Learning Engineer 40000 EUR 45618 HR 100 HR S
3746 2021 MI FT Data Scientist 160000 SGD 119059 SG 100 IL M
3748 2021 MI FT Data Engineer 24000 EUR 28369 MT 50 MT L
3750 2020 SE FT Data Scientist 412000 USD 412000 US 100 US L
3752 2020 EN FT Data Scientist 105000 USD 105000 US 100 US S

2985 rows × 11 columns
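As a quick check (not shown in the original output), the surviving titles and their counts can be listed directly:

df['job_title'].value_counts()  # the six remaining titles, each with over 100 rows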

df['isFulltime_bool'] = (df['employment_type'] == 'FT')  # boolean flag: True for full-time employment

Data visualization#

Below are some bar charts showing the distribution of the occupations with respect to different columns.

alt.Chart(df).mark_bar().encode(
    x = 'job_title',
    y = 'mean(salary_in_usd)',  # aggregate, so there is one bar per title rather than thousands of overlapping bars
    tooltip = 'mean(salary_in_usd)'
).properties(
    width=550,
    height=250
)

Using data scientists as an example, it can be seen that the largest share of data scientists are in the US, although this could partly be due to collection bias toward US datapoints. Location is nonetheless an important feature to consider alongside salary and salary in USD.

alt.Chart(df[df['job_title'] == 'Data Scientist']).mark_bar().encode(
    x = 'company_location',
    y = 'count()',
    tooltip = 'count()'
).properties(
    width=550,
    height=250,
    title = 'Data scientist location distribution'
)

Model training#

In order to predict the occupation, we will use a random forest classifier (RFC) and a k-nearest neighbors (KNN) model.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

In order to use the features that are not numerical, one-hot encoding can be applied so that each distinct category becomes its own binary variable; below is an example.

le = LabelEncoder()
oh = OneHotEncoder()
let = le.fit_transform(df['company_size'])  # encode 'L'/'M'/'S' as integers 0-2
oht = oh.fit_transform(let.reshape(-1,1)).toarray()  # expand each integer into a one-hot row
df_temp = pd.DataFrame(oht, columns=le.inverse_transform(range(3)))  # restore 'L'/'M'/'S' as column names

The df_temp dataframe has converted the series of L, M, S company sizes into three boolean columns, one per company size, which the machine learning models can take as input features.

df_temp
L M S
0 0.0 1.0 0.0
1 0.0 1.0 0.0
2 0.0 1.0 0.0
3 0.0 1.0 0.0
4 0.0 1.0 0.0
... ... ... ...
2980 0.0 0.0 1.0
2981 0.0 1.0 0.0
2982 1.0 0.0 0.0
2983 1.0 0.0 0.0
2984 0.0 0.0 1.0

2985 rows × 3 columns
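As an aside, pandas can produce the same matrix in a single call with pd.get_dummies; this is an equivalent sketch, not what the project uses:

df_temp_alt = pd.get_dummies(df['company_size'])  # columns 'L', 'M', 'S' with 0/1 indicators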

def make_onehot(c_name):
    """
    Create onehot matrix based on given column name, refer to source 2 for how this works
    Input: column name
    Output: onehot matrix of corresponding column
    """
    le = LabelEncoder()
    oh = OneHotEncoder()
    let = le.fit_transform(df[c_name])
    values = len(df[c_name].unique())
    oht = oh.fit_transform(let.reshape(-1,1)).toarray()
    df_ = pd.DataFrame(oht, columns=le.inverse_transform(range(values)))
    # refer to source 3 for the labeling
    return df_
one_hots = ['experience_level', 'company_size', 'company_location']
features = ['salary_in_usd', 'salary', 'isFulltime_bool']
df_feats = df[features]
for col in one_hots:
    df_feats = pd.concat([df_feats.reset_index(drop=True), make_onehot(col).reset_index(drop=True)], axis=1)
    # refer to source 1 for reset_index

With all of the one-hot matrices concatenated, the feature dataframe looks like this:

df_feats
salary_in_usd salary isFulltime_bool EN EX MI SE L M S ... PR PT RO SG SI TH TR UA US VN
0 175000 175000 True 0.0 0.0 0.0 1.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 120000 120000 True 0.0 0.0 0.0 1.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 219000 219000 True 0.0 0.0 0.0 1.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 141000 141000 True 0.0 0.0 0.0 1.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 147100 147100 True 0.0 0.0 0.0 1.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2980 45618 40000 True 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2981 119059 160000 True 0.0 0.0 1.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2982 28369 24000 True 0.0 0.0 1.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2983 412000 412000 True 0.0 0.0 0.0 1.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2984 105000 105000 True 1.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0

2985 rows × 62 columns

Model making#

Below is how the models and the training data are made. The RFC classifies an input by aggregating the votes of a collection of decision trees. The KNN classifier assigns an input to the class that is most common among its k nearest training points.

rfc = RandomForestClassifier(n_estimators=140, max_leaf_nodes=180)  # instantiate
X_train, X_test, y_train, y_test = train_test_split(df_feats, df['job_title'])  # split
rfc.fit(X_train, y_train)  # fit
RandomForestClassifier(max_leaf_nodes=180, n_estimators=140)

Below are the scoring results for train and test sets.

rfc.score(X_train, y_train)
0.7010723860589813
rfc.score(X_test, y_test)
# about 51% isn't bad
0.5060240963855421

The difference between the train and test scores is about 20 percentage points, which indicates some mild overfitting, though the max_leaf_nodes cap limits how far the trees can specialize to the training set.
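One way to sanity-check this gap (an addition, not in the original notebook) is cross-validation, which averages the test score over several different train/test splits:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation; each fold is held out once as a test set (sketch)
scores = cross_val_score(rfc, df_feats, df['job_title'], cv=5)
print(scores.mean(), scores.std())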

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
KNeighborsClassifier()
knn.score(X_test, y_test)
# not as good
0.4404283801874163
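Two caveats worth noting (not explored in the original notebook): KNN is distance-based, so the unscaled salary columns dominate the 0/1 one-hot columns, and the default of five neighbors is not necessarily optimal. A hypothetical sketch addressing both:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scale features so salary does not swamp the one-hot columns, then sweep k (sketch)
for k in [3, 5, 10, 20, 50]:
    knn_k = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    knn_k.fit(X_train, y_train)
    print(k, knn_k.score(X_test, y_test))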

Summary#

From the analysis of the KNN and RFC models, neither produces definitively good predictions. Roughly 51% from the RFC isn't bad, but it is not incredibly accurate either. The KNN falls a bit behind at roughly 44%, but it serves to demonstrate the advantage of the RFC. Below are confusion matrices showing the models' most common mistakes; note that they are computed over the full dataset, including the training rows, so they look somewhat more favorable than the test scores alone would suggest.
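For a numeric complement to the visual matrices (an addition, not in the original notebook), scikit-learn can report per-class precision and recall on the held-out test split only:

from sklearn.metrics import classification_report, confusion_matrix

# per-class precision/recall on the test split (sketch)
print(classification_report(y_test, rfc.predict(X_test)))
print(confusion_matrix(y_test, rfc.predict(X_test)))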

df['rpred'] = rfc.predict(df_feats)  # random forest predictions for every row
df['kpred'] = knn.predict(df_feats)  # KNN predictions for every row
alt.data_transformers.enable('default', max_rows=15000)

c = alt.Chart(df).mark_rect().encode(
    x="job_title:N",
    y="rpred:N",
    color=alt.Color("count()", scale=alt.Scale(scheme="tableau20"))
)

c_text = alt.Chart(df).mark_text(color="white").encode(
    x="job_title:N",
    y="rpred:N",
    text="count()"
)

(c+c_text).properties(
    height=400,
    width=400,
    title = 'RFC prediction confusion matrix'
)
alt.data_transformers.enable('default', max_rows=15000)

c = alt.Chart(df).mark_rect().encode(
    x="job_title:N",
    y="kpred:N",
    color=alt.Color("count()", scale=alt.Scale(scheme="tableau20"))
)

c_text = alt.Chart(df).mark_text(color="white").encode(
    x="job_title:N",
    y="kpred:N",
    text="count()"
)

(c+c_text).properties(
    height=400,
    width=400,
    title = 'KNN prediction confusion matrix'
)

It can also be seen that many data scientists were erroneously predicted as data engineers; with so many of the occupations being "data" related, in retrospect the model has done quite well. Possible alterations might include grouping related data occupations together, as sketched below. Another possible change is to use only the US entries, since the distribution of nations seems rather skewed.
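A hypothetical sketch of the grouping idea, collapsing titles into coarser buckets before training (the particular mapping here is invented for illustration):

# merge similar titles into broader groups (sketch)
title_groups = {
    'Data Scientist': 'Science',
    'Principal Data Scientist': 'Science',
    'Data Engineer': 'Engineering',
    'Machine Learning Engineer': 'Engineering',
    'ML Engineer': 'Engineering',
    'Business Data Analyst': 'Analysis',
}
df['job_group'] = df['job_title'].map(title_groups).fillna('Other')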

Extra test: Only US Data#

In an attempt to improve the accuracy, here are the same models trained only on U.S. companies.

df = pd.read_csv('ds_salaries.csv')
df_us = df[df['company_location'] == 'US'].copy()  # .copy() avoids SettingWithCopyWarning below
df_us['isFulltime_bool'] = (df_us['employment_type'] == 'FT')
def make_onehot_us(c_name):
    """
    redefined for df_us
    Input: column name
    Output: onehot matrix of corresponding column
    """
    le = LabelEncoder()
    oh = OneHotEncoder()
    let = le.fit_transform(df_us[c_name])
    values = len(df_us[c_name].unique())
    oht = oh.fit_transform(let.reshape(-1,1)).toarray()
    df_ = pd.DataFrame(oht, columns=le.inverse_transform(range(values)))
    # refer to source 3 for the labeling
    return df_
features = ['salary', 'isFulltime_bool']
one_hots = ['experience_level', 'company_size']
df_feats = df_us[features]  # redefining df_feats for df_us
for col in one_hots:
    df_feats = pd.concat([df_feats.reset_index(drop=True), make_onehot_us(col).reset_index(drop=True)], axis=1)
rfc = RandomForestClassifier(n_estimators=140, max_leaf_nodes=180)
X_train, X_test, y_train, y_test = train_test_split(df_feats, df_us['job_title'])
rfc.fit(X_train, y_train)
rfc.score(X_test, y_test)
0.41842105263157897

It would appear that the hypothesis that the skew toward U.S. data was lowering the accuracy is false, and that company location was indeed useful for predicting job titles. Note, though, that this run also drops salary_in_usd and the location one-hots and trains on fewer rows, so the comparison is not perfectly controlled.

References#


Dataset: https://www.kaggle.com/datasets/arnabchaki/data-science-salaries-2023

  1. Pandas concat without increasing the number of rows: https://stackoverflow.com/questions/50368145/pandas-concat-increases-number-of-rows

  2. One-hot matrix mapping: https://stackoverflow.com/questions/38978853/onehotencoding-mapping

  3. Inverting the labels of a one-hot matrix: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

