Credit Card Approval Prediction¶
Author: Michael Choi
Course Project, UC Irvine, Math 10, W22
Introduction¶
I will be analyzing a dataset of credit card applications and approval outcomes. I want to see which professions, ages, and education statuses get approved, and at what odds.
Main portion of the project¶
import numpy as np
import pandas as pd
import seaborn as sns
import altair as alt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss
After importing libraries, we load two datasets. The first, app, contains demographics of credit users such as their age, profession, and income. The second, cred, contains credit records and payment history.
app = pd.read_csv('application_record.csv', header=None)
cred = pd.read_csv('credit_record.csv', header=None)
/shared-libs/python3.7/py-core/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3524: DtypeWarning: Columns (0,4,5,10,11,12,13,14,15,17) have mixed types.Specify dtype option on import or set low_memory=False.
exec(code_obj, self.user_global_ns, self.user_ns)
/shared-libs/python3.7/py-core/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3524: DtypeWarning: Columns (0,1) have mixed types.Specify dtype option on import or set low_memory=False.
exec(code_obj, self.user_global_ns, self.user_ns)
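The DtypeWarning above is harmless here, since every column gets cleaned up below. If desired, it can be silenced by forcing a single dtype at read time; a minimal sketch, assuming the same file names:
# Reading everything as strings stops pandas from guessing mixed dtypes chunk by chunk
app = pd.read_csv('application_record.csv', header=None, dtype=str)
cred = pd.read_csv('credit_record.csv', header=None, dtype=str)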
cred.head()
0 | 1 | 2 | |
---|---|---|---|
0 | ID | MONTHS_BALANCE | STATUS |
1 | 5001711 | 0 | X |
2 | 5001711 | -1 | 0 |
3 | 5001711 | -2 | 0 |
4 | 5001711 | -3 | 0 |
Notice how the dataset has the column names we want as row 0. We will fix this by first renaming the columns and then deleting the first row. Finally, we reindex the dataset with df.reset_index(). Note that these methods are all part of the pandas library.
More information about them can be found in the References section.
# Promote row 0 to column names, drop that row, and renumber the index from 0
app = app.rename(columns=app.iloc[0, :]).drop([0]).reset_index(drop=True)
cred = cred.rename(columns=cred.iloc[0, :]).drop([0]).reset_index(drop=True)
cred.head()
ID | MONTHS_BALANCE | STATUS | |
---|---|---|---|
0 | 5001711 | 0 | X |
1 | 5001711 | -1 | 0 |
2 | 5001711 | -2 | 0 |
3 | 5001711 | -3 | 0 |
4 | 5001712 | 0 | C |
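As an aside, the rename-and-drop step could be avoided entirely by letting read_csv infer the header row, which is its default behavior; a sketch, assuming the same files (the notebook keeps the manual approach above):
app = pd.read_csv('application_record.csv')   # first row becomes the column names
cred = pd.read_csv('credit_record.csv')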
Since there is no column that states whether an applicant was approved, I made my own, based on who had an outstanding balance on their credit record. Months are listed as negative numbers because they count backwards from the present (0 is the current month, -1 is last month, and so on).
In the STATUS column, X means the applicant had no loan that month and C means they paid it off. Any other value means they carried a balance, which I will record in the "Balance" column I am creating.
cred['Balance'] = cred['STATUS'].map(lambda x: 'N' if (x == 'C') or (x == 'X') else 'Y')
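A quick sanity check on the new column (illustrative; the exact counts depend on the data):
cred['Balance'].value_counts()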
Using pd.Series.value_counts(), I noticed that some IDs in the two datasets did not match, i.e., there were ID numbers in the app dataset that were not in the cred dataset and vice versa.
Since I needed both every applicant's demographics (to make a prediction) and their credit record (to determine approval), I used pd.DataFrame.isin() to keep only the IDs present in both datasets.
cred['ID'].value_counts()
5022997 61
5117883 61
5146385 61
5091644 61
5118386 61
..
5079234 1
5079233 1
5002713 1
5135310 1
5149281 1
Name: ID, Length: 45986, dtype: int64
app['ID'].value_counts()
7137299 2
7702238 2
7282535 2
7243768 2
7050948 2
..
5690727 1
6621262 1
6621261 1
6621260 1
6842885 1
Name: ID, Length: 438510, dtype: int64
app = app[app["ID"].isin(cred["ID"])]
cred = cred[cred["ID"].isin(app["ID"])]
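As a check, the two tables should now cover exactly the same set of applicants:
assert set(app['ID']) == set(cred['ID'])  # both filters applied, so the ID sets agree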
For a bit more cleanup, I used the pandas method pd.to_numeric() on the IDs, since I wanted to sort both datasets by increasing ID number. Then we reset the indices again.
app['ID'] = pd.to_numeric(app['ID'])
cred['ID'] = pd.to_numeric(cred['ID'])
app = app.sort_values(by = ['ID']).reset_index(drop = True)
cred = cred.sort_values(by = ['ID']).reset_index(drop = True)
app.keys()
Index(['ID', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN',
'AMT_INCOME_TOTAL', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'DAYS_BIRTH',
'DAYS_EMPLOYED', 'FLAG_MOBIL', 'FLAG_WORK_PHONE', 'FLAG_PHONE',
'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS'],
dtype='object')
Later in the project we will need more numeric columns, so we convert them now. For DAYS_BIRTH, I divided by 365 to get each applicant's age in years instead of days, and multiplied by -1 since the column is stored as negative numbers (like the months column described earlier). The same goes for the DAYS_EMPLOYED column.
app['AMT_INCOME_TOTAL'] = pd.to_numeric(app['AMT_INCOME_TOTAL'])
app['DAYS_BIRTH'] = -pd.to_numeric(app['DAYS_BIRTH'])/365
app['DAYS_EMPLOYED'] = -pd.to_numeric(app['DAYS_EMPLOYED'])/365
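A quick check that the conversion produced plausible values (illustrative):
app['DAYS_BIRTH'].describe()  # ages in years; should land roughly between 20 and 70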
This part was the trickiest and longest part of my project. Notice below how there are multiple rows with the same ID, one per month, and each month can have a different value in the 'Balance' column.
cred
ID | MONTHS_BALANCE | STATUS | Balance | |
---|---|---|---|---|
0 | 5008804 | 0 | C | N |
1 | 5008804 | -15 | X | N |
2 | 5008804 | -14 | 0 | Y |
3 | 5008804 | -13 | 1 | Y |
4 | 5008804 | -12 | C | N |
... | ... | ... | ... | ... |
359769 | 5150479 | -3 | X | N |
359770 | 5150479 | -4 | X | N |
359771 | 5150479 | -5 | X | N |
359772 | 5150479 | -6 | X | N |
359773 | 5150479 | -8 | X | N |
359774 rows × 4 columns
I needed to find a way to count how many 'Y' rows each ID had. This tells me how many months each applicant, identified by their ID number, carried a balance on their credit record; that is, how many months they paid late.
After trying many for loops, list comprehensions, and new datasets, I decided to use the groupby method we learned in week 10 of class. This will also be part of my "Extra" section.
a = cred.groupby("ID")['Balance'].value_counts().sort_index(ascending=True)
print(a)
ID Balance
5008804 N 14
Y 2
5008805 N 13
Y 2
5008806 N 23
..
5150466 Y 2
5150467 N 1
Y 6
5150468 Y 6
5150479 N 9
Name: Balance, Length: 28889, dtype: int64
At first, I tried a `for x, y` loop as used in class; however, I had issues breaking out of the loop, and y carried a MultiIndex that was confusing to index into. Instead, I built separate lists for the IDs, the labels (True if 'Y'), and the values themselves (months with or without a balance).
ids = a.keys().get_level_values('ID')
values = a.values
labels = [(a.keys()[i][1] == 'Y') for i in range(len(a))]
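As an aside, pandas can flatten the grouped counts into a dataframe in one step; a sketch (Breakdown2 is an illustrative name, and Labels is rebuilt as booleans to match the version constructed below):
Breakdown2 = a.reset_index(name='Months')        # MultiIndex levels become ID and Balance columns
Breakdown2['Labels'] = Breakdown2['Balance'] == 'Y'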
I created a new dataframe, Breakdown, with the counted values grouped by ID. From it, I created another dataframe called Denies, containing only the rows that carry a balance, so I can count the number of months each ID carried a credit card balance.
Breakdown = pd.DataFrame({'ID': ids, 'Labels':labels, 'Months': values})
Breakdown = Breakdown.sort_values(by = ['ID'])
Denies = Breakdown[Breakdown["Labels"] == True].reset_index(drop = True)
Denies.head()
ID | Labels | Months | |
---|---|---|---|
0 | 5008804 | True | 2 |
1 | 5008805 | True | 2 |
2 | 5008806 | True | 7 |
3 | 5008808 | True | 2 |
4 | 5008810 | True | 6 |
print(f"The number of months with an outstanding balance for the first ID is {Breakdown['Months'][0]}")
The number of months with an outstanding balance for the first ID is 14
Now we will create a new column in Denies that denies applicants who carried a balance for more than 8 months. We will add all of these IDs to a list called denies.
Denies['Approval'] = Denies['Months'].map(lambda x: 'N' if x > 8 else 'Y')
denies = list(Denies[Denies['Approval'] == 'N']['ID'].values)
From here, we will create a new column in the app dataset: applicants whose IDs appear in the denies list are denied, and the rest are given approval status for now.
app['Approval'] = app['ID'].map(lambda x: 'N' if x in denies else 'Y')
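An equivalent vectorized form, which avoids a Python-level membership test per row (a sketch, using the numpy import from above):
app['Approval'] = np.where(app['ID'].isin(denies), 'N', 'Y')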
We will also deny applicants with an income of less than 40000. A row-by-row loop with chained indexing (app['Approval'][i] = 'N') triggers pandas' SettingWithCopyWarning and may not reliably write the values, so a single .loc assignment does the job instead:
app.loc[app['AMT_INCOME_TOTAL'] < 40000, 'Approval'] = 'N'
app
ID | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | DAYS_BIRTH | DAYS_EMPLOYED | FLAG_MOBIL | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | Approval | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5008804 | M | Y | Y | 0 | 427500.0 | Working | Higher education | Civil marriage | Rented apartment | 32.890411 | 12.443836 | 1 | 1 | 0 | 0 | NaN | 2.0 | N |
1 | 5008805 | M | Y | Y | 0 | 427500.0 | Working | Higher education | Civil marriage | Rented apartment | 32.890411 | 12.443836 | 1 | 1 | 0 | 0 | NaN | 2.0 | Y |
2 | 5008806 | M | Y | Y | 0 | 112500.0 | Working | Secondary / secondary special | Married | House / apartment | 58.832877 | 3.106849 | 1 | 0 | 0 | 0 | Security staff | 2.0 | N |
3 | 5008808 | F | N | Y | 0 | 270000.0 | Commercial associate | Secondary / secondary special | Single / not married | House / apartment | 52.356164 | 8.358904 | 1 | 0 | 1 | 1 | Sales staff | 1.0 | Y |
4 | 5008809 | F | N | Y | 0 | 270000.0 | Commercial associate | Secondary / secondary special | Single / not married | House / apartment | 52.356164 | 8.358904 | 1 | 0 | 1 | 1 | Sales staff | 1.0 | Y |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
16612 | 5150465 | F | Y | Y | 1 | 112500.0 | Working | Secondary / secondary special | Married | House / apartment | 38.997260 | 0.846575 | 1 | 0 | 0 | 0 | Core staff | 3.0 | N |
16613 | 5150466 | F | Y | Y | 1 | 112500.0 | Working | Secondary / secondary special | Married | House / apartment | 38.997260 | 0.846575 | 1 | 0 | 0 | 0 | Core staff | 3.0 | N |
16614 | 5150467 | F | N | Y | 0 | 180000.0 | Working | Secondary / secondary special | Married | House / apartment | 48.528767 | 6.687671 | 1 | 0 | 0 | 0 | Cooking staff | 2.0 | N |
16615 | 5150468 | F | N | Y | 0 | 180000.0 | Working | Secondary / secondary special | Married | House / apartment | 48.528767 | 6.687671 | 1 | 0 | 0 | 0 | Cooking staff | 2.0 | N |
16616 | 5150479 | F | N | N | 0 | 90000.0 | Working | Higher education | Married | With parents | 34.010959 | 0.350685 | 1 | 0 | 0 | 0 | NaN | 2.0 | Y |
16617 rows × 19 columns
Now, on to classification and graphing. We will use the following columns to predict whether applicants get approved.
X = app[['AMT_INCOME_TOTAL','DAYS_EMPLOYED','DAYS_BIRTH']]
y = app['Approval']
We will scale the features so that each has mean 0 and unit variance:
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
I want to determine which value of k for KNeighborsClassifier() gives the lowest log_loss value. This is original code. Start by making a function loss(k).
def loss(k):
    # fit k-nearest-neighbors on the full scaled data and score it with log loss
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_scaled, y)
    return log_loss(y, clf.predict_proba(X_scaled))
Create a pandas DataFrame with two columns: the k-value, and the associated log_loss value from the function we created above. This takes a while to run, roughly 30 to 50 seconds. Afterwards, we graph the result.
log = pd.DataFrame({'k-value':list(range(1,100))})
log['loss'] = [loss(x) for x in log['k-value']]
log
k-value | loss | |
---|---|---|
0 | 1 | 11.213619 |
1 | 2 | 4.439504 |
2 | 3 | 2.204020 |
3 | 4 | 1.357061 |
4 | 5 | 0.901659 |
... | ... | ... |
94 | 95 | 0.679062 |
95 | 96 | 0.678998 |
96 | 97 | 0.679091 |
97 | 98 | 0.678887 |
98 | 99 | 0.678756 |
99 rows × 2 columns
alt.Chart(log).mark_circle().encode(
x = 'k-value',
y = 'loss',
tooltip = ['k-value','loss']
)
Close examination of the chart, using the tooltip, shows that k = 10 gives the lowest log_loss value.
clf = KNeighborsClassifier(n_neighbors=10)
clf.fit(X_scaled,y)
KNeighborsClassifier(n_neighbors=10)
It is important to split the data into a training set and a test set. This allows us to train the model and then evaluate it on values it has not seen, and to look for signs of overfitting, which occurs when the model follows the training set too closely: the test score will then be much lower than the training score. (One caveat: clf was fit on the full dataset above, so the test rows here were already seen during training; a stricter version is sketched after the scores.)
X_train, X_test, y_train, y_test = train_test_split(X_scaled,y, test_size = .3)
(clf.score(X_train,y_train),clf.score(X_test,y_test))
(0.6614220617315794, 0.6430004011231448)
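For a stricter evaluation, one would fit both the scaler and the classifier on the training split only, so that no information from the test set leaks into training. A sketch of that pattern, refitting from scratch with the same names:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
scaler = StandardScaler().fit(X_train)              # learn the scaling from training data only
clf = KNeighborsClassifier(n_neighbors=10)
clf.fit(scaler.transform(X_train), y_train)         # train on the scaled training data
clf.score(scaler.transform(X_test), y_test)         # evaluate on truly unseen data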
Since the training and test set scores are roughly the same, we do not see signs of overfitting.
log_loss(y,clf.predict_proba(X_scaled))
0.6154558798665394
Now we will graph the dataset. However, the income range contains many outliers, so after the first chart we will create a filtered dataset called graph containing only incomes below 600000.
alt.data_transformers.disable_max_rows()
alt.Chart(app).mark_circle().encode(
alt.X('DAYS_BIRTH', scale=alt.Scale(zero=False)),
alt.Y('AMT_INCOME_TOTAL',scale=alt.Scale(zero=False)),
color = 'Approval',
tooltip = ['DAYS_BIRTH','AMT_INCOME_TOTAL']
).properties(
title = "Age vs Income"
)
graph = app[app['AMT_INCOME_TOTAL'] < 600000]
alt.data_transformers.disable_max_rows()
c1 = alt.Chart(graph).mark_circle().encode(
alt.X('DAYS_BIRTH',scale=alt.Scale(zero= False)),
alt.Y('AMT_INCOME_TOTAL',scale=alt.Scale(domain=(0,600000))),
color = 'Approval',
tooltip = ['DAYS_BIRTH','AMT_INCOME_TOTAL']
).properties(
title = "Age vs Income"
)
c1
We will make a second graph with the predicted approval status, to compare.
app["pred"] = clf.predict(X_scaled)
graph = app[app['AMT_INCOME_TOTAL'] < 600000]
app
ID | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | DAYS_BIRTH | DAYS_EMPLOYED | FLAG_MOBIL | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | Approval | pred | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5008804 | M | Y | Y | 0 | 427500.0 | Working | Higher education | Civil marriage | Rented apartment | 32.890411 | 12.443836 | 1 | 1 | 0 | 0 | NaN | 2.0 | N | Y |
1 | 5008805 | M | Y | Y | 0 | 427500.0 | Working | Higher education | Civil marriage | Rented apartment | 32.890411 | 12.443836 | 1 | 1 | 0 | 0 | NaN | 2.0 | Y | Y |
2 | 5008806 | M | Y | Y | 0 | 112500.0 | Working | Secondary / secondary special | Married | House / apartment | 58.832877 | 3.106849 | 1 | 0 | 0 | 0 | Security staff | 2.0 | N | Y |
3 | 5008808 | F | N | Y | 0 | 270000.0 | Commercial associate | Secondary / secondary special | Single / not married | House / apartment | 52.356164 | 8.358904 | 1 | 0 | 1 | 1 | Sales staff | 1.0 | Y | Y |
4 | 5008809 | F | N | Y | 0 | 270000.0 | Commercial associate | Secondary / secondary special | Single / not married | House / apartment | 52.356164 | 8.358904 | 1 | 0 | 1 | 1 | Sales staff | 1.0 | Y | Y |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
16612 | 5150465 | F | Y | Y | 1 | 112500.0 | Working | Secondary / secondary special | Married | House / apartment | 38.997260 | 0.846575 | 1 | 0 | 0 | 0 | Core staff | 3.0 | N | N |
16613 | 5150466 | F | Y | Y | 1 | 112500.0 | Working | Secondary / secondary special | Married | House / apartment | 38.997260 | 0.846575 | 1 | 0 | 0 | 0 | Core staff | 3.0 | N | N |
16614 | 5150467 | F | N | Y | 0 | 180000.0 | Working | Secondary / secondary special | Married | House / apartment | 48.528767 | 6.687671 | 1 | 0 | 0 | 0 | Cooking staff | 2.0 | N | N |
16615 | 5150468 | F | N | Y | 0 | 180000.0 | Working | Secondary / secondary special | Married | House / apartment | 48.528767 | 6.687671 | 1 | 0 | 0 | 0 | Cooking staff | 2.0 | N | N |
16616 | 5150479 | F | N | N | 0 | 90000.0 | Working | Higher education | Married | With parents | 34.010959 | 0.350685 | 1 | 0 | 0 | 0 | NaN | 2.0 | Y | N |
16617 rows × 20 columns
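A quick way to quantify how often the prediction matches the label we constructed (this is the same quantity as clf.score on the full scaled data):
(app['Approval'] == app['pred']).mean()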
alt.data_transformers.disable_max_rows()
c2 = alt.Chart(graph).mark_circle().encode(
alt.X('DAYS_BIRTH',scale=alt.Scale(zero= False)),
alt.Y('AMT_INCOME_TOTAL',scale=alt.Scale(domain=(0,600000))),
color = 'pred',
tooltip = ['DAYS_BIRTH','AMT_INCOME_TOTAL']
).properties(
title = "Age vs Income"
)
c1
c2
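The two charts can also be displayed side by side with Altair's horizontal concatenation operator:
c1 | c2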
For fun, here is an additional graph with more interactive elements, such as an interval selection linked to a bar chart. I adapted some of this code from altair-viz.github.io, which is listed in the References section.
interval = alt.selection_interval()
alt.data_transformers.disable_max_rows()
c3 = alt.Chart(graph).mark_circle().encode(
alt.X('DAYS_BIRTH',scale=alt.Scale(zero=False)),
alt.Y('AMT_INCOME_TOTAL',scale=alt.Scale(domain=(0,600000))),
color=alt.condition(interval, 'Approval', alt.value('lightgray')),
tooltip = ['AMT_INCOME_TOTAL','DAYS_BIRTH']
).properties(
title = "Income vs Work Experience"
).add_selection(
interval
)
c4 = alt.Chart(graph).mark_bar().encode(
x ='Approval',
y = alt.Y('count()'),
color='Approval'
).transform_filter(
interval
)
c3|c4
Summary¶
We were able to create a KNeighborsClassifier() model to predict whether or not an applicant would get approved based on their income and age.
The most difficult and time-consuming part of this project was the beginning, where I had to deal with two different datasets connected only by an ID number. There were many mismatches in dataset lengths, and I had to use several methods from outside of class.
Overall, I really enjoyed this project. I love credit cards, and it was fun determining which applicants got approved based on criteria that I came up with.
References¶
Dataset origin: https://www.kaggle.com/ginaerian/credit-card-approval-prediction-gina
Delete rows from a dataframe with pandas: https://www.shanelynn.ie/pandas-drop-delete-dataframe-rows-columns/
Reset the index of a dataframe with pandas: https://www.machinelearningplus.com/pandas/pandas-reset-index/
Sort a dataset by a specific column with pandas: https://pandas.pydata.org/docs/reference/api/pandas.Series.sort_values.html
Altair interactivity tutorial: https://altair-viz.github.io/altair-tutorial/notebooks/06-Selections.html