Credit Card Approval Prediction

Author: Michael Choi

Course Project, UC Irvine, Math 10, W22

Introduction

I will be analyzing a dataset of credit card approval records. I want to see which professions, ages, and education statuses get approved, and at what odds they do.

Main portion of the project

import numpy as np
import pandas as pd
import seaborn as sns
import altair as alt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss

After importing libraries, we will load two datasets. The first, app, lists demographics of credit users such as their age, profession, and income. The second, cred, contains credit records and payment history.

app = pd.read_csv('application_record.csv', header=None)
cred = pd.read_csv('credit_record.csv', header=None)
cred.head()
0 1 2
0 ID MONTHS_BALANCE STATUS
1 5001711 0 X
2 5001711 -1 0
3 5001711 -2 0
4 5001711 -3 0

Notice how the dataset has the column names we want as row 0. We will fix this by first renaming the columns and then deleting the first row. Finally, we will reindex the dataset with df.reset_index(). Note that these methods are part of the pandas library.

More information about these methods can be found in the references section.

app = app.rename(columns = app.iloc[0,:]).drop([0]).reset_index(drop = True)
cred = cred.rename(columns = cred.iloc[0,:]).drop([0]).reset_index(drop = True)
cred.head()
ID MONTHS_BALANCE STATUS
0 5001711 0 X
1 5001711 -1 0
2 5001711 -2 0
3 5001711 -3 0
4 5001712 0 C
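As an aside, this manual header fix could have been avoided by letting pandas parse the header row itself, since header=0 is the read_csv default; a minimal sketch using the same files:

# Reading with the default header parses the first row as column names directly
app = pd.read_csv('application_record.csv')
cred = pd.read_csv('credit_record.csv')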

Since there is no column that states whether they got approved or not, I made my own based on who had an outstanding balance on their credit record. Months are listed as negative numbers because they count backwards from the present (0 is the current month, -1 is last month, etc.).

X means they didn't have a loan that month, and C means they paid it off. Everything else means they carried a balance, which I will record in the “Balance” column I am creating.

cred['Balance'] = cred['STATUS'].map(lambda x: 'N' if (x == 'C') or (x == 'X') else 'Y')
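For reference, an equivalent vectorized form using numpy (already imported as np):

# 'N' when the month had no loan (X) or was paid off (C); 'Y' otherwise
cred['Balance'] = np.where(cred['STATUS'].isin(['C', 'X']), 'N', 'Y')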

I noticed, using pd.Series.value_counts(), that some IDs in the two datasets did not match; i.e., there were some ID numbers in the app dataset not in the cred dataset and vice versa.

Since I needed every applicant's demographics to create a prediction and their credit record to determine whether they are approved, I used pd.DataFrame.isin() to keep only the IDs present in both datasets.

cred['ID'].value_counts()
5022997    61
5117883    61
5146385    61
5091644    61
5118386    61
           ..
5079234     1
5079233     1
5002713     1
5135310     1
5149281     1
Name: ID, Length: 45986, dtype: int64
app['ID'].value_counts()
7137299    2
7702238    2
7282535    2
7243768    2
7050948    2
          ..
5690727    1
6621262    1
6621261    1
6621260    1
6842885    1
Name: ID, Length: 438510, dtype: int64
# Keep only the IDs present in both datasets
app = app[app["ID"].isin(cred["ID"])]
cred = cred[cred["ID"].isin(app["ID"])]
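An inner merge on ID is another way to express the same intersection; a sketch, not used in the project (the name both is hypothetical):

# Keep only application rows whose ID also appears in the credit records
both = app.merge(cred[['ID']].drop_duplicates(), on='ID', how='inner')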

For a bit more cleanup, I used the pandas method pd.to_numeric() on the IDs because I wanted to sort both datasets by increasing ID number with sort_values(). Reset the indices again.

app['ID'] = pd.to_numeric(app['ID'])
cred['ID'] = pd.to_numeric(cred['ID'])
app = app.sort_values(by = ['ID']).reset_index(drop = True)
cred = cred.sort_values(by = ['ID']).reset_index(drop = True)
app.keys()
Index(['ID', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN',
       'AMT_INCOME_TOTAL', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
       'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'DAYS_BIRTH',
       'DAYS_EMPLOYED', 'FLAG_MOBIL', 'FLAG_WORK_PHONE', 'FLAG_PHONE',
       'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS'],
      dtype='object')

For later on in the project, we will change more columns to numeric. In addition, for birth days, I decided to divide by 365 to get each applicant's age in years instead of days. I multiplied by -1 since the column is listed as negative numbers, like the months column described earlier. The same goes for the 'DAYS_EMPLOYED' column.

app['AMT_INCOME_TOTAL'] = pd.to_numeric(app['AMT_INCOME_TOTAL'])
app['DAYS_BIRTH'] = -pd.to_numeric(app['DAYS_BIRTH'])/365
app['DAYS_EMPLOYED'] = -pd.to_numeric(app['DAYS_EMPLOYED'])/365

This part was the trickiest and longest part of my project process. Notice below how there are multiple rows with the same ID, one for each month, and each month can have a different value in the ‘Balance’ column.

cred
ID MONTHS_BALANCE STATUS Balance
0 5008804 0 C N
1 5008804 -15 X N
2 5008804 -14 0 Y
3 5008804 -13 1 Y
4 5008804 -12 C N
... ... ... ... ...
359769 5150479 -3 X N
359770 5150479 -4 X N
359771 5150479 -5 X N
359772 5150479 -6 X N
359773 5150479 -8 X N

359774 rows × 4 columns

I needed to find a way to count how many ‘Y’ rows each ID had. This tells me how many months each applicant, identified by their ID number, carried a balance on their credit record; that is, paid late.

After trying many for loops, list comprehensions, and new datasets, I decided to use the groupby method we learned in week 10 of class. This will also be part of my “Extra” section.

a = cred.groupby("ID")['Balance'].value_counts().sort_index(ascending=True)
print(a)
ID       Balance
5008804  N          14
         Y           2
5008805  N          13
         Y           2
5008806  N          23
                    ..
5150466  Y           2
5150467  N           1
         Y           6
5150468  Y           6
5150479  N           9
Name: Balance, Length: 28889, dtype: int64

At first, I tried using a for x, y loop as in class; however, I had issues breaking out of the loop, and the datatype of y was a MultiIndex, which was confusing to index into. Instead, I created lists for the IDs, the labels (Y or N), and the values themselves (months with or without a balance).

ids = a.index.get_level_values('ID')                  # first index level: the applicant ID
values = a.values                                     # the month counts themselves
labels = a.index.get_level_values('Balance') == 'Y'   # True where the count is for 'Y'
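As an aside, the balance-month count per ID (which is what Denies ultimately needs) can be computed without touching the MultiIndex at all; a minimal sketch:

# Count, per ID, how many months carried a balance ('Y')
months_with_balance = (cred['Balance'] == 'Y').groupby(cred['ID']).sum()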

I created a new dataframe, Breakdown, holding the counts grouped by ID. From it, I created another dataframe called Denies, containing only the rows that carry a balance, so I can count the number of months each ID carried a credit card balance.

Breakdown = pd.DataFrame({'ID': ids, 'Labels':labels, 'Months': values})
Breakdown = Breakdown.sort_values(by = ['ID'])
Denies = Breakdown[Breakdown["Labels"] == True].reset_index(drop = True)
Denies.head()
ID Labels Months
0 5008804 True 2
1 5008805 True 2
2 5008806 True 7
3 5008808 True 2
4 5008810 True 6
print(f"The number of months with an outstanding balance for the first ID is {Breakdown['Months'][0]}")
The number of months with an outstanding balance for the first ID is 14

Now we will create a new column in Denies that denies applicants if they carried a balance for more than 8 months. We will add all of these IDs to a list called denies.

Denies['Approval'] = Denies['Months'].map(lambda x: 'N' if x > 8 else 'Y')  # map over Denies, not Breakdown, so rows line up after reset_index
denies = list(Denies[Denies['Approval'] == 'N']['ID'].values)

From here, we will create a new column in the app dataset: applicants whose IDs are in the denies list are denied, and the rest are marked approved for now.

app['Approval'] = app['ID'].map(lambda x: 'N' if denies.count(x) else 'Y')
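Since list.count(x) scans the entire denies list for every applicant, converting it to a set makes each membership test constant-time; an equivalent sketch (denied_ids is a hypothetical name):

# A set gives O(1) membership tests instead of a linear scan per applicant
denied_ids = set(denies)
app['Approval'] = app['ID'].map(lambda x: 'N' if x in denied_ids else 'Y')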

We will also deny applicants with an income less than 40000.

# A vectorized assignment avoids the SettingWithCopyWarning that a
# row-by-row chained assignment (app['Approval'][i] = ...) triggers
app.loc[app['AMT_INCOME_TOTAL'] < 40000, 'Approval'] = 'N'
app
ID CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE DAYS_BIRTH DAYS_EMPLOYED FLAG_MOBIL FLAG_WORK_PHONE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS Approval
0 5008804 M Y Y 0 427500.0 Working Higher education Civil marriage Rented apartment 32.890411 12.443836 1 1 0 0 NaN 2.0 N
1 5008805 M Y Y 0 427500.0 Working Higher education Civil marriage Rented apartment 32.890411 12.443836 1 1 0 0 NaN 2.0 Y
2 5008806 M Y Y 0 112500.0 Working Secondary / secondary special Married House / apartment 58.832877 3.106849 1 0 0 0 Security staff 2.0 N
3 5008808 F N Y 0 270000.0 Commercial associate Secondary / secondary special Single / not married House / apartment 52.356164 8.358904 1 0 1 1 Sales staff 1.0 Y
4 5008809 F N Y 0 270000.0 Commercial associate Secondary / secondary special Single / not married House / apartment 52.356164 8.358904 1 0 1 1 Sales staff 1.0 Y
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
16612 5150465 F Y Y 1 112500.0 Working Secondary / secondary special Married House / apartment 38.997260 0.846575 1 0 0 0 Core staff 3.0 N
16613 5150466 F Y Y 1 112500.0 Working Secondary / secondary special Married House / apartment 38.997260 0.846575 1 0 0 0 Core staff 3.0 N
16614 5150467 F N Y 0 180000.0 Working Secondary / secondary special Married House / apartment 48.528767 6.687671 1 0 0 0 Cooking staff 2.0 N
16615 5150468 F N Y 0 180000.0 Working Secondary / secondary special Married House / apartment 48.528767 6.687671 1 0 0 0 Cooking staff 2.0 N
16616 5150479 F N N 0 90000.0 Working Higher education Married With parents 34.010959 0.350685 1 0 0 0 NaN 2.0 Y

16617 rows × 19 columns

Now, on to classification and graphing. We will use the following columns to predict whether applicants get approved.

X = app[['AMT_INCOME_TOTAL','DAYS_EMPLOYED','DAYS_BIRTH']]
y = app['Approval']

We will scale the data so that each feature contributes comparably to the distance metric KNN uses.

scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
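Equivalently, the fit and transform steps can be combined into one call:

# One-line equivalent of the fit/transform pair above
X_scaled = StandardScaler().fit_transform(X)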

I want to determine which value of k for KNeighborsClassifier() gives the lowest log_loss value. This is original code. Start by making a function loss(k). Note that the loss is computed on the same data the model was fit on, so it measures in-sample fit.

def loss(k):
    # Fit a k-nearest-neighbors classifier and return its in-sample log loss
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_scaled, y)
    return log_loss(y, clf.predict_proba(X_scaled))

Create a pandas DataFrame with two columns: one with the k-value and the other with the associated log_loss value from the function we created above. This will take a while to run, 30 to 50 seconds. Afterwards, graph this DataFrame.

log = pd.DataFrame({'k-value': list(range(1, 100))})
log['loss'] = [loss(x) for x in log['k-value']]
log
k-value loss
0 1 11.213619
1 2 4.439504
2 3 2.204020
3 4 1.357061
4 5 0.901659
... ... ...
94 95 0.679062
95 96 0.678998
96 97 0.679091
97 98 0.678887
98 99 0.678756

99 rows × 2 columns

alt.Chart(log).mark_circle().encode(
    x = 'k-value',
    y = 'loss',
    tooltip = ['k-value','loss']
)

Close examination of the chart with the tooltip shows that k = 10 gives the lowest log_loss value.
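The minimum can also be read off programmatically; a one-line sketch (best_k is a hypothetical name):

# k-value at the row where the in-sample loss is smallest
best_k = log.loc[log['loss'].idxmin(), 'k-value']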

clf = KNeighborsClassifier(n_neighbors=10)
clf.fit(X_scaled,y)
KNeighborsClassifier(n_neighbors=10)

It is important to split the data into a training set and a test set. This allows us to train the model and then evaluate it on values it has not seen. We can also look for signs of overfitting, which occurs when the model follows the training set too closely: the test score will be much lower than the training score.

X_train, X_test, y_train, y_test = train_test_split(X_scaled,y, test_size = .3)
(clf.score(X_train,y_train),clf.score(X_test,y_test))
(0.6614220617315794, 0.6430004011231448)

Since the training and test set scores are roughly the same, we do not see signs of overfitting.
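One caveat: clf above was fit on all of X_scaled before the split, so the test score is not a true held-out estimate. The more standard order is to split first and fit only on the training portion; a minimal sketch (clf2 and random_state are additions for illustration and reproducibility):

# Split first, then fit on the training set only
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=0)
clf2 = KNeighborsClassifier(n_neighbors=10)
clf2.fit(X_train, y_train)
print(clf2.score(X_train, y_train), clf2.score(X_test, y_test))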

log_loss(y,clf.predict_proba(X_scaled))
0.6154558798665394

Now we will graph the dataset. However, the income range contains many outliers, so we will create a new dataset, called graph, containing only incomes below 600000.

alt.data_transformers.disable_max_rows()
alt.Chart(app).mark_circle().encode(
    alt.X('DAYS_BIRTH', scale=alt.Scale(zero=False)),
    alt.Y('AMT_INCOME_TOTAL',scale=alt.Scale(zero=False)),
    color = 'Approval',
    tooltip = ['DAYS_BIRTH','AMT_INCOME_TOTAL']
).properties(
    title = "Age vs Income"
)
graph = app[app['AMT_INCOME_TOTAL'] < 600000]
alt.data_transformers.disable_max_rows()
c1 = alt.Chart(graph).mark_circle().encode(
    alt.X('DAYS_BIRTH',scale=alt.Scale(zero= False)),
    alt.Y('AMT_INCOME_TOTAL',scale=alt.Scale(domain=(0,600000))),
    color = 'Approval',
    tooltip = ['DAYS_BIRTH','AMT_INCOME_TOTAL']
).properties(
    title = "Age vs Income"
)
c1

We will make a second graph, colored by the predicted approval, to compare.

app["pred"] = clf.predict(X_scaled)
graph = app[app['AMT_INCOME_TOTAL'] < 600000]
app
ID CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE DAYS_BIRTH DAYS_EMPLOYED FLAG_MOBIL FLAG_WORK_PHONE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS Approval pred
0 5008804 M Y Y 0 427500.0 Working Higher education Civil marriage Rented apartment 32.890411 12.443836 1 1 0 0 NaN 2.0 N Y
1 5008805 M Y Y 0 427500.0 Working Higher education Civil marriage Rented apartment 32.890411 12.443836 1 1 0 0 NaN 2.0 Y Y
2 5008806 M Y Y 0 112500.0 Working Secondary / secondary special Married House / apartment 58.832877 3.106849 1 0 0 0 Security staff 2.0 N Y
3 5008808 F N Y 0 270000.0 Commercial associate Secondary / secondary special Single / not married House / apartment 52.356164 8.358904 1 0 1 1 Sales staff 1.0 Y Y
4 5008809 F N Y 0 270000.0 Commercial associate Secondary / secondary special Single / not married House / apartment 52.356164 8.358904 1 0 1 1 Sales staff 1.0 Y Y
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
16612 5150465 F Y Y 1 112500.0 Working Secondary / secondary special Married House / apartment 38.997260 0.846575 1 0 0 0 Core staff 3.0 N N
16613 5150466 F Y Y 1 112500.0 Working Secondary / secondary special Married House / apartment 38.997260 0.846575 1 0 0 0 Core staff 3.0 N N
16614 5150467 F N Y 0 180000.0 Working Secondary / secondary special Married House / apartment 48.528767 6.687671 1 0 0 0 Cooking staff 2.0 N N
16615 5150468 F N Y 0 180000.0 Working Secondary / secondary special Married House / apartment 48.528767 6.687671 1 0 0 0 Cooking staff 2.0 N N
16616 5150479 F N N 0 90000.0 Working Higher education Married With parents 34.010959 0.350685 1 0 0 0 NaN 2.0 Y N

16617 rows × 20 columns

alt.data_transformers.disable_max_rows()
c2 = alt.Chart(graph).mark_circle().encode(
    alt.X('DAYS_BIRTH',scale=alt.Scale(zero= False)),
    alt.Y('AMT_INCOME_TOTAL',scale=alt.Scale(domain=(0,600000))),
    color = 'pred',
    tooltip = ['DAYS_BIRTH','AMT_INCOME_TOTAL']
).properties(
    title = "Age vs Income"
)
c1
c2

For fun, here is an additional graph with more interactive elements, such as a selection linked to a bar chart. I adapted some of this code from altair-viz.github.io, which is listed in the references section.

interval = alt.selection_interval()
alt.data_transformers.disable_max_rows()
c3 = alt.Chart(graph).mark_circle().encode(
    alt.X('DAYS_BIRTH',scale=alt.Scale(zero=False)),
    alt.Y('AMT_INCOME_TOTAL',scale=alt.Scale(domain=(0,600000))),
    color=alt.condition(interval, 'Approval', alt.value('lightgray')),
    tooltip = ['AMT_INCOME_TOTAL','DAYS_BIRTH']
).properties(
    title = "Income vs Work Experience"
).add_selection(
    interval
)

c4 = alt.Chart(graph).mark_bar().encode(
    x ='Approval',
    y = alt.Y('count()'),
    color='Approval'
).transform_filter(
    interval
)
c3|c4

Summary

We were able to create a KNeighborsClassifier() model to predict whether or not an applicant would get approved based on their income, years employed, and age.

The most difficult and time-consuming part of this project was the beginning, where I had to deal with two different datasets connected only by an ID number. There were many mismatches in dataset lengths, and I had to use several methods from outside of class.

Overall, I really enjoyed this project. I love credit cards, and it was fun determining which applicants got approved based on criteria that I came up with.

References


Dataset origin: https://www.kaggle.com/ginaerian/credit-card-approval-prediction-gina

Deleting rows from a dataframe with pandas: https://www.shanelynn.ie/pandas-drop-delete-dataframe-rows-columns/

Resetting the index of a dataframe with pandas: https://www.machinelearningplus.com/pandas/pandas-reset-index/

Sorting a dataset by a specific column with pandas: https://pandas.pydata.org/docs/reference/api/pandas.Series.sort_values.html

Altair interactivity tutorial: https://altair-viz.github.io/altair-tutorial/notebooks/06-Selections.html
