Credit Card Approval Prediction

Author: Michael Choi

Course Project, UC Irvine, Math 10, W22

Introduction

I will be analyzing a dataset of credit card approval records. I want to see which professions, ages, and education statuses get approved, and at what odds they do.

Main portion of the project

import numpy as np
import pandas as pd
import seaborn as sns
import altair as alt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss

After importing libraries, we will load two datasets. The first, app, lists demographics of credit users such as their age, profession, and income. The second, cred, contains credit records and payment history.

app = pd.read_csv('application_record.csv', header=None)
cred = pd.read_csv('credit_record.csv', header=None)
cred.head()
0 1 2
0 ID MONTHS_BALANCE STATUS
1 5001711 0 X
2 5001711 -1 0
3 5001711 -2 0
4 5001711 -3 0

Notice how the dataset has the column names we want as row 0. We will fix this by first renaming the columns and then deleting the first row. Finally, we will reindex the dataset with df.reset_index(). Note that these methods are part of the pandas library.

More information about these methods can be found in the references section.

app = app.rename(columns = app.iloc[0,:]).drop([0]).reset_index(drop = True)
cred = cred.rename(columns = cred.iloc[0,:]).drop([0]).reset_index(drop = True)
cred.head()
ID MONTHS_BALANCE STATUS
0 5001711 0 X
1 5001711 -1 0
2 5001711 -2 0
3 5001711 -3 0
4 5001712 0 C
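As an aside, this manual header fix could have been avoided by letting pandas parse the header row itself, since header=0 is the read_csv default; a minimal sketch using the same files:

# Reading with the default header parses the first row as column names directly
app = pd.read_csv('application_record.csv')
cred = pd.read_csv('credit_record.csv')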

Since there is no column that states whether they got approved or not, I made my own based on who had an outstanding balance on their credit record. Months are listed as negative numbers because they count backwards from the present (0 is the current month, -1 is last month, etc.).

X means they didn't have a loan that month, and C means they paid it off. Everything else means they carried a balance, which I will record in the “Balance” column I am creating.

cred['Balance'] = cred['STATUS'].map(lambda x: 'N' if (x == 'C') or (x == 'X') else 'Y')
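For reference, an equivalent vectorized form using numpy (already imported as np):

# 'N' when the month had no loan (X) or was paid off (C); 'Y' otherwise
cred['Balance'] = np.where(cred['STATUS'].isin(['C', 'X']), 'N', 'Y')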

I noticed, using pd.Series.value_counts(), that some IDs in the two datasets did not match; i.e., there were some ID numbers in the app dataset not in the cred dataset and vice versa.

Since I needed every applicant's demographics to create a prediction and their credit record to determine whether they are approved, I used pd.DataFrame.isin() to keep only the IDs present in both datasets.

cred['ID'].value_counts()
5022997    61
5117883    61
5146385    61
5091644    61
5118386    61
           ..
5079234     1
5079233     1
5002713     1
5135310     1
5149281     1
Name: ID, Length: 45986, dtype: int64
app['ID'].value_counts()
7137299    2
7702238    2
7282535    2
7243768    2
7050948    2
          ..
5690727    1
6621262    1
6621261    1
6621260    1
6842885    1
Name: ID, Length: 438510, dtype: int64
# Keep only the IDs present in both datasets
app = app[app["ID"].isin(cred["ID"])]
cred = cred[cred["ID"].isin(app["ID"])]
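An inner merge on ID is another way to express the same intersection; a sketch, not used in the project (the name both is hypothetical):

# Keep only application rows whose ID also appears in the credit records
both = app.merge(cred[['ID']].drop_duplicates(), on='ID', how='inner')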

For a bit more cleanup, I used the pandas method pd.to_numeric() on the IDs because I wanted to sort both datasets by increasing ID number with sort_values(). Reset the indices again.

app['ID'] = pd.to_numeric(app['ID'])
cred['ID'] = pd.to_numeric(cred['ID'])
app = app.sort_values(by = ['ID']).reset_index(drop = True)
cred = cred.sort_values(by = ['ID']).reset_index(drop = True)
app.keys()
Index(['ID', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN',
       'AMT_INCOME_TOTAL', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
       'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'DAYS_BIRTH',
       'DAYS_EMPLOYED', 'FLAG_MOBIL', 'FLAG_WORK_PHONE', 'FLAG_PHONE',
       'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS'],
      dtype='object')

For later on in the project, we will change more columns to numeric. In addition, for birth days, I decided to divide by 365 to get each applicant's age in years instead of days. I multiplied by -1 since the column is listed as negative numbers, like the months column described earlier. The same goes for the 'DAYS_EMPLOYED' column.

app['AMT_INCOME_TOTAL'] = pd.to_numeric(app['AMT_INCOME_TOTAL'])
app['DAYS_BIRTH'] = -pd.to_numeric(app['DAYS_BIRTH'])/365
app['DAYS_EMPLOYED'] = -pd.to_numeric(app['DAYS_EMPLOYED'])/365

This part was the trickiest and longest part of my project process. Notice below how there are multiple rows with the same ID, one for each month, and each month can have a different value in the ‘Balance’ column.

cred
ID MONTHS_BALANCE STATUS Balance
0 5008804 0 C N
1 5008804 -15 X N
2 5008804 -14 0 Y
3 5008804 -13 1 Y
4 5008804 -12 C N
... ... ... ... ...
359769 5150479 -3 X N
359770 5150479 -4 X N
359771 5150479 -5 X N
359772 5150479 -6 X N
359773 5150479 -8 X N

359774 rows × 4 columns

I needed to find a way to count how many ‘Y’ rows each ID had. This tells me how many months each applicant, identified by their ID number, carried a balance on their credit record; that is, paid late.

After trying many for loops, list comprehensions, and new datasets, I decided to use the groupby method we learned in week 10 of class. This will also be part of my “Extra” section.

a = cred.groupby("ID")['Balance'].value_counts().sort_index(ascending=True)
print(a)
ID       Balance
5008804  N          14
         Y           2
5008805  N          13
         Y           2
5008806  N          23
                    ..
5150466  Y           2
5150467  N           1
         Y           6
5150468  Y           6
5150479  N           9
Name: Balance, Length: 28889, dtype: int64

At first, I tried using a for x, y loop as in class; however, I had issues breaking out of the loop, and the datatype of y was a MultiIndex, which was confusing to index into. Instead, I created lists for the IDs, the labels (Y or N), and the values themselves (months with or without a balance).

ids = a.index.get_level_values('ID')                  # first index level: the applicant ID
values = a.values                                     # the month counts themselves
labels = a.index.get_level_values('Balance') == 'Y'   # True where the count is for 'Y'
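As an aside, the balance-month count per ID (which is what Denies ultimately needs) can be computed without touching the MultiIndex at all; a minimal sketch:

# Count, per ID, how many months carried a balance ('Y')
months_with_balance = (cred['Balance'] == 'Y').groupby(cred['ID']).sum()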

I created a new dataframe, Breakdown, holding the counts grouped by ID. From it, I created another dataframe called Denies, containing only the rows that carry a balance, so I can count the number of months each ID carried a credit card balance.

Breakdown = pd.DataFrame({'ID': ids, 'Labels':labels, 'Months': values})
Breakdown = Breakdown.sort_values(by = ['ID'])
Denies = Breakdown[Breakdown["Labels"] == True].reset_index(drop = True)
Denies.head()
ID Labels Months
0 5008804 True 2
1 5008805 True 2
2 5008806 True 7
3 5008808 True 2
4 5008810 True 6
print(f"The number of months with an outstanding balance for the first ID is {Breakdown['Months'][0]}")
The number of months with an outstanding balance for the first ID is 14

Now we will create a new column in Denies that denies applicants if they carried a balance for more than 8 months. We will add all of these IDs to a list called denies.

Denies['Approval'] = Denies['Months'].map(lambda x: 'N' if x > 8 else 'Y')  # map over Denies, not Breakdown, so rows line up after reset_index
denies = list(Denies[Denies['Approval'] == 'N']['ID'].values)

From here, we will create a new column in the app dataset: applicants whose IDs are in the denies list are denied, and the rest are marked approved for now.

app['Approval'] = app['ID'].map(lambda x: 'N' if denies.count(x) else 'Y')
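Since list.count(x) scans the entire denies list for every applicant, converting it to a set makes each membership test constant-time; an equivalent sketch (denied_ids is a hypothetical name):

# A set gives O(1) membership tests instead of a linear scan per applicant
denied_ids = set(denies)
app['Approval'] = app['ID'].map(lambda x: 'N' if x in denied_ids else 'Y')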

We will also deny applicants with an income less than 40000.

# A vectorized assignment avoids the SettingWithCopyWarning that a
# row-by-row chained assignment (app['Approval'][i] = ...) triggers
app.loc[app['AMT_INCOME_TOTAL'] < 40000, 'Approval'] = 'N'
app
ID CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE DAYS_BIRTH DAYS_EMPLOYED FLAG_MOBIL FLAG_WORK_PHONE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS Approval
0 5008804 M Y Y 0 427500.0 Working Higher education Civil marriage Rented apartment 32.890411 12.443836 1 1 0 0 NaN 2.0 N
1 5008805 M Y Y 0 427500.0 Working Higher education Civil marriage Rented apartment 32.890411 12.443836 1 1 0 0 NaN 2.0 Y
2 5008806 M Y Y 0 112500.0 Working Secondary / secondary special Married House / apartment 58.832877 3.106849 1 0 0 0 Security staff 2.0 N
3 5008808 F N Y 0 270000.0 Commercial associate Secondary / secondary special Single / not married House / apartment 52.356164 8.358904 1 0 1 1 Sales staff 1.0 Y
4 5008809 F N Y 0 270000.0 Commercial associate Secondary / secondary special Single / not married House / apartment 52.356164 8.358904 1 0 1 1 Sales staff 1.0 Y
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
16612 5150465 F Y Y 1 112500.0 Working Secondary / secondary special Married House / apartment 38.997260 0.846575 1 0 0 0 Core staff 3.0 N
16613 5150466 F Y Y 1 112500.0 Working Secondary / secondary special Married House / apartment 38.997260 0.846575 1 0 0 0 Core staff 3.0 N
16614 5150467 F N Y 0 180000.0 Working Secondary / secondary special Married House / apartment 48.528767 6.687671 1 0 0 0 Cooking staff 2.0 N
16615 5150468 F N Y 0 180000.0 Working Secondary / secondary special Married House / apartment 48.528767 6.687671 1 0 0 0 Cooking staff 2.0 N
16616 5150479 F N N 0 90000.0 Working Higher education Married With parents 34.010959 0.350685 1 0 0 0 NaN 2.0 Y

16617 rows × 19 columns

Now, on to classification and graphing. We will use the following columns to predict whether applicants get approved.

X = app[['AMT_INCOME_TOTAL','DAYS_EMPLOYED','DAYS_BIRTH']]
y = app['Approval']

We will scale the data so that each feature contributes comparably to the distance metric KNN uses.

scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
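Equivalently, the fit and transform steps can be combined into one call:

# One-line equivalent of the fit/transform pair above
X_scaled = StandardScaler().fit_transform(X)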

I want to determine which value of k for KNeighborsClassifier() gives the lowest log_loss value. This is original code. Start by making a function loss(k). Note that the loss is computed on the same data the model was fit on, so it measures in-sample fit.

def loss(k):
    # Fit a k-nearest-neighbors classifier and return its in-sample log loss
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_scaled, y)
    return log_loss(y, clf.predict_proba(X_scaled))

Create a pandas DataFrame with two columns: one with the k-value and the other with the associated log_loss value from the function we created above. This will take a while to run, 30 to 50 seconds. Afterwards, graph this DataFrame.

log = pd.DataFrame({'k-value': list(range(1, 100))})
log['loss'] = [loss(x) for x in log['k-value']]
log
k-value loss
0 1 11.213619
1 2 4.439504
2 3 2.204020
3 4 1.357061
4 5 0.901659
... ... ...
94 95 0.679062
95 96 0.678998
96 97 0.679091
97 98 0.678887
98 99 0.678756

99 rows × 2 columns

alt.Chart(log).mark_circle().encode(
    x = 'k-value',
    y = 'loss',
    tooltip = ['k-value','loss']
)

Close examination of the chart with the tooltip shows that k = 10 gives the lowest log_loss value.
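The minimum can also be read off programmatically; a one-line sketch (best_k is a hypothetical name):

# k-value at the row where the in-sample loss is smallest
best_k = log.loc[log['loss'].idxmin(), 'k-value']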

clf = KNeighborsClassifier(n_neighbors=10)
clf.fit(X_scaled,y)
KNeighborsClassifier(n_neighbors=10)

It is important to split the data into a training set and a test set. This allows us to train the model and then evaluate it on values it has not seen. We can also look for signs of overfitting, which occurs when the model follows the training set too closely: the test score will be much lower than the training score.

X_train, X_test, y_train, y_test = train_test_split(X_scaled,y, test_size = .3)
(clf.score(X_train,y_train),clf.score(X_test,y_test))
(0.6614220617315794, 0.6430004011231448)

Since the training and test set scores are roughly the same, we do not see signs of overfitting.
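One caveat: clf above was fit on all of X_scaled before the split, so the test score is not a true held-out estimate. The more standard order is to split first and fit only on the training portion; a minimal sketch (clf2 and random_state are additions for illustration and reproducibility):

# Split first, then fit on the training set only
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=0)
clf2 = KNeighborsClassifier(n_neighbors=10)
clf2.fit(X_train, y_train)
print(clf2.score(X_train, y_train), clf2.score(X_test, y_test))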

log_loss(y,clf.predict_proba(X_scaled))
0.6154558798665394

Now we will graph the dataset. However, the income range contains many outliers, so we will create a new dataset, called graph, containing only incomes below 600000.

alt.data_transformers.disable_max_rows()
alt.Chart(app).mark_circle().encode(
    alt.X('DAYS_BIRTH', scale=alt.Scale(zero=False)),
    alt.Y('AMT_INCOME_TOTAL',scale=alt.Scale(zero=False)),
    color = 'Approval',
    tooltip = ['DAYS_BIRTH','AMT_INCOME_TOTAL']
).properties(
    title = "Age vs Income"
)
graph = app[app['AMT_INCOME_TOTAL'] < 600000]
alt.data_transformers.disable_max_rows()
c1 = alt.Chart(graph).mark_circle().encode(
    alt.X('DAYS_BIRTH',scale=alt.Scale(zero= False)),
    alt.Y('AMT_INCOME_TOTAL',scale=alt.Scale(domain=(0,600000))),
    color = 'Approval',
    tooltip = ['DAYS_BIRTH','AMT_INCOME_TOTAL']
).properties(
    title = "Age vs Income"
)
c1

We will make a second graph, colored by the predicted approval, to compare.

app["pred"] = clf.predict(X_scaled)
graph = app[app['AMT_INCOME_TOTAL'] < 600000]
app
ID CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE DAYS_BIRTH DAYS_EMPLOYED FLAG_MOBIL FLAG_WORK_PHONE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS Approval pred
0 5008804 M Y Y 0 427500.0 Working Higher education Civil marriage Rented apartment 32.890411 12.443836 1 1 0 0 NaN 2.0 N Y
1 5008805 M Y Y 0 427500.0 Working Higher education Civil marriage Rented apartment 32.890411 12.443836 1 1 0 0 NaN 2.0 Y Y
2 5008806 M Y Y 0 112500.0 Working Secondary / secondary special Married House / apartment 58.832877 3.106849 1 0 0 0 Security staff 2.0 N Y
3 5008808 F N Y 0 270000.0 Commercial associate Secondary / secondary special Single / not married House / apartment 52.356164 8.358904 1 0 1 1 Sales staff 1.0 Y Y
4 5008809 F N Y 0 270000.0 Commercial associate Secondary / secondary special Single / not married House / apartment 52.356164 8.358904 1 0 1 1 Sales staff 1.0 Y Y
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
16612 5150465 F Y Y 1 112500.0 Working Secondary / secondary special Married House / apartment 38.997260 0.846575 1 0 0 0 Core staff 3.0 N N
16613 5150466 F Y Y 1 112500.0 Working Secondary / secondary special Married House / apartment 38.997260 0.846575 1 0 0 0 Core staff 3.0 N N
16614 5150467 F N Y 0 180000.0 Working Secondary / secondary special Married House / apartment 48.528767 6.687671 1 0 0 0 Cooking staff 2.0 N N
16615 5150468 F N Y 0 180000.0 Working Secondary / secondary special Married House / apartment 48.528767 6.687671 1 0 0 0 Cooking staff 2.0 N N
16616 5150479 F N N 0 90000.0 Working Higher education Married With parents 34.010959 0.350685 1 0 0 0 NaN 2.0 Y N

16617 rows × 20 columns

alt.data_transformers.disable_max_rows()
c2 = alt.Chart(graph).mark_circle().encode(
    alt.X('DAYS_BIRTH',scale=alt.Scale(zero= False)),
    alt.Y('AMT_INCOME_TOTAL',scale=alt.Scale(domain=(0,600000))),
    color = 'pred',
    tooltip = ['DAYS_BIRTH','AMT_INCOME_TOTAL']
).properties(
    title = "Age vs Income"
)
c1
c2

For fun, here is an additional graph with more interactive elements, such as a selection linked to a bar chart. I adapted some of this code from altair-viz.github.io, which is listed in the references section.

interval = alt.selection_interval()
alt.data_transformers.disable_max_rows()
c3 = alt.Chart(graph).mark_circle().encode(
    alt.X('DAYS_BIRTH',scale=alt.Scale(zero=False)),
    alt.Y('AMT_INCOME_TOTAL',scale=alt.Scale(domain=(0,600000))),
    color=alt.condition(interval, 'Approval', alt.value('lightgray')),
    tooltip = ['AMT_INCOME_TOTAL','DAYS_BIRTH']
).properties(
    title = "Income vs Work Experience"
).add_selection(
    interval
)

c4 = alt.Chart(graph).mark_bar().encode(
    x ='Approval',
    y = alt.Y('count()'),
    color='Approval'
).transform_filter(
    interval
)
c3|c4

Summary

We were able to create a KNeighborsClassifier() model to predict whether or not an applicant would get approved based on their income, years employed, and age.

The most difficult and time-consuming part of this project was the beginning, where I had to deal with two different datasets connected only by an ID number. There were many mismatches in dataset lengths, and I had to use several methods from outside of class.

Overall, I really enjoyed this project. I love credit cards, and it was fun determining which applicants got approved based on criteria that I came up with.

References


Dataset origin: https://www.kaggle.com/ginaerian/credit-card-approval-prediction-gina

Deleting rows from a dataframe with pandas: https://www.shanelynn.ie/pandas-drop-delete-dataframe-rows-columns/

Resetting the index of a dataframe with pandas: https://www.machinelearningplus.com/pandas/pandas-reset-index/

Sorting a dataset by a specific column with pandas: https://pandas.pydata.org/docs/reference/api/pandas.Series.sort_values.html

Altair interactivity tutorial: https://altair-viz.github.io/altair-tutorial/notebooks/06-Selections.html
