Analysis of credit card fraud

Author: Linglin Tian

Course Project, UC Irvine, Math 10, Spring 2022

Introduction

With the development of technology, digital payment has become more and more common, largely because many people find it more convenient than paying in cash. At the same time, more cyber criminals are stealing people's money when they pay by card, so the number of credit card frauds is increasing. The main goal of this project is to give some suggestions about how to prevent credit card fraud.

In this project, we will use the dataset called “card_transdata” to investigate the features of credit card fraud. We will also analyze the relationship between the features of a transaction and the probability that the transaction is fraudulent.

Main portion of the project

import pandas as pd 
import numpy as np
import altair as alt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.neighbors import KNeighborsClassifier

Import & Clean Data

#Import the data and drop any columns that contain missing values
df=pd.read_csv("card_transdata.csv")
df=df.dropna(axis=1)
df.head()
df.describe()
distance_from_home distance_from_last_transaction ratio_to_median_purchase_price repeat_retailer used_chip used_pin_number online_order fraud
count 1000000.000000 1000000.000000 1000000.000000 1000000.000000 1000000.000000 1000000.000000 1000000.000000 1000000.000000
mean 26.628792 5.036519 1.824182 0.881536 0.350399 0.100608 0.650552 0.087403
std 65.390784 25.843093 2.799589 0.323157 0.477095 0.300809 0.476796 0.282425
min 0.004874 0.000118 0.004399 0.000000 0.000000 0.000000 0.000000 0.000000
25% 3.878008 0.296671 0.475673 1.000000 0.000000 0.000000 0.000000 0.000000
50% 9.967760 0.998650 0.997717 1.000000 0.000000 0.000000 1.000000 0.000000
75% 25.743985 3.355748 2.096370 1.000000 1.000000 0.000000 1.000000 0.000000
max 10632.723672 11851.104565 267.802942 1.000000 1.000000 1.000000 1.000000 1.000000

Rescaling the Data

#Rescale the numeric columns so that each has mean 0 and standard deviation 1
columns=["distance_from_home","distance_from_last_transaction","ratio_to_median_purchase_price","repeat_retailer"]
scaler = StandardScaler(with_mean=True, with_std=True)
scaler.fit(df[columns])
df[columns] = scaler.transform(df[columns])
df.describe()
distance_from_home distance_from_last_transaction ratio_to_median_purchase_price repeat_retailer used_chip used_pin_number online_order fraud
count 1.000000e+06 1.000000e+06 1.000000e+06 1.000000e+06 1000000.000000 1000000.000000 1000000.000000 1000000.000000
mean 2.274163e-16 6.134027e-17 -2.633751e-16 -1.060485e-17 0.350399 0.100608 0.650552 0.087403
std 1.000001e+00 1.000001e+00 1.000001e+00 1.000001e+00 0.477095 0.300809 0.476796 0.282425
min -4.071511e-01 -1.948839e-01 -6.500182e-01 -2.727890e+00 0.000000 0.000000 0.000000 0.000000
25% -3.479205e-01 -1.834088e-01 -4.816812e-01 3.665837e-01 0.000000 0.000000 0.000000 0.000000
50% -2.547919e-01 -1.562457e-01 -2.952096e-01 3.665837e-01 0.000000 0.000000 1.000000 0.000000
75% -1.353107e-02 -6.503759e-02 9.722443e-02 3.665837e-01 1.000000 0.000000 1.000000 0.000000
max 1.621956e+02 4.583845e+02 9.500641e+01 3.665837e-01 1.000000 1.000000 1.000000 1.000000
#Rename the columns that have long names to make the DataFrame easier to read
df.rename(columns={"distance_from_home":"d-home","distance_from_last_transaction":"d-lastT","ratio_to_median_purchase_price":"p/meanp"}, inplace=True)
df
d-home d-lastT p/meanp repeat_retailer used_chip used_pin_number online_order fraud
0 0.477882 -0.182849 0.043491 0.366584 1.0 0.0 0.0 0.0
1 -0.241607 -0.188094 -0.189300 0.366584 0.0 0.0 0.0 0.0
2 -0.329369 -0.163733 -0.498812 0.366584 0.0 0.0 1.0 0.0
3 -0.372854 0.021806 -0.522048 0.366584 1.0 0.0 1.0 0.0
4 0.268572 -0.172968 0.142373 0.366584 1.0 0.0 1.0 0.0
... ... ... ... ... ... ... ... ...
999995 -0.373473 -0.190529 -0.070505 0.366584 1.0 0.0 0.0 0.0
999996 -0.103318 -0.091035 0.340808 0.366584 1.0 0.0 0.0 0.0
999997 -0.362650 -0.137903 -0.573694 0.366584 1.0 0.0 1.0 0.0
999998 -0.342098 -0.185523 -0.481628 0.366584 0.0 0.0 1.0 0.0
999999 0.481403 -0.182579 -0.513384 0.366584 1.0 0.0 1.0 0.0

1000000 rows × 8 columns

ncolumn=["d-home","d-lastT","p/meanp","repeat_retailer","used_chip","used_pin_number","online_order"]

“d-home” means “the distance from home where the transaction happened”

“d-lastT” means “the distance from the place where the last transaction happened”

“p/meanp” means “the ratio of the purchase price of the transaction to the median purchase price”

Logistic Regression (Finding the relationship)

#Split the data into training and test sets, then fit a logistic regression
X_train, X_test, y_train, y_test = train_test_split(
    df[ncolumn], df["fraud"], test_size = 0.3)
clf = LogisticRegression()
clf.fit(X_train, y_train)
clf.predict(X_train)
clf.predict_proba(X_train)
array([[1.79341797e-02, 9.82065820e-01],
       [9.99937512e-01, 6.24876526e-05],
       [9.61628062e-01, 3.83719381e-02],
       ...,
       [9.66255204e-01, 3.37447955e-02],
       [9.89671832e-01, 1.03281684e-02],
       [9.99989612e-01, 1.03879242e-05]])
#Convert the NumPy array of coefficients to a Python list
arr=clf.coef_[0]
coef=arr.tolist()
coef
[0.9794989676848387,
 0.6563792613964299,
 2.387060006684409,
 -0.1977487331367303,
 -1.04710565370457,
 -13.211175967156898,
 6.507723297259031]
list(clf.coef_[0])
[0.9794989676848387,
 0.6563792613964299,
 2.387060006684409,
 -0.1977487331367303,
 -1.04710565370457,
 -13.211175967156898,
 6.507723297259031]
max(coef)
6.507723297259031
my_dict={ncolumn[i]:coef[i] for i in range(len(ncolumn))}
my_dict
{'d-home': 0.9794989676848387,
 'd-lastT': 0.6563792613964299,
 'p/meanp': 2.387060006684409,
 'repeat_retailer': -0.1977487331367303,
 'used_chip': -1.04710565370457,
 'used_pin_number': -13.211175967156898,
 'online_order': 6.507723297259031}
for x in my_dict:
    print(f"the coefficient between {x} and fraud is {my_dict[x]}")
the coefficient between d-home and fraud is 0.9794989676848387
the coefficient between d-lastT and fraud is 0.6563792613964299
the coefficient between p/meanp and fraud is 2.387060006684409
the coefficient between repeat_retailer and fraud is -0.1977487331367303
the coefficient between used_chip and fraud is -1.04710565370457
the coefficient between used_pin_number and fraud is -13.211175967156898
the coefficient between online_order and fraud is 6.507723297259031

As we can see, the largest absolute value among the coefficients is |-13.211175967156898|, which comes from the “used_pin_number” column. This shows that using a PIN number during a transaction has the strongest influence on whether the transaction is fraudulent. Moreover, the coefficient is negative, which means that when a PIN number is used in a transaction, the predicted probability of fraud decreases.
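
To double-check this ordering, the coefficients in the my_dict dictionary above can be sorted by absolute value. This short sketch is added here only for illustration and is not part of the original analysis:

#Sort the features by the absolute value of their logistic regression coefficient
sorted_coefs = sorted(my_dict.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, value in sorted_coefs:
    print(f"{name}: {value:.4f}")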

mean_absolute_error(clf.predict(X_test), y_test)
0.04092666666666667
mean_absolute_error(clf.predict(X_train), y_train)
0.04167285714285714

According to the mean_absolute_error values, the error on the test set is about the same as (in fact slightly smaller than) the error on the training set, so the model does not seem to be overfitting, and the predictions of the logistic regression are fairly accurate.
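
Since fraud is a 0/1 label, the mean absolute error of the hard predictions is just the misclassification rate, so classification accuracy gives the same information from the other direction. A minimal check, reusing the fitted clf and the split from above:

#Accuracy = 1 - mean absolute error for 0/1 predictions, so these should agree with the errors above
print("test accuracy:", clf.score(X_test, y_test))
print("train accuracy:", clf.score(X_train, y_train))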

Visualizing the Relationship

Create a new DataFrame which contains the coefficients of the different factors. Create a bar chart to see which factor has the most influence. Create a line chart to see the relationship between “fraud” and each factor.

#Create a new DataFrame of the coefficients
df_coef = pd.DataFrame()
df_coef["factor"] = ncolumn
df_coef["value"] = coef
df_coef["type"] = ['negative' if m < 0 else 'positive' for m in coef]
df_coef
factor value type
0 d-home 0.979499 positive
1 d-lastT 0.656379 positive
2 p/meanp 2.387060 positive
3 repeat_retailer -0.197749 negative
4 used_chip -1.047106 negative
5 used_pin_number -13.211176 negative
6 online_order 6.507723 positive
#Create an interactive bar chart of the coefficients
single = alt.selection_single()
alt.Chart(df_coef).mark_bar().encode(
    x = 'factor',
    y = 'value',
    color=alt.condition(single, 'type', alt.value('lightgray')),
    opacity=alt.value(0.5),
).properties(
    title = 'Factor Influence'
).add_selection(
    single
)

According to the graph, used_pin_number has the strongest influence on whether the transaction is fraudulent.

ncolumn2=["d-home","d-lastT","p/meanp","repeat_retailer"]
#Use only the first 1000 rows so that Altair can render the charts
df2=df.iloc[:1000,:]
chart_list = []
for i in ncolumn:
    c_temp = alt.Chart(df2).mark_line(color="red", clip=True).encode(
        x=i,
        y=alt.Y("fraud",scale=alt.Scale(domain=(0,3)))
    )
    chart_list.append(c_temp)
alt.vconcat(*chart_list)

According to those graphs, when the scaled distance from home (“d-home”) is roughly between 4 and 11, the probability of a fraudulent transaction is high. For the factor “d-lastT”, a distance from the last transaction roughly between 5 and 15 also corresponds to a high probability of fraud, and the probability decreases as the distance increases from about 20 to 40. For the factor “p/meanp”, most fraudulent transactions happen in the range 5 to 6, and above 6 the number of fraudulent transactions decreases. The relationship with repeat_retailer is negative. Since “used_chip”, “used_pin_number”, and “online_order” only take the values 0 and 1 (false and true), their graphs do not show a visible linear relationship, but we can use the coefficients to see those relationships.
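
The line charts above use only the first 1000 rows and are fairly noisy, so one way to double-check these readings is to bin each continuous feature and compute the average fraud rate per bin. This is a minimal sketch, not part of the original analysis; the decile bins from pd.qcut are an arbitrary choice:

#Average fraud rate within deciles of each continuous (scaled) feature
for col in ["d-home", "d-lastT", "p/meanp"]:
    bins = pd.qcut(df[col], 10, duplicates="drop")
    print(df.groupby(bins)["fraud"].mean(), "\n")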

KNeighborsClassifier

Using KNeighborsClassifier to predict whether a transaction is fraudulent, given the information about the features of the transaction.

from sklearn.neighbors import KNeighborsClassifier
df2=df.iloc[:1000,:].copy() #use only the first 1000 rows since the dataset is so large
df2["fraud2"] = "Yes"       #add a new column to df2
df2.loc[df2['fraud'] == 0, 'fraud2'] = "No"   #change the value 0 to "No", which means the transaction is not fraudulent
neigh = KNeighborsClassifier(n_neighbors=6)
neigh.fit(df2[ncolumn],df2["fraud2"])
df2['Pred']=neigh.predict(df2[ncolumn])
df2
d-home d-lastT p/meanp repeat_retailer used_chip used_pin_number online_order fraud fraud2 Pred
0 0.477882 -0.182849 0.043491 0.366584 1.0 0.0 0.0 0.0 No No
1 -0.241607 -0.188094 -0.189300 0.366584 0.0 0.0 0.0 0.0 No No
2 -0.329369 -0.163733 -0.498812 0.366584 0.0 0.0 1.0 0.0 No No
3 -0.372854 0.021806 -0.522048 0.366584 1.0 0.0 1.0 0.0 No No
4 0.268572 -0.172968 0.142373 0.366584 1.0 0.0 1.0 0.0 No No
... ... ... ... ... ... ... ... ... ... ...
995 -0.256235 -0.155319 -0.527770 0.366584 0.0 0.0 0.0 0.0 No No
996 2.163347 0.049059 -0.502699 0.366584 1.0 0.0 0.0 0.0 No No
997 0.266381 -0.175142 -0.428447 0.366584 0.0 0.0 1.0 0.0 No No
998 -0.361372 -0.187394 -0.386045 0.366584 1.0 0.0 1.0 0.0 No No
999 0.175937 -0.188424 0.764918 0.366584 0.0 0.0 1.0 0.0 No No

1000 rows × 10 columns

neigh.score(df2[ncolumn], df2["fraud2"])
0.976

The score is very high, which means the accuracy of the KNeighborsClassifier predictions is high. (Note that this score is computed on the same 1000 rows the classifier was fitted on, so it may be somewhat optimistic.)
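
A fairer estimate comes from scoring the classifier on rows it was not fitted on. This is a minimal sketch reusing df2 and ncolumn from above; the variable names Xk_train, neigh2, etc. are introduced here for illustration:

#Hold out 30% of the 1000 rows and score a KNN model on the unseen part
Xk_train, Xk_test, yk_train, yk_test = train_test_split(
    df2[ncolumn], df2["fraud2"], test_size=0.3)
neigh2 = KNeighborsClassifier(n_neighbors=6)
neigh2.fit(Xk_train, yk_train)
neigh2.score(Xk_test, yk_test)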

Visualizing the Prediction

ncolumn3=["d-lastT","p/meanp","repeat_retailer","used_chip","used_pin_number","online_order"]
#Scatter plots of d-home against each other feature, colored by the KNN prediction
chart_list1=[]
for m in ncolumn3:
    c_temp1=alt.Chart(df2).mark_circle().encode(
            x=alt.X("d-home", scale=alt.Scale(zero=False)),
            y=alt.Y(m, scale=alt.Scale(zero=False)),
            color="Pred"
    )
    chart_list1.append(c_temp1)
alt.vconcat(*chart_list1)

Since the values in the columns “used_chip”, “used_pin_number”, and “online_order” are only 0 and 1 (converted from boolean values), create a new list containing those column names, then count the values 0 and 1 in each column.

column2=["used_chip","used_pin_number","online_order"]
[df[h].value_counts() for h in column2]
[0.0    649601
 1.0    350399
 Name: used_chip, dtype: int64,
 0.0    899392
 1.0    100608
 Name: used_pin_number, dtype: int64,
 1.0    650552
 0.0    349448
 Name: online_order, dtype: int64]

The value 0 means false, and the value 1 means true. In the factor called “used_chip”, the number of 0 values is larger than the number of 1 values, which means most transactions do not go through a chip (credit card). According to the graph, most “Yes” points are at a “used_chip” value of 0, which means that not using a chip in a transaction increases the likelihood of fraud.

In the factor called “used_pin_number”, the number of 0 values is larger than the number of 1 values, which means most transactions do not use a PIN number. According to the graph of “used_pin_number” and “fraud”, most “Yes” points are at a “used_pin_number” value of 0. This shows that not using a PIN number in a transaction increases the probability of fraud.

In the factor called “online_order”, there are more 1 values than 0 values, which means most transactions are online orders. According to the graph of “online_order” and “fraud”, most “Yes” points are at an “online_order” value of 1. This suggests that online shopping increases the number of fraudulent transactions.
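
These readings from the scatter plots can also be checked numerically by computing the fraud rate separately within the 0 group and the 1 group of each binary column. A minimal sketch using the full df and the column2 list above:

#Fraud rate within each value (0 or 1) of the binary columns
for h in column2:
    print(df.groupby(h)["fraud"].mean(), "\n")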

Summary

By mainly using two scikit-learn models (logistic regression and KNeighborsClassifier) and some Altair charts, we obtained a lot of information about the features of fraudulent transactions. With those results, we can offer some tips to help people reduce the number of fraudulent transactions.

Through the logistic regression, we obtained the coefficient between each feature of a transaction and fraud. We found that the most influential feature is the use of a PIN number. Therefore, in order to reduce fraudulent transactions, people can use a PIN number more often during transactions to protect their money from fraud. Moreover, the value counts show that there are many online orders currently, and the graph of “online_order” and “fraud” shows that more online orders come with more fraudulent transactions. Even though ordering online is very convenient, people still need to be cautious during online transactions.

References

  • The following documentation pages were used as references for the code in this project:

sklearn.preprocessing.StandardScaler

Logistic Regression

sklearn.neighbors.KNeighborsClassifier

Interactivity and Selections
