Fraudulent And Non-fraudulent Transactions in Credit Cards

Author: Yadi Wu

Course Project, UC Irvine, Math 10, S2

Introduction

The dataset used in this project contains credit card transactions made in September 2013 by European cardholders. It covers two days of transactions: out of 284,807 transactions in total, only 492 are fraudulent, which immediately raises a class-imbalance problem, with positive examples (fraudulent transactions) accounting for just 0.172% of all transactions. The dataset has 28 anonymized feature columns, named V1, V2, …, V28, which are the output of a PCA (principal component analysis) transformation; the only two features not transformed by PCA are “Time” and “Amount”. “Time” is the number of seconds elapsed between each transaction and the first transaction in the dataset, and “Amount” is the transaction amount. The label is “Class”: a value of 1 marks a fraudulent transaction and a value of 0 a normal transaction.

!pip install --upgrade pip
Successfully installed pip-22.1.2
!pip install imblearn==0.0
Successfully installed imbalanced-learn-0.9.0 imblearn-0.0
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import altair as alt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import recall_score, accuracy_score, confusion_matrix, classification_report,roc_auc_score
from sklearn import model_selection
from sklearn import metrics
import imblearn
from imblearn.over_sampling import SMOTE
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
#load data
df = pd.read_csv("creditcard.csv")
df.head()
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 123.50 0
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 69.99 0

5 rows × 31 columns

#shape of data
print(df.shape)
(284807, 31)
#Check if there are missing values
print(df.isna().sum())
Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64
# Check for duplicate rows
any(df.duplicated())
True
# Drop duplicate rows
df.drop_duplicates(inplace=True)
# Outlier detection: boxplots of the PCA components
fig , ax = plt.subplots(figsize=(18,12))
plt.title('Boxplot outlier detection',fontsize=15)
sns.boxplot(data=df.iloc[:,1:-2],ax=ax)
plt.show()
[Figure: boxplots of V1–V28 used for outlier detection]
#Descriptive statistical analysis of data
df.describe().T
count mean std min 25% 50% 75% max
Time 283726.0 94811.077600 47481.047891 0.000000 54204.750000 84692.500000 139298.000000 172792.000000
V1 283726.0 0.005917 1.948026 -56.407510 -0.915951 0.020384 1.316068 2.454930
V2 283726.0 -0.004135 1.646703 -72.715728 -0.600321 0.063949 0.800283 22.057729
V3 283726.0 0.001613 1.508682 -48.325589 -0.889682 0.179963 1.026960 9.382558
V4 283726.0 -0.002966 1.414184 -5.683171 -0.850134 -0.022248 0.739647 16.875344
V5 283726.0 0.001828 1.377008 -113.743307 -0.689830 -0.053468 0.612218 34.801666
V6 283726.0 -0.001139 1.331931 -26.160506 -0.769031 -0.275168 0.396792 73.301626
V7 283726.0 0.001801 1.227664 -43.557242 -0.552509 0.040859 0.570474 120.589494
V8 283726.0 -0.000854 1.179054 -73.216718 -0.208828 0.021898 0.325704 20.007208
V9 283726.0 -0.001596 1.095492 -13.434066 -0.644221 -0.052596 0.595977 15.594995
V10 283726.0 -0.001441 1.076407 -24.588262 -0.535578 -0.093237 0.453619 23.745136
V11 283726.0 0.000202 1.018720 -4.797473 -0.761649 -0.032306 0.739579 12.018913
V12 283726.0 -0.000715 0.994674 -18.683715 -0.406198 0.139072 0.616976 7.848392
V13 283726.0 0.000603 0.995430 -5.791881 -0.647862 -0.012927 0.663178 7.126883
V14 283726.0 0.000252 0.952215 -19.214325 -0.425732 0.050209 0.492336 10.526766
V15 283726.0 0.001043 0.914894 -4.498945 -0.581452 0.049299 0.650104 8.877742
V16 283726.0 0.001162 0.873696 -14.129855 -0.466860 0.067119 0.523512 17.315112
V17 283726.0 0.000170 0.842507 -25.162799 -0.483928 -0.065867 0.398972 9.253526
V18 283726.0 0.001515 0.837378 -9.498746 -0.498014 -0.002142 0.501956 5.041069
V19 283726.0 -0.000264 0.813379 -7.213527 -0.456289 0.003367 0.458508 5.591971
V20 283726.0 0.000187 0.769984 -54.497720 -0.211469 -0.062353 0.133207 39.420904
V21 283726.0 -0.000371 0.723909 -34.830382 -0.228305 -0.029441 0.186194 27.202839
V22 283726.0 -0.000015 0.724550 -10.933144 -0.542700 0.006675 0.528245 10.503090
V23 283726.0 0.000198 0.623702 -44.807735 -0.161703 -0.011159 0.147748 22.528412
V24 283726.0 0.000214 0.605627 -2.836627 -0.354453 0.041016 0.439738 4.584549
V25 283726.0 -0.000232 0.521220 -10.295397 -0.317485 0.016278 0.350667 7.519589
V26 283726.0 0.000149 0.482053 -2.604551 -0.326763 -0.052172 0.240261 3.517346
V27 283726.0 0.001763 0.395744 -22.565679 -0.070641 0.001479 0.091208 31.612198
V28 283726.0 0.000547 0.328027 -15.430084 -0.052818 0.011288 0.078276 33.847808
Amount 283726.0 88.472687 250.399437 0.000000 5.600000 22.000000 77.510000 25691.160000
Class 283726.0 0.001667 0.040796 0.000000 0.000000 0.000000 0.000000 1.000000

Class Distribution

df['Class'].value_counts().plot.barh()
plt.title("Number of fraudulent and legitimate transactions")
df['Class'].value_counts()
0    283253
1       473
Name: Class, dtype: int64
[Figure: bar chart of the number of fraudulent and legitimate transactions]

The code above helps me observe, understand, and organize the data.

KMeans Clustering

From the Class counts we know that among the roughly 280,000 rows there are only a few hundred (473) fraudulent transactions. First, I want to run KMeans clustering on the principal components V1 to V28 to see whether the clusters can pick out these frauds from the rest of the data.

By observing the boxplot we created, we can see that V1 to V28 contain many outliers. To keep these extreme values from dominating the clustering, I first standardize the features with StandardScaler.

from sklearn.cluster import KMeans
# Standardize the 28 PCA components so that extreme values do not dominate the distance calculation
n_components = 28
cols = [f"V{i}" for i in range(1, n_components+1)]
scaler = StandardScaler()
scaler.fit(df[cols])
df[cols] = scaler.transform(df[cols])
# Cluster the standardized components into two groups and store the labels in a new column
kmeans = KMeans(n_clusters = 2)
kmeans.fit(df[cols])
df["Cluster"] = kmeans.predict(df[cols])
df["Cluster"].value_counts()
1    244956
0     38770
Name: Cluster, dtype: int64
df
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V22 V23 V24 V25 V26 V27 V28 Amount Class Cluster
0 0.0 -0.701082 -0.041687 1.680101 0.976623 -0.247020 0.348012 0.193700 0.084434 0.333534 ... 0.383483 -0.177444 0.110157 0.247059 -0.392622 0.333033 -0.065850 149.62 0 1
1 0.0 0.608792 0.164138 0.109279 0.318998 0.042258 -0.060980 -0.065656 0.072903 -0.231703 ... -0.881454 0.162081 -0.561503 0.321175 0.260854 -0.027154 0.043219 2.69 0 1
2 1.0 -0.700336 -0.811337 1.174270 0.270648 -0.366756 1.352655 0.643223 0.210788 -1.381169 ... 1.065068 1.457772 -1.138484 -0.628161 -0.288861 -0.144325 -0.183824 378.66 0 0
3 1.0 -0.499064 -0.109972 1.187383 -0.608355 -0.008814 0.937245 0.192079 0.320843 -1.264664 ... 0.007299 -0.305465 -1.941446 1.242487 -0.460694 0.154039 0.185687 123.50 0 0
4 2.0 -0.597606 0.535539 1.025470 0.287092 -0.297036 0.072873 0.481517 -0.228725 0.747917 ... 1.101780 -0.220709 0.232904 -0.394800 1.041677 0.550001 0.654234 69.99 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
284802 172786.0 -6.102103 6.118855 -6.519873 -1.459282 -3.897079 -1.956335 -4.007632 6.196662 1.749010 ... 0.154412 1.626230 -0.841382 2.757072 0.518377 2.380049 2.509507 0.77 0 1
284803 172787.0 -0.379208 -0.030938 1.347812 -0.520175 0.629193 0.795504 0.018351 0.250814 0.535282 ... 1.275826 0.019665 -1.678330 -1.163409 -0.820253 0.168567 -0.164849 24.79 0 1
284804 172788.0 0.982354 -0.180433 -2.155033 -0.392355 1.908988 2.276699 -0.243249 0.601561 0.396215 ... 0.798074 -0.060444 1.056626 0.510299 -0.181557 0.006802 -0.082640 67.88 0 1
284805 172788.0 -0.126465 0.324660 0.464577 0.489870 -0.275808 0.469130 -0.560399 0.576734 0.359367 ... 1.104223 -0.262138 0.203081 -1.091530 1.133734 0.270523 0.317004 10.00 0 1
284806 172792.0 -0.276860 -0.112709 0.465125 -0.355898 -0.010438 -0.486871 1.283094 -0.350956 0.445258 ... 0.887577 0.603781 0.014172 -0.908286 -1.697776 -0.010558 0.039941 217.00 0 1

283726 rows × 32 columns

After clustering, I need to check whether the rows with Class equal to 1 also received Cluster equal to 1. This helps me judge whether clustering is an effective way to identify fraud.

# Keep the fraudulent rows, then count how many have a cluster label different from Class
df_fraud = df[df["Class"]==1]
df_result = df_fraud[df_fraud["Class"] !=df_fraud["Cluster"]]
df_result.shape
(102, 32)

The output shows that 102 of the 473 fraudulent rows do not have Cluster equal to 1, and, more importantly, cluster 1 already contains the vast majority of all transactions (244,956 of 283,726). In other words, clustering on the data features does not isolate fraudulent behavior.
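A fuller picture of the overlap comes from cross-tabulating the true classes against the cluster labels; a small sketch using the df built above:

# Rows are the true Class labels, columns are the KMeans cluster assignments
pd.crosstab(df["Class"], df["Cluster"])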

Polynomial Regression

Because the dataset I chose is very large, I selected only the fraudulent transactions for the polynomial regression. I will model the relationship between Time and Amount within the fraudulent transactions.

df_fraud
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V22 V23 V24 V25 V26 V27 V28 Amount Class Cluster
541 406.0 -1.189998 1.187907 -1.068129 2.829108 -0.380547 -1.070182 -2.068312 1.181043 -2.527172 ... -0.048353 -0.746205 0.528353 0.085859 0.368612 0.655430 -0.438451 0.00 1 1
623 472.0 -1.565412 -1.914843 0.720398 1.620450 0.986181 -0.798604 0.263732 -0.056774 -0.245878 ... 0.601053 2.205812 -0.485477 0.537260 -0.301858 -0.643185 0.107360 529.00 1 1
4920 4462.0 -1.185441 1.070858 -0.239519 1.649866 -0.598005 -0.056045 0.456575 -0.337807 -0.216028 ... -1.286836 0.276620 -0.144551 -0.299072 -1.125972 0.095524 -0.468183 239.93 1 1
6108 6986.0 -2.260698 0.827413 -1.719688 1.897036 -0.820591 -1.280397 -2.849317 -0.210273 -0.224714 ... 0.244266 -0.699702 -0.088695 0.484705 -1.364245 -2.094537 2.588289 59.00 1 1
6329 7519.0 0.630546 1.836324 -2.854291 3.348765 2.630616 -1.018528 1.394231 -0.420256 -1.169578 ... -0.971868 -1.053394 -2.696166 2.857019 1.175491 -0.029765 0.445835 1.00 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
279863 169142.0 -0.992699 0.686092 -2.995961 1.239062 -1.138931 -1.508606 -0.720598 0.592056 -1.883493 ... -0.440514 1.024883 -0.487263 1.031686 1.635188 0.735115 0.449417 390.00 1 1
280143 169347.0 0.704633 0.785520 -3.318041 1.000449 0.320081 -0.995096 -1.152574 0.211509 -1.027667 ... 0.038989 -0.233828 -0.134181 1.001703 1.533689 0.978889 0.567300 0.76 1 1
280149 169351.0 -0.350129 0.686525 -1.468379 0.333249 -0.815079 -0.001657 -1.821788 1.027107 -0.593938 ... 1.151230 0.305829 0.052600 -1.418716 0.976993 0.968670 0.590850 77.89 1 1
281144 169966.0 -1.601495 0.358292 -3.580180 1.287005 -0.611796 -2.209134 -1.800009 0.898677 -1.488591 ... -0.371533 -0.731610 -0.303609 -0.629170 1.257056 2.231529 -0.775084 245.00 1 1
281674 170348.0 1.019526 0.098749 -1.713455 0.291077 0.834651 -0.071742 0.180220 -0.057274 0.528919 ... -0.407316 -0.116034 -0.743818 0.601472 -0.601109 0.003094 -0.048338 42.53 1 1

473 rows × 32 columns

# Standardize Time and Amount within the fraud subset before fitting
scaler = StandardScaler()
sub = ["Time","Amount"]
scaler.fit(df_fraud[sub])
df_fraud[sub] = scaler.transform(df_fraud[sub])
df_fraud
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V22 V23 V24 V25 V26 V27 V28 Amount Class Cluster
541 -1.647524 -1.189998 1.187907 -1.068129 2.829108 -0.380547 -1.070182 -2.068312 1.181043 -2.527172 ... -0.048353 -0.746205 0.528353 0.085859 0.368612 0.655430 -0.438451 -0.476548 1 1
623 -1.646165 -1.565412 -1.914843 0.720398 1.620450 0.986181 -0.798604 0.263732 -0.056774 -0.245878 ... 0.601053 2.205812 -0.485477 0.537260 -0.301858 -0.643185 0.107360 1.558570 1 1
4920 -1.564041 -1.185441 1.070858 -0.239519 1.649866 -0.598005 -0.056045 0.456575 -0.337807 -0.216028 ... -1.286836 0.276620 -0.144551 -0.299072 -1.125972 0.095524 -0.468183 0.446488 1 1
6108 -1.512090 -2.260698 0.827413 -1.719688 1.897036 -0.820591 -1.280397 -2.849317 -0.210273 -0.224714 ... 0.244266 -0.699702 -0.088695 0.484705 -1.364245 -2.094537 2.588289 -0.249569 1 1
6329 -1.501120 0.630546 1.836324 -2.854291 3.348765 2.630616 -1.018528 1.394231 -0.420256 -1.169578 ... -0.971868 -1.053394 -2.696166 2.857019 1.175491 -0.029765 0.445835 -0.472701 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
279863 1.825501 -0.992699 0.686092 -2.995961 1.239062 -1.138931 -1.508606 -0.720598 0.592056 -1.883493 ... -0.440514 1.024883 -0.487263 1.031686 1.635188 0.735115 0.449417 1.023822 1 1
280143 1.829720 0.704633 0.785520 -3.318041 1.000449 0.320081 -0.995096 -1.152574 0.211509 -1.027667 ... 0.038989 -0.233828 -0.134181 1.001703 1.533689 0.978889 0.567300 -0.473624 1 1
280149 1.829803 -0.350129 0.686525 -1.468379 0.333249 -0.815079 -0.001657 -1.821788 1.027107 -0.593938 ... 1.151230 0.305829 0.052600 -1.418716 0.976993 0.968670 0.590850 -0.176897 1 1
281144 1.842461 -1.601495 0.358292 -3.580180 1.287005 -0.611796 -2.209134 -1.800009 0.898677 -1.488591 ... -0.371533 -0.731610 -0.303609 -0.629170 1.257056 2.231529 -0.775084 0.465992 1 1
281674 1.850323 1.019526 0.098749 -1.713455 0.291077 0.834651 -0.071742 0.180220 -0.057274 0.528919 ... -0.407316 -0.116034 -0.743818 0.601472 -0.601109 0.003094 -0.048338 -0.312931 1 1

473 rows × 32 columns

Looking at the scatter plot first, the relationship between Time and Amount does not look linear, so I choose to fit a polynomial regression instead.

# Build polynomial features t1..t9 as powers of the standardized Time
cols2 = []
for deg in range(1,10):
    cols2.append(f"t{deg}")
    df_fraud[f"t{deg}"] = df_fraud["Time"]**deg
# Fit a linear regression on the polynomial features and store the predictions
reg = LinearRegression()
reg.fit(df_fraud[cols2], df_fraud["Amount"])
df_fraud["Pred"] = reg.predict(df_fraud[cols2])
# Scatter plot of the data with the fitted polynomial curve overlaid in red
c = alt.Chart(df_fraud).mark_circle().encode(
    x="Time",
    y="Amount"
)
c_poly = alt.Chart(df_fraud).mark_line(color="red", clip=True).encode(
    x="Time",
    y="Pred"
)

c + c_poly
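The same degree-9 fit can also be written with scikit-learn's PolynomialFeatures inside a Pipeline, which avoids creating the t1–t9 columns by hand. A sketch, assuming df_fraud as prepared above (the column name "Pred_pipe" is only for this illustration):

# Equivalent polynomial regression built as a Pipeline (sketch)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

poly_pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=9, include_bias=False)),
    ("reg", LinearRegression()),
])
poly_pipe.fit(df_fraud[["Time"]], df_fraud["Amount"])
df_fraud["Pred_pipe"] = poly_pipe.predict(df_fraud[["Time"]])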

Recall that the number of fraudulent transactions (492 in the raw data, 473 after removing duplicates) is far lower than the number of non-fraudulent ones. This is an imbalanced dataset.

What is an imbalanced dataset

A dataset is called imbalanced when there is a large difference between the number of examples in its classes; in other words, when the number of records per target class varies widely. In our dataset, out of 284,807 records only 492 belong to class 1, which clearly makes it imbalanced.
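The imbalance can be quantified directly from the cleaned data; a small sketch:

# Share of each class; class 1 (fraud) is well under 1% of the rows
df["Class"].value_counts(normalize=True)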

What can it cause

Because class 1 has very few records compared to class 0, a model can fail to classify class 1 correctly, and this is exactly the problem caused by the imbalance. To make a model work well on imbalanced data, we also have to measure its performance in a more descriptive way: we cannot rely on accuracy alone. Since about 99.8% of the data belongs to class 0, a model can score a very high accuracy regardless of how poorly it performs on the minority class. In the imbalanced case we should therefore choose our metrics wisely; the confusion matrix and the classification report give a correct picture even when the dataset is unbalanced.
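To see why accuracy alone is misleading, compare against a baseline that always predicts the majority class. A sketch using sklearn's DummyClassifier on the original, un-resampled data (the x0_/y0_ variable names are only for this illustration):

# A majority-class baseline: accuracy is close to 1, yet it catches no fraud at all
from sklearn.dummy import DummyClassifier

x0_train, x0_test, y0_train, y0_test = train_test_split(
    df.drop("Class", axis=1), df["Class"], random_state=42, test_size=0.25)
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(x0_train, y0_train)
print("accuracy:", accuracy_score(y0_test, dummy.predict(x0_test)))  # close to 1
print("recall:  ", recall_score(y0_test, dummy.predict(x0_test)))    # 0.0, no fraud detected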

Processing method

Random sampling: one of the most widely used techniques for making imbalanced data balanced is random resampling, which comes in two main forms. Random undersampling randomly selects examples from the majority class and removes them from the training dataset. Random oversampling randomly picks samples from the minority class, with replacement, and adds copies of them to the training dataset. Both are sketched below.
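Both random variants are available in imblearn; a sketch on this dataset (the features/labels names are only for this illustration):

# Random under- and over-sampling with imblearn
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

features = df.drop("Class", axis=1)
labels = df["Class"]
x_under, y_under = RandomUnderSampler(random_state=42).fit_resample(features, labels)
x_over, y_over = RandomOverSampler(random_state=42).fit_resample(features, labels)
print(y_under.value_counts())  # both classes shrunk to the minority count
print(y_over.value_counts())   # both classes grown to the majority count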

Synthetic Minority Oversampling Technique (SMOTE): SMOTE first randomly selects a minority-class instance a and finds its k nearest minority-class neighbors. A synthetic instance is then created by randomly selecting one of those k neighbors, b, and drawing a point on the line segment joining a and b in feature space; that is, the synthetic instance is a convex combination of a and b, as illustrated below. Besides these resampling methods, ensemble methods are also useful for imbalanced datasets. I will use SMOTE on this dataset, since generating new synthetic samples is preferable to simply copying or deleting existing rows.
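The interpolation step itself fits in a few lines of numpy; this is only an illustration of the idea, not imblearn's actual implementation:

# One SMOTE-style synthetic point: a convex combination of a minority sample and a neighbor
rng = np.random.default_rng(0)
a = np.array([1.0, 2.0])       # a minority-class instance
b = np.array([3.0, 0.0])       # one of its k nearest minority-class neighbors
lam = rng.random()             # lambda drawn uniformly from [0, 1)
synthetic = a + lam * (b - a)  # lies on the line segment between a and b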

x = df.drop("Class",axis=1) # x will contain all data columns except Class
y = df['Class']
print(x.shape,y.shape)
(283726, 31) (283726,)
# Class distribution before SMOTE
y.value_counts().plot.bar()
<AxesSubplot:>
[Figure: bar chart of class counts before SMOTE]
smote = SMOTE(random_state=42)
X,Y  = smote.fit_resample(x, y)
print("After SMOTE")
Y.value_counts().plot.bar()
After SMOTE
<AxesSubplot:>
[Figure: bar chart of class counts after SMOTE]
# Take a 10,000-row random sample of the balanced data to keep training fast
sample=pd.concat([X,Y],axis=1).sample(10000,random_state=123)
sample
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V22 V23 V24 V25 V26 V27 V28 Amount Cluster Class
442331 77172.237657 0.540937 0.790502 -0.849117 1.508323 0.596676 -0.842425 0.319839 -0.083568 -0.051690 ... -0.917181 -0.174921 -0.048331 1.219082 -0.675134 0.264111 0.385684 1.000000 1 1
158471 112183.000000 0.995435 -0.402174 -0.191834 0.368670 -0.735063 -0.393922 -0.638201 0.069545 1.441512 ... 0.724353 0.273253 -0.133346 -0.697353 1.204531 -0.087840 -0.151104 28.750000 1 0
93113 64421.000000 0.010810 -2.128306 -0.521554 -0.390150 -1.610271 -0.594810 0.214228 -0.374345 -1.561731 ... -0.827589 -1.212313 0.671466 0.672146 -0.296187 -0.355914 0.431251 792.000000 0 0
60869 49670.000000 0.464578 0.173652 0.296629 2.013869 -0.064909 0.272325 -0.084984 0.219016 -0.426335 ... -0.010324 -0.181532 0.221520 0.876706 0.148532 0.044308 0.135465 75.080000 1 0
60103 49287.000000 -0.264812 -1.895529 0.134618 0.616322 -1.167512 0.897394 0.118770 0.258287 0.780149 ... -0.754793 -0.942907 -0.313470 -0.685566 1.798187 -0.410245 0.424572 828.550000 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
223306 143665.000000 0.841603 -0.355296 -2.650655 0.247401 2.292331 2.217603 0.625064 0.345962 -0.423122 ... 1.150275 -0.619937 1.260476 1.598715 -0.412681 -0.201462 -0.140284 232.150000 1 0
550235 129386.841056 0.143811 0.902407 -2.546283 2.530591 -0.467587 -0.353537 -1.494261 0.791248 -0.459440 ... 0.885070 0.112711 0.328755 -0.519594 -0.113653 0.907097 0.302719 111.690586 1 1
234211 148243.000000 -0.305571 0.530397 0.049392 -0.694545 0.599879 0.176063 0.931478 -0.309567 0.831811 ... 0.746176 -0.476628 0.545621 0.036709 1.030441 -0.318004 -0.650986 35.970000 1 0
294575 43245.220256 -1.907073 1.322483 -3.173414 1.258307 -2.329949 -0.990770 -2.569501 -0.569741 -0.852165 ... -0.259052 0.370714 -0.146394 0.377387 -1.485624 -0.163583 -2.004174 269.933482 1 1
364614 154625.031140 -0.347385 2.739970 -4.371548 4.725430 0.680820 -1.373242 -1.769484 0.816217 -4.321947 ... -0.394643 -0.460494 -1.724119 -0.085912 1.161034 1.817458 1.122139 0.769328 1 1

10000 rows × 32 columns

# split training set and test set
X_train,X_test,y_train,y_test = train_test_split(sample.drop(['Time','Class'],axis=1),sample['Class'],random_state=42,test_size=.25)
plt.figure(figsize=(10,10))
sns.heatmap(df.corr(), cmap='tab20_r',linewidths=.5,annot=False)
<AxesSubplot:>
[Figure: correlation heatmap of all features]

Heat map

The heat map shows the pairwise correlation of the features in the data table: the darker the color, the stronger the correlation.
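The same correlation matrix can also be used to list which features are most strongly related to the Class label; a sketch:

# Absolute correlation of every column with Class, strongest first (Class itself comes out as 1.0)
df.corr()["Class"].abs().sort_values(ascending=False).head(10)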

logistic regression model

# K-fold cross-validation over candidate values of the regularization parameter C
C= [1.0,3.0,6.0,8.0,10.0]
# Build empty lists to store the mean and per-fold recall for each C
accuracy = []
per_accuracy=[]
for k in C:
    cv_result = model_selection.cross_val_score(LogisticRegression(C=k,class_weight=None,dual=False,fit_intercept=True,max_iter=100,penalty='l1',solver='liblinear'), 
                                              X_train, y_train, cv = 10, scoring='recall')
    accuracy.append(cv_result.mean())
    per_accuracy.append(cv_result)

I build empty lists to store the cross-validated scores, then use 10-fold cross-validation to compare the model's recall (the code uses scoring='recall') for each candidate value of C. I select the index corresponding to the largest mean score, and then plot a line chart of the different C values against their mean scores.
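Picking the index of the maximum mean score can be done programmatically; a small sketch (note that ties resolve to the first occurrence):

# C value with the largest mean cross-validated recall
best_C = C[int(np.argmax(accuracy))]
print(best_C)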

accuracy
[0.9632642123922992,
 0.9645765483503045,
 0.9645765483503045,
 0.9645765483503045,
 0.9645765483503045]
df1 = pd.DataFrame({"C":[1.0,3.0,6.0,8.0,10.0],"accuracy":[0.963001745200698,0.9643140811587033,0.9648390155419054,0.9651007956466175,0.9651007956466175]})
d1 = alt.Chart(df1).mark_line().encode(
    x = "C",
    y = alt.Y("accuracy", scale = alt.Scale(domain=(0.962,0.966))),
    tooltip = ["C","accuracy"]
)
d2 = alt.Chart(df1).mark_point().encode(
    x = "C",
    y = alt.Y("accuracy", scale = alt.Scale(domain=(0.962,0.966))),
    tooltip = ["C","accuracy"]
)
d1+d2

According to this chart, the mean score is highest at C = 10 (tied with C = 8), so next I will use C = 10, together with the logistic regression material I have learned and my own research, to build a logistic regression model. I will evaluate it with the confusion matrix, classification_report, accuracy_score, and recall_score.

#build logistic regression model
clf = LogisticRegression(C=10,solver='liblinear')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
acc_lr = accuracy_score(y_test, y_pred)
recall_lr=recall_score(y_test, y_pred, average='macro')
# confusion matrix
conf = confusion_matrix(y_test, y_pred)
#model evaluation report
clf_report = classification_report(y_test, y_pred)
print(f'The accuracy of the logistic regression model is:{acc_lr}')
print(f'Recall of Logistic Regression Models:{recall_lr}')
#Result visualization
sns.heatmap(conf,annot = True, cmap = 'GnBu')
#After a few tries, I found that annot=True is very important:
#without it, the counts do not appear on the confusion matrix heatmap.
plt.title('The confusion matrix for the logistic regression model is:')
plt.show()
print('Model Evaluation Report for Logistic Regression Models:')
print(clf_report)
The accuracy of the logistic regression model is:0.9724
Recall of Logistic Regression Models:0.9725358422939068
[Figure: confusion matrix heatmap for the logistic regression model]
Model Evaluation Report for Logistic Regression Models:
              precision    recall  f1-score   support

           0       0.96      0.99      0.97      1240
           1       0.99      0.96      0.97      1260

    accuracy                           0.97      2500
   macro avg       0.97      0.97      0.97      2500
weighted avg       0.97      0.97      0.97      2500

confusion matrix:

The entries on the main diagonal of the confusion matrix count correct predictions, and the off-diagonal entries count incorrect predictions.

Accuracy:

It is the proportion of samples (positive or negative) that the model predicts correctly: Accuracy = ((0,0) + (1,1)) / ((0,0) + (0,1) + (1,0) + (1,1)), where (i, j) denotes the confusion-matrix count of samples with true class i predicted as class j. The accuracy of our model comes out to 0.9724.

Recall:

It is the number of true positives correctly predicted relative to the total number of actual positives, also called the TPR (true positive rate). For class 1, recall = (1,1) / ((1,0) + (1,1)); for class 0, recall = (0,0) / ((0,0) + (0,1)).

Precision

It is the fraction of the samples predicted as a class that actually belong to that class. For class 1, precision = (1,1) / ((0,1) + (1,1)); for class 0, precision = (0,0) / ((0,0) + (1,0)).

Specificity:

It is the number of true negatives correctly predicted relative to the total number of actual negatives, also expressed as (1 - FPR). Specificity = (0,0) / ((0,0) + (0,1)).

micro avg:

The metric computed over all samples pooled together; for example, if 3 out of 5 samples in the whole dataset are predicted correctly, the micro average is 0.6.

macro avg:

The unweighted mean of the metric over the classes; for example, the macro-averaged recall above is (0.99 + 0.96) / 2 ≈ 0.97.

weighted avg:

The mean of the metric over the classes, weighted by each class's support (the number of samples in that class).
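As a cross-check, these quantities can be recomputed directly from the confusion matrix conf built above; a sketch, using sklearn's convention that conf[i, j] counts samples of true class i predicted as class j:

# Recompute the headline metrics by hand from the 2x2 confusion matrix
tn, fp, fn, tp = conf.ravel()              # conf = [[tn, fp], [fn, tp]]
accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
recall_1 = tp / (tp + fn)                  # recall for class 1 (TPR)
precision_1 = tp / (tp + fp)               # precision for class 1
specificity_1 = tn / (tn + fp)             # specificity (1 - FPR)
print(accuracy_manual, recall_1, precision_1, specificity_1)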

# Compute the false positive rate, true positive rate, and decision thresholds for the ROC curve
fpr,tpr,threshold =metrics.roc_curve(y_test,clf.predict_proba(X_test)[:,1])
roc_df = pd.DataFrame({"fpr": fpr, "tpr": tpr, "threshold": threshold})
roc_df
fpr tpr threshold
0 0.000000 0.000000 2.000000e+00
1 0.000000 0.384921 1.000000e+00
2 0.000000 0.400000 1.000000e+00
3 0.000000 0.411111 1.000000e+00
4 0.000000 0.418254 1.000000e+00
... ... ... ...
133 0.662903 0.998413 6.264739e-03
134 0.662903 0.999206 6.262106e-03
135 0.679839 0.999206 5.616998e-03
136 0.679839 1.000000 5.602523e-03
137 1.000000 1.000000 1.408001e-24

138 rows × 3 columns

Create a new pandas DataFrame from the fpr, tpr, and threshold arrays, then use it to draw the ROC curve with Altair.

roc_line = alt.Chart(roc_df).mark_line(color = 'red').encode(
    x=alt.X('fpr', title="false positive rate"),
    y=alt.Y('tpr', title="true positive rate")
)

roc_area = alt.Chart(roc_df).mark_area(fillOpacity = 0.5, fill = 'red').encode(
    x=alt.X('fpr', title="false positive rate"),
    y=alt.Y('tpr', title="true positive rate")
)

# Diagonal reference line (a random classifier) drawn as y = x
baseline = alt.Chart(roc_df).mark_line(strokeDash=[20,5], color = 'black').encode(
    x=alt.X('fpr', scale = alt.Scale(domain=[0, 1]), title=None),
    y=alt.Y('fpr', scale = alt.Scale(domain=[0, 1]), title=None)
)
roc_line + roc_area + baseline.properties(
    title='ROC curve for the logistic regression fraud model').interactive()
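The roc_auc_score imported at the top was not used yet; it gives the area under this curve as a single-number summary of the model (a sketch):

# Area under the ROC curve for the logistic regression model
auc_lr = roc_auc_score(y_test, clf.predict_proba(X_test)[:,1])
print(f'AUC of the logistic regression model: {auc_lr:.4f}')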

Summary

This is all the code for the project. First, I examined the roughly 280,000 records for duplicate values, missing values, and outliers. Since only a few hundred (473) of them are fraudulent, I studied imbalanced data and used SMOTE to balance the dataset. I also applied KMeans clustering and polynomial regression to explore and model the data, and finally built a logistic regression model and evaluated its validity, accuracy, and recall.