Analyses on NICS Firearm Background Checks

Nathan Samarasena

Course Project, UC Irvine, Math 10, W22

Introduction

For my project, I will be analyzing the ‘nics-firearm-background-checks.csv’ file from a BuzzFeed GitHub repository.

I will explore how well different columns in the data set predict certain aspects of the background checks, and whether some combinations of columns work better than others.

Each row of the data set summarizes the background checks performed in a given state in a given month.

Main portion of the project

Importing Libraries and Data

import seaborn as sns
import numpy as np
import pandas as pd
import altair as alt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss
df = pd.read_csv('nics-firearm-background-checks.csv')
df.shape
(15400, 27)
df.columns
Index(['month', 'state', 'permit', 'permit_recheck', 'handgun', 'long_gun',
       'other', 'multiple', 'admin', 'prepawn_handgun', 'prepawn_long_gun',
       'prepawn_other', 'redemption_handgun', 'redemption_long_gun',
       'redemption_other', 'returned_handgun', 'returned_long_gun',
       'returned_other', 'rentals_handgun', 'rentals_long_gun',
       'private_sale_handgun', 'private_sale_long_gun', 'private_sale_other',
       'return_to_seller_handgun', 'return_to_seller_long_gun',
       'return_to_seller_other', 'totals'],
      dtype='object')
df
month state permit permit_recheck handgun long_gun other multiple admin prepawn_handgun ... returned_other rentals_handgun rentals_long_gun private_sale_handgun private_sale_long_gun private_sale_other return_to_seller_handgun return_to_seller_long_gun return_to_seller_other totals
0 2022-02 Alabama 25401.0 499.0 21822.0 14541.0 1351.0 1260 0.0 13.0 ... 0.0 0.0 0.0 28.0 29.0 2.0 1.0 0.0 0.0 69098
1 2022-02 Alaska 301.0 0.0 2644.0 2178.0 348.0 202 0.0 0.0 ... 0.0 0.0 0.0 2.0 4.0 0.0 0.0 0.0 0.0 5916
2 2022-02 Arizona 2560.0 473.0 20150.0 9935.0 1690.0 1153 0.0 11.0 ... 1.0 0.0 0.0 15.0 13.0 0.0 0.0 2.0 0.0 38149
3 2022-02 Arkansas 1842.0 309.0 7780.0 5756.0 429.0 515 4.0 15.0 ... 0.0 0.0 0.0 5.0 11.0 1.0 0.0 0.0 0.0 19002
4 2022-02 California 15815.0 10550.0 36362.0 23017.0 4941.0 1 0.0 1.0 ... 183.0 0.0 0.0 7638.0 3090.0 626.0 19.0 20.0 0.0 106295
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
15395 1998-11 Virginia 0.0 NaN 14.0 2.0 NaN 8 0.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 24
15396 1998-11 Washington 1.0 NaN 65.0 286.0 NaN 8 1.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 361
15397 1998-11 West Virginia 3.0 NaN 149.0 251.0 NaN 5 0.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 408
15398 1998-11 Wisconsin 0.0 NaN 25.0 214.0 NaN 2 0.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 241
15399 1998-11 Wyoming 8.0 NaN 45.0 49.0 NaN 5 0.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 107

15400 rows × 27 columns

df['dt_month'] = pd.to_datetime(df['month']).dt.month
df['dt_year'] = pd.to_datetime(df['month']).dt.year

More Handguns than other Firearms

permit and permit_recheck

df2 = df[df['permit'].notna()]
df2 = df2[df2['permit_recheck'].notna()]

df2['more_handgun'] = df2['handgun'] > (df2['long_gun'] + df2['other'])

df2.shape
(4015, 30)
df2.columns
Index(['month', 'state', 'permit', 'permit_recheck', 'handgun', 'long_gun',
       'other', 'multiple', 'admin', 'prepawn_handgun', 'prepawn_long_gun',
       'prepawn_other', 'redemption_handgun', 'redemption_long_gun',
       'redemption_other', 'returned_handgun', 'returned_long_gun',
       'returned_other', 'rentals_handgun', 'rentals_long_gun',
       'private_sale_handgun', 'private_sale_long_gun', 'private_sale_other',
       'return_to_seller_handgun', 'return_to_seller_long_gun',
       'return_to_seller_other', 'totals', 'dt_month', 'dt_year',
       'more_handgun'],
      dtype='object')
df2
month state permit permit_recheck handgun long_gun other multiple admin prepawn_handgun ... private_sale_handgun private_sale_long_gun private_sale_other return_to_seller_handgun return_to_seller_long_gun return_to_seller_other totals dt_month dt_year more_handgun
0 2022-02 Alabama 25401.0 499.0 21822.0 14541.0 1351.0 1260 0.0 13.0 ... 28.0 29.0 2.0 1.0 0.0 0.0 69098 2 2022 True
1 2022-02 Alaska 301.0 0.0 2644.0 2178.0 348.0 202 0.0 0.0 ... 2.0 4.0 0.0 0.0 0.0 0.0 5916 2 2022 True
2 2022-02 Arizona 2560.0 473.0 20150.0 9935.0 1690.0 1153 0.0 11.0 ... 15.0 13.0 0.0 0.0 2.0 0.0 38149 2 2022 True
3 2022-02 Arkansas 1842.0 309.0 7780.0 5756.0 429.0 515 4.0 15.0 ... 5.0 11.0 1.0 0.0 0.0 0.0 19002 2 2022 True
4 2022-02 California 15815.0 10550.0 36362.0 23017.0 4941.0 1 0.0 1.0 ... 7638.0 3090.0 626.0 19.0 20.0 0.0 106295 2 2022 True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4010 2016-02 Virginia 784.0 0.0 30085.0 15948.0 1133.0 0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 47955 2 2016 True
4011 2016-02 Washington 15736.0 0.0 20583.0 11991.0 1832.0 863 1.0 3.0 ... 578.0 422.0 30.0 5.0 15.0 0.0 56043 2 2016 True
4012 2016-02 West Virginia 3527.0 0.0 10746.0 7436.0 357.0 757 5.0 6.0 ... 11.0 5.0 1.0 3.0 2.0 0.0 27216 2 2016 True
4013 2016-02 Wisconsin 9420.0 0.0 19465.0 12431.0 821.0 62 0.0 0.0 ... 5.0 15.0 0.0 0.0 0.0 0.0 42855 2 2016 True
4014 2016-02 Wyoming 551.0 0.0 2287.0 2036.0 139.0 150 0.0 3.0 ... 0.0 4.0 0.0 1.0 1.0 0.0 5703 2 2016 True

4015 rows × 30 columns

X_colnames = ['permit','permit_recheck']
y_colname = 'more_handgun'
X = df2.loc[:,X_colnames].copy()
y = df2.loc[:,y_colname].copy()
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

clf = KNeighborsClassifier(n_neighbors = 10)
clf.fit(X_scaled,y)
KNeighborsClassifier(n_neighbors=10)
X_scaled_train, X_scaled_test, y_train, y_test = train_test_split(X_scaled,y,test_size=0.4)

clf2 = KNeighborsClassifier(n_neighbors = 10)
clf2.fit(X_scaled_train,y_train)
KNeighborsClassifier(n_neighbors=10)
probs = clf2.predict_proba(X_scaled_test)
log_loss(y_test,probs)
0.7955703072201001
probs
array([[0.2, 0.8],
       [0.3, 0.7],
       [0.1, 0.9],
       ...,
       [0.2, 0.8],
       [0.4, 0.6],
       [0.1, 0.9]])
# predict_proba for every row of df2 so the probabilities line up with df2's index
df2['probsSer'] = clf2.predict_proba(X_scaled)[:, 1]

private_sale_handgun and private_sale_not_handgun

df3 = df[df['private_sale_handgun'].notna()]
df3 = df3[df3['handgun'].notna()]
df3 = df3[df3['long_gun'].notna()]
df3 = df3[df3['other'].notna()]

df3['more_handgun'] = df3['handgun'] > (df3['long_gun'] + df3['other'])

df3.shape
(5665, 30)
df3['private_sale_not_handgun'] = df3.loc[:,'private_sale_long_gun'] + df3.loc[:,'private_sale_other']
## X2_colnames = ['private_sale_handgun','private_sale_long_gun','private_sale_other']
X2_colnames = ['private_sale_handgun','private_sale_not_handgun']
y2_colname = 'more_handgun'
X2 = df3.loc[:,X2_colnames].copy()
y2 = df3.loc[:,y2_colname].copy()
scaler = StandardScaler()
scaler.fit(X2)
X2_scaled = scaler.transform(X2)

clf3 = KNeighborsClassifier(n_neighbors = 10)
clf3.fit(X2_scaled,y2)
KNeighborsClassifier(n_neighbors=10)
X2_scaled_train, X2_scaled_test, y2_train, y2_test = train_test_split(X2_scaled,y2,test_size=0.4)

clf4 = KNeighborsClassifier(n_neighbors = 10)
clf4.fit(X2_scaled_train,y2_train)
KNeighborsClassifier(n_neighbors=10)
probs2 = clf4.predict_proba(X2_scaled_test)
log_loss(y2_test,probs2)
0.8214485407353612
# predict_proba for every row of df3 so the probabilities line up with df3's index
df3['probs2Ser'] = clf4.predict_proba(X2_scaled)[:, 1]

State

handgun and long_gun

df4 = df[df['state'].notna()]
df4 = df4[df4['handgun'].notna()]
df4 = df4[df4['long_gun'].notna()]

df4.shape
(15380, 29)
X3_colnames = ['handgun','long_gun']
y3_colname = 'state'
X3 = df4.loc[:,X3_colnames].copy()
y3 = df4.loc[:,y3_colname].copy()
scaler = StandardScaler()
scaler.fit(X3)
X3_scaled = scaler.transform(X3)

clf5 = KNeighborsClassifier(n_neighbors = 5)
clf5.fit(X3_scaled,y3)
KNeighborsClassifier()
X3_scaled_train, X3_scaled_test, y3_train, y3_test = train_test_split(X3_scaled,y3,test_size=0.4)

clf6 = KNeighborsClassifier(n_neighbors = 5)
clf6.fit(X3_scaled_train,y3_train)
KNeighborsClassifier()
probs3 = clf6.predict_proba(X3_scaled_test)
log_loss(y3_test,probs3)
16.72825544530875

Graphing

permit and permit_recheck for predicting when there are more handguns

alt.data_transformers.disable_max_rows()
graph = alt.Chart(df2).mark_bar().encode(
    x = 'more_handgun',
    y = 'permit',
    color = 'more_handgun',
    tooltip = ['more_handgun']
)

graph2 = alt.Chart(df2).mark_bar().encode(
    x = 'more_handgun',
    y = 'permit_recheck',
    color = 'more_handgun',
    tooltip = ['more_handgun']
)

graph|graph2
alt.data_transformers.disable_max_rows()
graph3 = alt.Chart(df2).mark_bar().encode(
    x = 'more_handgun',
    y = 'permit',
    color = 'probsSer',
    tooltip = ['probsSer','more_handgun']
)

alt.data_transformers.disable_max_rows()
graph4 = alt.Chart(df2).mark_bar().encode(
    x = 'more_handgun',
    y = 'permit',
    color = 'probsSer',
    tooltip = ['probsSer','more_handgun']
)

graph3|graph4

private_sale_handgun and private_sale_not_handgun for predicting when there are more handguns

alt.data_transformers.disable_max_rows()
graph5 = alt.Chart(df3).mark_bar().encode(
    x = 'more_handgun',
    y = 'private_sale_handgun',
    color = 'more_handgun',
    tooltip = ['more_handgun']
)

graph6 = alt.Chart(df3).mark_bar().encode(
    x = 'more_handgun',
    y = 'private_sale_not_handgun',
    color = 'more_handgun',
    tooltip = ['more_handgun']
)

graph5|graph6
alt.data_transformers.disable_max_rows()
graph7 = alt.Chart(df3).mark_bar().encode(
    x = 'more_handgun',
    y = 'private_sale_handgun',
    color = 'probs2Ser',
    tooltip = ['probs2Ser','more_handgun']
)

alt.data_transformers.disable_max_rows()
graph8 = alt.Chart(df3).mark_bar().encode(
    x = 'more_handgun',
    y = 'private_sale_not_handgun',
    color = 'probs2Ser',
    tooltip = ['probs2Ser','more_handgun']
)

graph7|graph8

Accompanying Documentation

Importing Libraries and Data

To start, we will import all required libraries.

Next, we will import the nics-firearm-background-checks.csv file into the project to be analyzed, initializing it as a DataFrame df. We take note of the shape as well as the columns of df.

We will also create some datetime columns to be used later in the project.

More Handguns than other Firearms

From there, we will create a new DataFrame for each comparison between columns, since different comparisons require dropping different rows with missing (NA) values.
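As a side note, one compact way to do this NA filtering (a hypothetical helper, not part of the notebook above) is pandas' dropna with a subset argument, which drops rows missing any of several columns in a single step:

# Hypothetical helper: keep only rows where every listed column is non-missing
def subset_complete(frame, cols):
    return frame.dropna(subset=cols).copy()

# Reproduces the same rows as df2 above
df2_alt = subset_complete(df, ['permit', 'permit_recheck'])
df2_alt['more_handgun'] = df2_alt['handgun'] > (df2_alt['long_gun'] + df2_alt['other'])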

To start, we will look for columns that predict whether a month's background checks involve more handguns than long guns and other firearms combined. We use KNeighborsClassifier for this. My initial idea was to use the permit and permit_recheck counts, but the resulting log_loss was fairly high, at roughly 0.8, so these columns may not be the best predictors of when the background checks involve more handguns.
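Since log_loss is the main metric here, a quick way to sanity-check the choice of n_neighbors is to refit the classifier for a few values of k on the same split and compare. The sketch below is not part of the original notebook; it reuses X_scaled_train, X_scaled_test, y_train, and y_test from above, and the exact numbers will vary from run to run because train_test_split was not given a random_state.

# Compare held-out log_loss for several values of n_neighbors
for k in (5, 10, 20, 50):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_scaled_train, y_train)
    print(k, log_loss(y_test, knn.predict_proba(X_scaled_test)))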

Next, we try private_sale_handgun together with a new column we just made, private_sale_not_handgun. After running through the same KNeighborsClassifier workflow, we get a log_loss of approximately 0.82, which is comparable to the permit-based features.

State

To predict state, I went with handgun and long_gun as the feature columns.

When trying to predict the state from the background-check data, we run into a few problems. Namely, given how states’ populations tend to treat gun control and the stigma around owning firearms, it is highly likely that our model would mistake many states for one another, and the very large log_loss of roughly 16.7 reflects this. We also cannot simply replace one of the comparison columns with another, as they yield similar results.
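To put that log_loss of roughly 16.7 in perspective, a useful baseline (a sketch, assuming df4, y3_test, np, and log_loss are still in scope from above) is a classifier that assigns equal probability to every state, which scores ln(n_classes). Anything near or above that baseline is doing little better than guessing; with only 5 neighbors, KNeighborsClassifier also assigns probability 0 to many true states, which log_loss clips and penalizes heavily, pushing the score past the baseline.

# Baseline: uniform probability over all states/territories in the data
labels = sorted(df4['state'].unique())
uniform_probs = np.full((len(y3_test), len(labels)), 1 / len(labels))
print('number of classes:', len(labels))
print('uniform baseline log_loss:', log_loss(y3_test, uniform_probs, labels=labels))
# This equals np.log(len(labels)), roughly 4 for the 50-plus jurisdictions in the file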

Graphing

For graphing, I decided to show what each feature column looked like with respect to what we were trying to predict, using bar charts made with Altair.

Earlier in the project I also created the probsSer and probs2Ser columns to hold the predicted probability of more_handgun, so the bar charts can show more insightful information for each data point.
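For reference, a more aggregated version of the same idea (a sketch, assuming df2 and its probsSer column from earlier) is to plot the mean permit count per more_handgun group and color the bars by the mean predicted probability:

# Aggregated bar chart: mean permit per group, colored by mean predicted probability
alt.Chart(df2).mark_bar().encode(
    x = 'more_handgun:N',
    y = 'mean(permit):Q',
    color = 'mean(probsSer):Q',
    tooltip = ['more_handgun:N', 'mean(probsSer):Q']
)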

Summary

Through this project, I unfortunately found that making predictions from this data is much more difficult than I initially thought. The log_loss was the biggest indicator of whether the machine learning results were useful, and the numbers stayed uncomfortably high even after adjusting the test size and n_neighbors significantly (within reason for the size of the data). However, plotting probsSer directly against the data told me a lot about how well the classifier actually works.