Analyses on NICS Firearm Background Checks¶
Nathan Samarasena
Course Project, UC Irvine, Math 10, W22
Introduction¶
For my project, I will be doing analyses on the 'nics-firearm-background-checks.csv' file from a BuzzFeed GitHub repository.
I will explore how well different columns in the data set predict certain aspects of the background checks, and whether some combinations of columns work better than others.
Each row of the data set summarizes the background checks conducted in one state during one month.
Main portion of the project¶
Importing Libraries and Data¶
import seaborn as sns
import numpy as np
import pandas as pd
import altair as alt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss
df = pd.read_csv('nics-firearm-background-checks.csv')
df.shape
(15400, 27)
df.columns
Index(['month', 'state', 'permit', 'permit_recheck', 'handgun', 'long_gun',
'other', 'multiple', 'admin', 'prepawn_handgun', 'prepawn_long_gun',
'prepawn_other', 'redemption_handgun', 'redemption_long_gun',
'redemption_other', 'returned_handgun', 'returned_long_gun',
'returned_other', 'rentals_handgun', 'rentals_long_gun',
'private_sale_handgun', 'private_sale_long_gun', 'private_sale_other',
'return_to_seller_handgun', 'return_to_seller_long_gun',
'return_to_seller_other', 'totals'],
dtype='object')
df
month | state | permit | permit_recheck | handgun | long_gun | other | multiple | admin | prepawn_handgun | ... | returned_other | rentals_handgun | rentals_long_gun | private_sale_handgun | private_sale_long_gun | private_sale_other | return_to_seller_handgun | return_to_seller_long_gun | return_to_seller_other | totals | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-02 | Alabama | 25401.0 | 499.0 | 21822.0 | 14541.0 | 1351.0 | 1260 | 0.0 | 13.0 | ... | 0.0 | 0.0 | 0.0 | 28.0 | 29.0 | 2.0 | 1.0 | 0.0 | 0.0 | 69098 |
1 | 2022-02 | Alaska | 301.0 | 0.0 | 2644.0 | 2178.0 | 348.0 | 202 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5916 |
2 | 2022-02 | Arizona | 2560.0 | 473.0 | 20150.0 | 9935.0 | 1690.0 | 1153 | 0.0 | 11.0 | ... | 1.0 | 0.0 | 0.0 | 15.0 | 13.0 | 0.0 | 0.0 | 2.0 | 0.0 | 38149 |
3 | 2022-02 | Arkansas | 1842.0 | 309.0 | 7780.0 | 5756.0 | 429.0 | 515 | 4.0 | 15.0 | ... | 0.0 | 0.0 | 0.0 | 5.0 | 11.0 | 1.0 | 0.0 | 0.0 | 0.0 | 19002 |
4 | 2022-02 | California | 15815.0 | 10550.0 | 36362.0 | 23017.0 | 4941.0 | 1 | 0.0 | 1.0 | ... | 183.0 | 0.0 | 0.0 | 7638.0 | 3090.0 | 626.0 | 19.0 | 20.0 | 0.0 | 106295 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
15395 | 1998-11 | Virginia | 0.0 | NaN | 14.0 | 2.0 | NaN | 8 | 0.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 24 |
15396 | 1998-11 | Washington | 1.0 | NaN | 65.0 | 286.0 | NaN | 8 | 1.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 361 |
15397 | 1998-11 | West Virginia | 3.0 | NaN | 149.0 | 251.0 | NaN | 5 | 0.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 408 |
15398 | 1998-11 | Wisconsin | 0.0 | NaN | 25.0 | 214.0 | NaN | 2 | 0.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 241 |
15399 | 1998-11 | Wyoming | 8.0 | NaN | 45.0 | 49.0 | NaN | 5 | 0.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 107 |
15400 rows × 27 columns
# parse the 'month' strings (formatted like '2022-02') into separate month and year columns
df['dt_month'] = pd.to_datetime(df['month']).dt.month
df['dt_year'] = pd.to_datetime(df['month']).dt.year
More Handguns than other Firearms¶
permit and permit_recheck¶
df2 = df[df['permit'].notna()]
df2 = df2[df2['permit_recheck'].notna()].copy()
df2['more_handgun'] = df2['handgun'] > (df2['long_gun'] + df2['other'])
df2.shape
(4015, 30)
df2.columns
Index(['month', 'state', 'permit', 'permit_recheck', 'handgun', 'long_gun',
'other', 'multiple', 'admin', 'prepawn_handgun', 'prepawn_long_gun',
'prepawn_other', 'redemption_handgun', 'redemption_long_gun',
'redemption_other', 'returned_handgun', 'returned_long_gun',
'returned_other', 'rentals_handgun', 'rentals_long_gun',
'private_sale_handgun', 'private_sale_long_gun', 'private_sale_other',
'return_to_seller_handgun', 'return_to_seller_long_gun',
'return_to_seller_other', 'totals', 'dt_month', 'dt_year',
'more_handgun'],
dtype='object')
df2
month | state | permit | permit_recheck | handgun | long_gun | other | multiple | admin | prepawn_handgun | ... | private_sale_handgun | private_sale_long_gun | private_sale_other | return_to_seller_handgun | return_to_seller_long_gun | return_to_seller_other | totals | dt_month | dt_year | more_handgun | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-02 | Alabama | 25401.0 | 499.0 | 21822.0 | 14541.0 | 1351.0 | 1260 | 0.0 | 13.0 | ... | 28.0 | 29.0 | 2.0 | 1.0 | 0.0 | 0.0 | 69098 | 2 | 2022 | True |
1 | 2022-02 | Alaska | 301.0 | 0.0 | 2644.0 | 2178.0 | 348.0 | 202 | 0.0 | 0.0 | ... | 2.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5916 | 2 | 2022 | True |
2 | 2022-02 | Arizona | 2560.0 | 473.0 | 20150.0 | 9935.0 | 1690.0 | 1153 | 0.0 | 11.0 | ... | 15.0 | 13.0 | 0.0 | 0.0 | 2.0 | 0.0 | 38149 | 2 | 2022 | True |
3 | 2022-02 | Arkansas | 1842.0 | 309.0 | 7780.0 | 5756.0 | 429.0 | 515 | 4.0 | 15.0 | ... | 5.0 | 11.0 | 1.0 | 0.0 | 0.0 | 0.0 | 19002 | 2 | 2022 | True |
4 | 2022-02 | California | 15815.0 | 10550.0 | 36362.0 | 23017.0 | 4941.0 | 1 | 0.0 | 1.0 | ... | 7638.0 | 3090.0 | 626.0 | 19.0 | 20.0 | 0.0 | 106295 | 2 | 2022 | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4010 | 2016-02 | Virginia | 784.0 | 0.0 | 30085.0 | 15948.0 | 1133.0 | 0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 47955 | 2 | 2016 | True |
4011 | 2016-02 | Washington | 15736.0 | 0.0 | 20583.0 | 11991.0 | 1832.0 | 863 | 1.0 | 3.0 | ... | 578.0 | 422.0 | 30.0 | 5.0 | 15.0 | 0.0 | 56043 | 2 | 2016 | True |
4012 | 2016-02 | West Virginia | 3527.0 | 0.0 | 10746.0 | 7436.0 | 357.0 | 757 | 5.0 | 6.0 | ... | 11.0 | 5.0 | 1.0 | 3.0 | 2.0 | 0.0 | 27216 | 2 | 2016 | True |
4013 | 2016-02 | Wisconsin | 9420.0 | 0.0 | 19465.0 | 12431.0 | 821.0 | 62 | 0.0 | 0.0 | ... | 5.0 | 15.0 | 0.0 | 0.0 | 0.0 | 0.0 | 42855 | 2 | 2016 | True |
4014 | 2016-02 | Wyoming | 551.0 | 0.0 | 2287.0 | 2036.0 | 139.0 | 150 | 0.0 | 3.0 | ... | 0.0 | 4.0 | 0.0 | 1.0 | 1.0 | 0.0 | 5703 | 2 | 2016 | True |
4015 rows × 30 columns
X_colnames = ['permit','permit_recheck']
y_colname = 'more_handgun'
X = df2.loc[:,X_colnames].copy()
y = df2.loc[:,y_colname].copy()
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
clf = KNeighborsClassifier(n_neighbors = 10)
clf.fit(X_scaled,y)
KNeighborsClassifier(n_neighbors=10)
X_scaled_train, X_scaled_test, y_train, y_test = train_test_split(X_scaled,y,test_size=0.4)
clf2 = KNeighborsClassifier(n_neighbors = 10)
clf2.fit(X_scaled_train,y_train)
KNeighborsClassifier(n_neighbors=10)
probs = clf2.predict_proba(X_scaled_test)
log_loss(y_test,probs)
0.7955703072201001
probs
array([[0.2, 0.8],
[0.3, 0.7],
[0.1, 0.9],
...,
[0.2, 0.8],
[0.4, 0.6],
[0.1, 0.9]])
# probs only covers the test rows, so align them with the test-set index
df2['probsSer'] = pd.Series(probs[:,1], index=y_test.index)
private_sale_handgun and private_sale_not_handgun¶
df3 = df[df['private_sale_handgun'].notna()]
df3 = df3[df3['handgun'].notna()]
df3 = df3[df3['long_gun'].notna()]
df3 = df3[df3['other'].notna()].copy()
df3['more_handgun'] = df3['handgun'] > (df3['long_gun'] + df3['other'])
df3.shape
(5665, 30)
# build the comparison column from df3 itself (df2 covers a different set of rows)
df3['private_sale_not_handgun'] = df3.loc[:,'private_sale_long_gun'] + df3.loc[:,'private_sale_other']
## X2_colnames = ['private_sale_handgun','private_sale_long_gun','private_sale_other']
X2_colnames = ['private_sale_handgun','private_sale_not_handgun']
y2_colname = 'more_handgun'
X2 = df3.loc[:,X2_colnames].copy()
y2 = df3.loc[:,y2_colname].copy()
scaler = StandardScaler()
scaler.fit(X2)
X2_scaled = scaler.transform(X2)
clf3 = KNeighborsClassifier(n_neighbors = 10)
clf3.fit(X2_scaled,y2)
KNeighborsClassifier(n_neighbors=10)
X2_scaled_train, X2_scaled_test, y2_train, y2_test = train_test_split(X2_scaled,y2,test_size=0.4)
clf4 = KNeighborsClassifier(n_neighbors = 10)
clf4.fit(X2_scaled_train,y2_train)
KNeighborsClassifier(n_neighbors=10)
probs2 = clf4.predict_proba(X2_scaled_test)
log_loss(y2_test,probs2)
0.8214485407353612
# probs2 only covers the test rows, so align them with the test-set index
df3['probs2Ser'] = pd.Series(probs2[:,1], index=y2_test.index)
State¶
handgun and long_gun¶
df4 = df[df['state'].notna()]
df4 = df4[df4['handgun'].notna()]
df4 = df4[df4['long_gun'].notna()]
df4.shape
(15380, 29)
X3_colnames = ['handgun','long_gun']
y3_colname = 'state'
X3 = df4.loc[:,X3_colnames].copy()
y3 = df4.loc[:,y3_colname].copy()
scaler = StandardScaler()
scaler.fit(X3)
X3_scaled = scaler.transform(X3)
clf5 = KNeighborsClassifier(n_neighbors = 5)
clf5.fit(X3_scaled,y3)
KNeighborsClassifier()
X3_scaled_train, X3_scaled_test, y3_train, y3_test = train_test_split(X3_scaled,y3,test_size=0.4)
clf6 = KNeighborsClassifier(n_neighbors = 5)
clf6.fit(X3_scaled_train,y3_train)
KNeighborsClassifier()
probs3 = clf6.predict_proba(X3_scaled_test)
log_loss(y3_test,probs3)
16.72825544530875
Graphing¶
permit and permit_recheck for predicting when there are more handguns¶
alt.data_transformers.disable_max_rows()
graph = alt.Chart(df2).mark_bar().encode(
x = 'more_handgun',
y = 'permit',
color = 'more_handgun',
tooltip = ['more_handgun']
)
graph2 = alt.Chart(df2).mark_bar().encode(
x = 'more_handgun',
y = 'permit_recheck',
color = 'more_handgun',
tooltip = ['more_handgun']
)
graph|graph2
alt.data_transformers.disable_max_rows()
graph3 = alt.Chart(df2).mark_bar().encode(
x = 'more_handgun',
y = 'permit',
color = 'probsSer',
tooltip = ['probsSer','more_handgun']
)
alt.data_transformers.disable_max_rows()
graph4 = alt.Chart(df2).mark_bar().encode(
x = 'more_handgun',
y = 'permit_recheck',
color = 'probsSer',
tooltip = ['probsSer','more_handgun']
)
graph3|graph4
private_sale_handgun and private_sale_not_handgun for predicting when there are more handguns¶
alt.data_transformers.disable_max_rows()
graph5 = alt.Chart(df3).mark_bar().encode(
x = 'more_handgun',
y = 'private_sale_handgun',
color = 'more_handgun',
tooltip = ['more_handgun']
)
graph6 = alt.Chart(df3).mark_bar().encode(
x = 'more_handgun',
y = 'private_sale_not_handgun',
color = 'more_handgun',
tooltip = ['more_handgun']
)
graph5|graph6
alt.data_transformers.disable_max_rows()
graph7 = alt.Chart(df3).mark_bar().encode(
x = 'more_handgun',
y = 'private_sale_handgun',
color = 'probs2Ser',
tooltip = ['probs2Ser','more_handgun']
)
alt.data_transformers.disable_max_rows()
graph8 = alt.Chart(df3).mark_bar().encode(
x = 'more_handgun',
y = 'private_sale_not_handgun',
color = 'probs2Ser',
tooltip = ['probs2Ser','more_handgun']
)
graph7|graph8
Accompanying Documentation¶
Importing Libraries and Data¶
To start, we will import all required libraries.
Next, we read the nics-firearm-background-checks.csv file into a DataFrame named df and take note of its shape and columns.
We also create dt_month and dt_year columns from the month strings, to be used later in the project.
More Handguns than other Firearms¶
From there, we create a new DataFrame for each comparison between columns, since different comparisons require dropping different rows with missing (NA) values; an equivalent one-step filter is sketched below.
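For reference, pandas' dropna with a subset argument does the same filtering in one step. This is only an equivalent sketch of how df2 was built above (df2_alt is an illustrative name, not part of the notebook), not a change to the analysis.
# equivalent to the two notna() filters used to build df2 above
df2_alt = df.dropna(subset=['permit', 'permit_recheck']).copy()
df2_alt['more_handgun'] = df2_alt['handgun'] > (df2_alt['long_gun'] + df2_alt['other'])
df2_alt.shape   # should match df2.shape, i.e. (4015, 30)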
To start, we look for features that predict whether a month's background checks involve more handguns than long guns and other firearms combined, using KNeighborsClassifier. My initial idea was to use the permit and permit_recheck counts, but the resulting log_loss was fairly high, coming out around 0.8 on the test set, so these columns may not be the best predictors of when the checks involve more handguns.
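To put that number in context, one yardstick (not part of the original notebook, just a suggestion) is the log_loss of a model that ignores the features and always predicts the training-set class frequencies; scikit-learn's DummyClassifier provides that baseline.
# hypothetical baseline: always predict the training-set class frequencies
from sklearn.dummy import DummyClassifier
baseline = DummyClassifier(strategy='prior')
baseline.fit(X_scaled_train, y_train)
baseline_probs = baseline.predict_proba(X_scaled_test)
log_loss(y_test, baseline_probs)  # if the KNN log_loss is not much lower than this, the features add little
If the KNN model's log_loss sits close to this baseline, permit and permit_recheck carry little signal about the more_handgun label.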
Next we try private_sale_handgun together with the private_sale_not_handgun column that we built above. Running the same KNeighborsClassifier pipeline, the log_loss in the run shown above comes out around 0.82, which is comparable to the permit-based model rather than clearly better; because train_test_split is not seeded, the exact value changes from run to run.
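Since the split is random, fixing the random_state argument of train_test_split (shown here only as a suggestion, not part of the original notebook) would make the reported log_loss reproducible between runs.
# suggested tweak: fix the seed so the same train/test split, and hence the same log_loss, is produced each run
X2_scaled_train, X2_scaled_test, y2_train, y2_test = train_test_split(
    X2_scaled, y2, test_size=0.4, random_state=0)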
State¶
To predict state, I went with the handgun and long_gun columns as features.
When trying to predict the state from the background-check data, we run into a few problems. Because many states' populations treat gun ownership and gun control similarly, their monthly handgun and long gun counts look alike, so the classifier is very likely to mistake one state for another; the very high log_loss of roughly 16.7 above reflects this. Swapping one of these count columns for another similar column does not help, since it yields much the same result.
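One quick way to see how often states are confused (a check added here as a suggestion, not in the original notebook) is the test-set accuracy of the fitted classifier.
# fraction of test rows whose state is predicted correctly; a low value means frequent confusion between states
clf6.score(X3_scaled_test, y3_test)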
Graphing¶
For graphing, I decided to show what each feature column looked like with respect to what we were trying to predict, using bar charts made with Altair.
Earlier in the project I also created the probsSer and probs2Ser columns, which hold the predicted probability that a row has more handguns, so the bar charts can be colored by that probability and show more insightful information for each data point.
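The charts above let Altair infer the encoding types; the variant below (only a suggested sketch, not a replacement for the charts in the notebook) spells the types out and aggregates with a mean, which can be easier to read than stacked raw bars.
# suggested variant: explicit encoding types and a mean aggregate
alt.Chart(df2).mark_bar().encode(
    x = 'more_handgun:N',
    y = 'mean(permit):Q',
    color = 'more_handgun:N',
    tooltip = ['more_handgun:N', 'mean(permit):Q']
)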
Summary¶
Through this project, I unfortunately found that making predictions from this data is much more difficult than I initially thought. The log_loss was the biggest indicator of whether the insight provided by machine learning was useful, and the numbers stayed uncomfortably high even after adjusting the test size and n_neighbors significantly (within reason for the size of the data). However, the probsSer column told me a lot about how well the classifier works when compared directly against the data.
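As a rough illustration of that tuning (a sketch of the kind of comparison described above, not the exact experiments run for the project), one can loop over a few n_neighbors values and compare the resulting log_loss on the held-out permit-based split.
# sketch: compare log_loss for several values of n_neighbors on the permit-based features
for k in (5, 10, 20, 50):
    clf_k = KNeighborsClassifier(n_neighbors=k)
    clf_k.fit(X_scaled_train, y_train)
    print(k, log_loss(y_test, clf_k.predict_proba(X_scaled_test)))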