Exploring Recidivism in Iowa

Author: Nicole Vuong

Course Project, UC Irvine, Math 10, S23

Introduction

Recidivism is an important factor in understanding the long-term outcomes of offenders and the effectiveness of our criminal justice system. In this project, I build models to predict recidivism among people released from Iowa prisons, given factors such as age, the crime committed, and race/ethnicity. (Recidivism was tracked over a three-year period following an individual’s release. Releases occurred in fiscal years 2010 through 2015, so the data span 2010-2018.) I also explore the data visually using Altair, for instance to see which types of crimes are repeated the most.

import pandas as pd
from pandas.api.types import is_numeric_dtype
import altair as alt
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

Preliminary exploration/cleaning of the data

First, I import the Iowa Recidivism data set. It contains both people who reoffended and people who did not reoffend within three years of their release. The data set was obtained from: https://www.kaggle.com/datasets/slonnadube/recidivism-for-offenders-released-from-prison?select=3-Year_Recidivism_for_Offenders_Released_from_Prison_in_Iowa_elaborated.csv

df = pd.read_csv("IowaRecidivism.csv")
df.head()
|   | Fiscal Year Released | Recidivism Reporting Year | Race - Ethnicity | Age At Release | Convicting Offense Classification | Convicting Offense Type | Convicting Offense Subtype | Main Supervising District | Release Type | Release type: Paroled to Detainder united | Part of Target Population | Recidivism - Return to Prison numeric |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010 | 2013 | White - Non-Hispanic | Under 25 | D Felony | Violent | Assault | 4JD | Parole | Parole | Yes | 1 |
| 1 | 2010 | 2013 | White - Non-Hispanic | 55 and Older | D Felony | Public Order | OWI | 7JD | Parole | Parole | Yes | 1 |
| 2 | 2010 | 2013 | White - Non-Hispanic | 25-34 | D Felony | Property | Burglary | 5JD | Parole | Parole | Yes | 1 |
| 3 | 2010 | 2013 | White - Non-Hispanic | 55 and Older | C Felony | Drug | Trafficking | 8JD | Parole | Parole | Yes | 1 |
| 4 | 2010 | 2013 | Black - Non-Hispanic | 25-34 | D Felony | Drug | Trafficking | 3JD | Parole | Parole | Yes | 1 |
df.columns
Index(['Fiscal Year Released', 'Recidivism Reporting Year', 'Race - Ethnicity',
       'Age At Release ', 'Convicting Offense Classification',
       'Convicting Offense Type', 'Convicting Offense Subtype',
       'Main Supervising District', 'Release Type',
       'Release type: Paroled to Detainder united',
       'Part of Target Population', 'Recidivism - Return to Prison numeric'],
      dtype='object')

For convenience later, I shorten the name of the “Recidivism - Return to Prison numeric” column to “Recidivism” and “Age At Release ” to “Age”.

df = df.rename(columns={"Recidivism - Return to Prison numeric": "Recidivism", "Age At Release ": "Age"})

Now we check if there are any missing values:

df.isna().sum(axis=0)
Fiscal Year Released                            0
Recidivism Reporting Year                       0
Race - Ethnicity                               30
Age                                             3
Convicting Offense Classification               0
Convicting Offense Type                         0
Convicting Offense Subtype                      0
Main Supervising District                    9581
Release Type                                 1762
Release type: Paroled to Detainder united    1762
Part of Target Population                       0
Recidivism                                      0
dtype: int64

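Since some columns contain missing values, we drop every row that has at least one. The largest source of missing values is “Main Supervising District” (9,581 rows), a column that is not used in the models below.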

df = df.dropna(axis=0)
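
Calling dropna on the whole dataframe also discards rows whose only gap is in a column the models never use. A possible alternative, which is not what this project does, is to restrict the check to the columns of interest with the subset parameter. The sketch below uses a made-up three-row frame (toy, dropped_all, and dropped_subset are illustrative names) to show the difference.

```python
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "Race - Ethnicity": ["White - Non-Hispanic", None, "Black - Non-Hispanic"],
    "Age": ["25-34", "Under 25", "35-44"],
    "Main Supervising District": ["4JD", "7JD", None],
})

# Dropping on every column removes any row with a gap anywhere
dropped_all = toy.dropna(axis=0)

# Dropping only on the columns used later keeps the row whose
# sole gap is in "Main Supervising District"
dropped_subset = toy.dropna(axis=0, subset=["Race - Ethnicity", "Age"])

print(len(dropped_all), len(dropped_subset))
```

All counts and scores reported in this write-up come from the full dropna call, not from this variant.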

As I am interested in using the “Race - Ethnicity” column for later classification predictions, I want to check out what the different values of this column are.

df["Race - Ethnicity"].unique()
array(['White - Non-Hispanic', 'Black - Non-Hispanic', 'White - Hispanic',
       'American Indian or Alaska Native - Non-Hispanic', 'White -',
       'Asian or Pacific Islander - Non-Hispanic', 'Black - Hispanic',
       'American Indian or Alaska Native - Hispanic', 'Black -',
       'Asian or Pacific Islander - Hispanic'], dtype=object)
df["Race - Ethnicity"].value_counts()
White - Non-Hispanic                               11537
Black - Non-Hispanic                                3717
White - Hispanic                                     789
American Indian or Alaska Native - Non-Hispanic      242
Asian or Pacific Islander - Non-Hispanic             112
Black - Hispanic                                      23
American Indian or Alaska Native - Hispanic           13
Asian or Pacific Islander - Hispanic                   2
Black -                                                2
White -                                                1
Name: Race - Ethnicity, dtype: int64

I want each row to have both a race and an ethnicity, so I drop rows whose “Race - Ethnicity” value is “Black -” or “White -” (the ethnicity is missing).

df = df[~((df["Race - Ethnicity"] == "Black -") | (df["Race - Ethnicity"] == "White -"))]

Visualizing the data

I now want to display the proportion of people who reoffended in each category of the following columns: race-ethnicity, age, and convicting offense subtype. However, since df has more than 5000 rows (Altair’s default row limit), I first build a “mini” dataframe for each column, then plot each “mini” dataframe as a bar graph.

The total_chart line is adapted from Worksheet 7 from Math 10 class, and I referenced the following link for the aggregating in the make_chart definition: https://altair-viz.github.io/user_guide/transform/aggregate.html
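
As a small illustration of the “mini” dataframe idea, here is a sketch on made-up data (toy and mini are illustrative names). Because the recidivism column is coded 0/1, its mean within a group is the proportion of people in that group who returned to prison, and the aggregated frame has one row per category rather than one row per person.

```python
import pandas as pd

# Toy data: 0/1 recidivism flags for two age groups (made-up values)
toy = pd.DataFrame({
    "Age": ["Under 25", "Under 25", "25-34", "25-34", "25-34"],
    "Recidivism": [1, 0, 1, 0, 0],
})

# One row per category: the mean of a 0/1 column is the proportion of 1s
mini = toy.groupby("Age")["Recidivism"].mean().reset_index()
print(mini)
```

The make_df function below reaches the same result by turning the grouped Series into a frame and copying the index into a column.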

columns = ["Race - Ethnicity", "Age", "Convicting Offense Subtype"]
def make_df(c):
    x = df.groupby(c)["Recidivism"].mean()
    df1 = x.to_frame()
    df1[c] = df1.index
    return df1

def make_chart(df1, c):
    graph = alt.Chart(df1).mark_bar().encode(
        x = c,
        y = "mean(Recidivism)",
        color = "mean(Recidivism):Q"
    )
    return graph

chart_list = []

for i in columns:
    df1 = make_df(i)
    chart = make_chart(df1, i)
    chart_list.append(chart)

total_chart = alt.vconcat(*chart_list)
total_chart
for i in columns: 
    print(df[i].value_counts())
    print("")
White - Non-Hispanic                               11537
Black - Non-Hispanic                                3717
White - Hispanic                                     789
American Indian or Alaska Native - Non-Hispanic      242
Asian or Pacific Islander - Non-Hispanic             112
Black - Hispanic                                      23
American Indian or Alaska Native - Hispanic           13
Asian or Pacific Islander - Hispanic                   2
Name: Race - Ethnicity, dtype: int64

25-34           6093
35-44           4102
45-54           2800
Under 25        2675
55 and Older     765
Name: Age, dtype: int64

Trafficking                        5188
Burglary                           1811
Theft                              1563
Assault                            1379
OWI                                1074
Other Criminal                      891
Forgery/Fraud                       763
Drug Possession                     760
Sex                                 705
Other Violent                       372
Traffic                             317
Murder/Manslaughter                 285
Robbery                             225
Weapons                             219
Other Drug                          180
Vandalism                           166
Other Public Order                  125
Alcohol                             122
Arson                               105
Special Sentence Revocation          53
Flight/Escape                        48
Sex Offender Registry/Residency      35
Kidnap                               31
Prostitution/Pimping                 13
Stolen Property                       3
Animals                               2
Name: Convicting Offense Subtype, dtype: int64
  • I caution against taking the race-ethnicity bar graph above at face value. For groups with large counts (such as White - Non-Hispanic, with over 11,000 people), the recidivism rate is more likely to reflect a larger population, such as prisoners in the Midwest. For a group like Asian or Pacific Islander - Hispanic, with only 2 people, the 50% recidivism rate reported by the bar chart is probably not representative of a larger population.

  • Interestingly, there appears to be a negative relationship between age group and recidivism rate: as age increases, the proportion of individuals who reoffend decreases. Each age group also has a good number of people (765 or more), so we can accept these recidivism rates with more certainty than, say, the rates for some of the smaller race-ethnicity groups.

  • No one whose original offense fell under the “Animals” subtype reoffended. Again, however, only two people were convicted of an “Animals” offense, so this result may not be representative of a larger population. More notably, the “Murder/Manslaughter” subtype has one of the lowest recidivism rates at ~17-18%, with a decently sized group of 285 individuals.
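
To put a rough number on the small-sample caution in the bullets above, one option (not part of the original analysis) is a Wilson score interval for a proportion. The sketch below compares 1 reoffender out of 2 people, which is the 50% rate mentioned above, with a hypothetical group of 10,000 people at the same rate. The helper name wilson_interval and the large-group figures are assumptions made for illustration.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# 1 reoffender out of 2 people: the 50% rate from the bar chart
small_lo, small_hi = wilson_interval(1, 2)

# Hypothetical large group with the same 50% rate, for comparison only
large_lo, large_hi = wilson_interval(5000, 10000)

print(f"n=2:     {small_lo:.2f} to {small_hi:.2f}")
print(f"n=10000: {large_lo:.2f} to {large_hi:.2f}")
```

The interval for the two-person group covers most of the range from 0 to 1, while the interval for the large group is narrow, which is the sense in which the small-group rates are less trustworthy.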

Predicting recidivism using K-nearest neighbors classifier

I want to predict whether a person returns to prison within three years of their release based on race-ethnicity, age, and convicting offense subtype. As these columns contain strings, we must first apply one-hot encoding to obtain a numerical form of this information. The following code using OneHotEncoder is adapted from: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

columns = ["Race - Ethnicity", "Age", "Convicting Offense Subtype"]
for col in columns:
    ohe = OneHotEncoder()
    # fit_transform returns a sparse matrix, so convert it to a dense array
    arr = ohe.fit_transform(df[[col]]).toarray()
    # one new column per category, named like "Age_25-34"
    add_list = list(ohe.get_feature_names_out())
    df[add_list] = arr
df.columns
Index(['Fiscal Year Released', 'Recidivism Reporting Year', 'Race - Ethnicity',
       'Age', 'Convicting Offense Classification', 'Convicting Offense Type',
       'Convicting Offense Subtype', 'Main Supervising District',
       'Release Type', 'Release type: Paroled to Detainder united',
       'Part of Target Population', 'Recidivism',
       'Race - Ethnicity_American Indian or Alaska Native - Hispanic',
       'Race - Ethnicity_American Indian or Alaska Native - Non-Hispanic',
       'Race - Ethnicity_Asian or Pacific Islander - Hispanic',
       'Race - Ethnicity_Asian or Pacific Islander - Non-Hispanic',
       'Race - Ethnicity_Black - Hispanic',
       'Race - Ethnicity_Black - Non-Hispanic',
       'Race - Ethnicity_White - Hispanic',
       'Race - Ethnicity_White - Non-Hispanic', 'Age_25-34', 'Age_35-44',
       'Age_45-54', 'Age_55 and Older', 'Age_Under 25',
       'Convicting Offense Subtype_Alcohol',
       'Convicting Offense Subtype_Animals',
       'Convicting Offense Subtype_Arson',
       'Convicting Offense Subtype_Assault',
       'Convicting Offense Subtype_Burglary',
       'Convicting Offense Subtype_Drug Possession',
       'Convicting Offense Subtype_Flight/Escape',
       'Convicting Offense Subtype_Forgery/Fraud',
       'Convicting Offense Subtype_Kidnap',
       'Convicting Offense Subtype_Murder/Manslaughter',
       'Convicting Offense Subtype_OWI',
       'Convicting Offense Subtype_Other Criminal',
       'Convicting Offense Subtype_Other Drug',
       'Convicting Offense Subtype_Other Public Order',
       'Convicting Offense Subtype_Other Violent',
       'Convicting Offense Subtype_Prostitution/Pimping',
       'Convicting Offense Subtype_Robbery', 'Convicting Offense Subtype_Sex',
       'Convicting Offense Subtype_Sex Offender Registry/Residency',
       'Convicting Offense Subtype_Special Sentence Revocation',
       'Convicting Offense Subtype_Stolen Property',
       'Convicting Offense Subtype_Theft',
       'Convicting Offense Subtype_Traffic',
       'Convicting Offense Subtype_Trafficking',
       'Convicting Offense Subtype_Vandalism',
       'Convicting Offense Subtype_Weapons'],
      dtype='object')

We now extract the one-hot encoded columns corresponding to race-ethnicity, age, and convicting offense subtype. These are all the columns after the first 12.

features = df.columns[12:]

We should make sure our trained model is not overfitting the data. So, we first divide the data into a training set and a test set. Then, we fit a K-nearest neighbors classifier to the training set, and compare the scores (accuracy of prediction) attained from the training set and the test set.

X_train, X_test, y_train, y_test = train_test_split(df[features], df["Recidivism"], train_size=0.6, random_state=0)
clf = KNeighborsClassifier(n_neighbors=20)
clf.fit(X_train, y_train)
clf.score(X_train, y_train)
0.6121083054456952
clf.score(X_test, y_test)
0.5897474901125647

The training score is within about 2 percentage points of the test score, so the model does not appear to be overfitting. However, neither score is high: both are only modestly above 0.5, the expected score for a coin flip on a balanced two-class problem. This suggests that it is difficult to predict whether someone returns to prison based solely on race-ethnicity, age, and the crime committed, at least with this classifier.
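
When the two classes are not equally common, a more demanding reference point than 0.5 is the accuracy of always predicting the more frequent class. The project does not compute this, so the sketch below only shows how it could be done with scikit-learn's DummyClassifier. The labels are made up (7 zeros and 3 ones) and do not reflect the real class balance.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 7 people who did not return, 3 who did
y_toy = np.array([0] * 7 + [1] * 3)
# The dummy model ignores the features, so a placeholder column is enough
X_toy = np.zeros((len(y_toy), 1))

# Always predicts whichever label was most common during fitting
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_score = baseline.score(X_toy, y_toy)
print(baseline_score)
```

On the real data, the same object could be fit on X_train and y_train and scored on X_test and y_test for a direct comparison with the K-nearest neighbors scores.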

Visualizing data from the reoffending-only dataframe

Below, we investigate a second data set from the same Kaggle page: https://www.kaggle.com/datasets/slonnadube/recidivism-for-offenders-released-from-prison?select=prison_recidivists_with_recidivism_type_only.csv. It contains only the rows of the dataframe above that correspond to people who did reoffend, along with additional columns such as the number of days before someone returned to prison. (These additional columns are not provided in the dataframe above. I assume that an original dataframe covering all people, reoffending or not, would have included these columns with empty values for those who did not reoffend. I tried to find the original data set, but the site from which the Kaggle data set was taken is now down.)

df_new = pd.read_csv("ReoffendingOnly.csv")
df_new.head()
|   | Fiscal Year Released | Recidivism Reporting Year | Race - Ethnicity | Age At Release | Convicting Offense Classification | Convicting Offense Type | Convicting Offense Subtype | Release Type | Part of Target Population | Main Supervising District | Recidivism - Return to Prison | Release Type.1 | Days to Recidivism | New Conviction Offense Classification | New Conviction Offense Type | New Conviction Offense Sub Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010 | 2013 | White - Non-Hispanic | 45-54 | Aggravated Misdemeanor | Public Order | Sex Offender Registry/Residency | Parole | Yes | 3JD | Yes | Parole | 28 | D Felony | Property | Theft |
| 1 | 2010 | 2013 | White - Non-Hispanic | Under 25 | Aggravated Misdemeanor | Property | Theft | Discharged End of Sentence | No | NaN | Yes | Discharged End of Sentence | 49 | C Felony | Property | Theft |
| 2 | 2010 | 2013 | White - Non-Hispanic | Under 25 | D Felony | Drug | Trafficking | Parole | Yes | 4JD | Yes | Parole | 53 | Aggravated Misdemeanor | Public Order | OWI |
| 3 | 2010 | 2013 | American Indian or Alaska Native - Non-Hispanic | 25-34 | D Felony | Violent | Assault | Discharged End of Sentence | No | NaN | Yes | Discharged End of Sentence | 57 | C Felony | Violent | Robbery |
| 4 | 2010 | 2013 | White - Non-Hispanic | Under 25 | D Felony | Property | Vandalism | Parole | Yes | 2JD | Yes | Parole | 58 | Aggravated Misdemeanor | Public Order | Alcohol |

Like above, we drop any rows with missing data and rename some columns for convenience.

df_new = df_new.dropna(axis=0)
df_new = df_new.rename(columns={"Recidivism - Return to Prison": "Recidivism", "Age At Release ": "Age"})

We would now like to plot the convicting (original) offense subtype against the number of days before someone reoffended following release.

alt.Chart(df_new).mark_circle().encode(
    x = "Convicting Offense Subtype",
    y = "Days to Recidivism",
    color = "Age:N"
).properties(
    width=500,
    height=500
)

The dots within each category are spread fairly uniformly, which suggests little relationship between the original offense subtype and the length of time before someone commits another offense.
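
A scatter plot with many overlapping points can hide differences between categories, so one way to check the visual impression (not done in the original analysis) is to compare summary statistics of the waiting time per subtype. The sketch below uses made-up values, and toy and summary are illustrative names.

```python
import pandas as pd

# Toy stand-in for df_new (made-up values)
toy = pd.DataFrame({
    "Convicting Offense Subtype": ["Theft", "Theft", "Theft", "OWI", "OWI", "OWI"],
    "Days to Recidivism": [30, 400, 900, 50, 420, 880],
})

# Minimum, median, and maximum waiting time within each subtype
summary = (
    toy.groupby("Convicting Offense Subtype")["Days to Recidivism"]
    .agg(["min", "median", "max"])
)
print(summary)
```

Similar medians and ranges across subtypes would support the reading of the chart, while clearly different medians would contradict it.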

Now, we would like to see what proportion of re-offending individuals have a repeat convicting offense type (i.e. the new convicting offense type is the same as the original convicting offense type).

df_new["Same Offense Type"] = df_new["Convicting Offense Type"] == df_new["New Conviction Offense Type"]
prop_same = df_new["Same Offense Type"].sum()/len(df_new)
print(f"{prop_same * 100}% of reoffending people re-committed the same offense type.")
64.54241537818638% of reoffending people re-committed the same offense type.

From the above, we know that a majority of re-offending people (about 65%) reoffend with the same crime type. We now use Altair to create a pie chart to see which crime types were repeated most often. I referenced the following link to help create the pie chart: https://altair-viz.github.io/altair-viz-v4/gallery/pie_chart.html

alt.Chart(df_new).mark_arc().encode(
    theta=alt.Theta(field="Same Offense Type", type="quantitative"),
    color=alt.Color(field="Convicting Offense Type", type="nominal", scale=alt.Scale(scheme="category20")),
)

From the above, we can see that among those who repeat an offense type, drug-related offenses are the most prevalent.
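
The quantity behind each slice of the pie chart is the number of repeat offenders per original offense type. As a sketch on made-up rows (toy and repeats are illustrative names), the same numbers can be tabulated directly with a groupby, since True counts as 1 when summed.

```python
import pandas as pd

# Toy stand-in for df_new (made-up rows)
toy = pd.DataFrame({
    "Convicting Offense Type": ["Drug", "Drug", "Property", "Violent", "Drug"],
    "New Conviction Offense Type": ["Drug", "Drug", "Property", "Drug", "Property"],
})
toy["Same Offense Type"] = (
    toy["Convicting Offense Type"] == toy["New Conviction Offense Type"]
)

# True counts as 1, so the sum is the number of repeats per original type
repeats = (
    toy.groupby("Convicting Offense Type")["Same Offense Type"]
    .sum()
    .sort_values(ascending=False)
)
print(repeats)
```

A table like this makes it easier to read off exact counts than comparing slice sizes by eye.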

Predicting the subtype of the new offense using random forest classifier

We will now use a random forest classifier to predict the subtype (arson, theft, etc.) of the new offense, based on the following input features: race-ethnicity, age, days to recidivism, the original offense subtype, and whether a repeat offense type (drug, property, etc.) was committed. As with the data above, we first need to one-hot encode the race-ethnicity, age, and convicting offense subtype columns so that they can be used as numeric information.

columns = ["Race - Ethnicity", "Age", "Convicting Offense Subtype"]
for col in columns:
    ohe = OneHotEncoder()
    # fit_transform returns a sparse matrix, so convert it to a dense array
    arr = ohe.fit_transform(df_new[[col]]).toarray()
    # one new column per category, named like "Age_25-34"
    add_list = list(ohe.get_feature_names_out())
    df_new[add_list] = arr

Then, we extract the columns with elements that have a numeric type (our desired input features will be a subset of these columns).

numeric_cols = [x for x in df_new.dtypes.index if is_numeric_dtype(df_new.dtypes[x])]
numeric_cols
['Fiscal Year Released',
 'Recidivism Reporting Year',
 'Days to Recidivism',
 'Same Offense Type',
 'Race - Ethnicity_American Indian or Alaska Native - Hispanic',
 'Race - Ethnicity_American Indian or Alaska Native - Non-Hispanic',
 'Race - Ethnicity_Asian or Pacific Islander - Hispanic',
 'Race - Ethnicity_Asian or Pacific Islander - Non-Hispanic',
 'Race - Ethnicity_Black - Hispanic',
 'Race - Ethnicity_Black - Non-Hispanic',
 'Race - Ethnicity_White - Hispanic',
 'Race - Ethnicity_White - Non-Hispanic',
 'Age_25-34',
 'Age_35-44',
 'Age_45-54',
 'Age_55 and Older',
 'Age_Under 25',
 'Convicting Offense Subtype_Alcohol',
 'Convicting Offense Subtype_Arson',
 'Convicting Offense Subtype_Assault',
 'Convicting Offense Subtype_Burglary',
 'Convicting Offense Subtype_Drug Possession',
 'Convicting Offense Subtype_Flight/Escape',
 'Convicting Offense Subtype_Forgery/Fraud',
 'Convicting Offense Subtype_Kidnap',
 'Convicting Offense Subtype_Murder/Manslaughter',
 'Convicting Offense Subtype_OWI',
 'Convicting Offense Subtype_Other Criminal',
 'Convicting Offense Subtype_Other Drug',
 'Convicting Offense Subtype_Other Public Order',
 'Convicting Offense Subtype_Other Violent',
 'Convicting Offense Subtype_Prostitution/Pimping',
 'Convicting Offense Subtype_Robbery',
 'Convicting Offense Subtype_Sex',
 'Convicting Offense Subtype_Sex Offender Registry/Residency',
 'Convicting Offense Subtype_Special Sentence Revocation',
 'Convicting Offense Subtype_Stolen Property',
 'Convicting Offense Subtype_Theft',
 'Convicting Offense Subtype_Traffic',
 'Convicting Offense Subtype_Trafficking',
 'Convicting Offense Subtype_Vandalism',
 'Convicting Offense Subtype_Weapons']
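
Note that the boolean “Same Offense Type” column appears in the list above because pandas treats boolean dtypes as numeric in this check. A minimal sketch on a made-up frame (toy and numeric_toy are illustrative names):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame with one string, one integer, and one boolean column
toy = pd.DataFrame({
    "Age": ["25-34", "Under 25"],
    "Days to Recidivism": [28, 49],
    "Same Offense Type": [True, False],
})

# Boolean columns pass the numeric check alongside integer columns
numeric_toy = [c for c in toy.columns if is_numeric_dtype(toy[c])]
print(numeric_toy)
```

The string column is excluded, which is why the original “Race - Ethnicity”, “Age”, and offense columns drop out and only their one-hot encoded versions remain.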

Now, we remove the first two names in numeric_cols, to get a list of only the inputs we want.

inputs = numeric_cols[2:]

Like above, we split the data set into a training set and a test set, so that we can check if the model created by the RandomForestClassifier object is overfitting (by comparing scores between the test set and the training set). Then, we instantiate a RandomForestClassifier object and fit the training set.

X_train, X_test, y_train, y_test = train_test_split(df_new[inputs], df_new["New Conviction Offense Sub Type"], train_size=0.6, random_state=0)
clf = RandomForestClassifier(n_estimators=40, max_depth=40, max_leaf_nodes=40, random_state=0)
clf.fit(X_train, y_train)
clf.score(X_train, y_train)
0.5827237896203413
clf.score(X_test, y_test)
0.5347258485639687
len(df_new["New Conviction Offense Sub Type"].unique())
25

The test score is within about 5 percentage points of the training score, so the model is probably not overfitting badly. The model does better than random guessing (with 25 unique new conviction offense subtypes, the expected score from uniform random guessing is 4%). That said, the scores are modest (about 58% on the training set and 53% on the test set), which indicates that it is difficult to predict the new crime someone commits based on age, days to recidivism, race-ethnicity, the subtype of the original crime, and whether they committed the same offense type again (e.g. “Violent”).
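
The 4% figure assumes every subtype is guessed equally often. Because the subtypes are far from equally common, a second reference point is the score from always guessing the most common subtype. The arithmetic below uses only numbers that appear in this write-up: the 25 subtypes, the test score, and the per-subtype test-set counts listed later in the value_counts output. The variable names are illustrative.

```python
# Figures taken from the outputs in this write-up
n_classes = 25          # unique new conviction offense subtypes
test_score = 0.5347258485639687

# Per-subtype counts of the test set, as listed in the value_counts output
test_counts = [488, 193, 182, 177, 174, 171, 103, 95, 63, 60, 49, 33,
               21, 19, 16, 16, 12, 12, 11, 10, 8, 1, 1]

# Guessing each subtype equally often
uniform_guess = 1 / n_classes
# Always guessing the most common subtype ("Trafficking", 488 rows)
majority_guess = max(test_counts) / sum(test_counts)

print(f"uniform guess:  {uniform_guess:.1%}")
print(f"majority guess: {majority_guess:.1%}")
print(f"model on test:  {test_score:.1%}")
```

Always guessing “Trafficking” would be right for 488 of the 1,915 test rows, roughly a quarter, so the random forest still improves on that simple rule by a clear margin.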

Further analyzing the random forest classifier results

I was interested in seeing what features (of the 5 used - age, days to recidivism, race-ethnicity, the subtype of the original crime, and whether they committed the same offense type again) were most important in predicting the new offense subtype.

The following code is adapted from Worksheet 18 from Math 10.

pd.Series(clf.feature_importances_, index=clf.feature_names_in_).sort_values(ascending=False)
Convicting Offense Subtype_Trafficking                              0.207784
Convicting Offense Subtype_Other Criminal                           0.106146
Convicting Offense Subtype_Burglary                                 0.105836
Same Offense Type                                                   0.098683
Convicting Offense Subtype_Theft                                    0.085817
Convicting Offense Subtype_OWI                                      0.063224
Convicting Offense Subtype_Assault                                  0.060549
Convicting Offense Subtype_Drug Possession                          0.053672
Convicting Offense Subtype_Forgery/Fraud                            0.039891
Days to Recidivism                                                  0.029197
Convicting Offense Subtype_Other Violent                            0.018633
Convicting Offense Subtype_Sex                                      0.017597
Convicting Offense Subtype_Alcohol                                  0.015739
Convicting Offense Subtype_Traffic                                  0.014048
Convicting Offense Subtype_Murder/Manslaughter                      0.009256
Convicting Offense Subtype_Weapons                                  0.007587
Convicting Offense Subtype_Other Public Order                       0.006944
Convicting Offense Subtype_Other Drug                               0.006535
Convicting Offense Subtype_Arson                                    0.006402
Convicting Offense Subtype_Robbery                                  0.006386
Race - Ethnicity_Black - Non-Hispanic                               0.006152
Convicting Offense Subtype_Vandalism                                0.005051
Race - Ethnicity_White - Non-Hispanic                               0.004780
Age_35-44                                                           0.003818
Age_25-34                                                           0.003805
Age_Under 25                                                        0.003742
Convicting Offense Subtype_Flight/Escape                            0.003710
Age_45-54                                                           0.002586
Race - Ethnicity_White - Hispanic                                   0.001644
Race - Ethnicity_American Indian or Alaska Native - Non-Hispanic    0.001164
Age_55 and Older                                                    0.001102
Convicting Offense Subtype_Sex Offender Registry/Residency          0.001090
Convicting Offense Subtype_Prostitution/Pimping                     0.000469
Convicting Offense Subtype_Special Sentence Revocation              0.000377
Race - Ethnicity_Black - Hispanic                                   0.000249
Race - Ethnicity_Asian or Pacific Islander - Non-Hispanic           0.000188
Race - Ethnicity_American Indian or Alaska Native - Hispanic        0.000072
Convicting Offense Subtype_Stolen Property                          0.000071
Convicting Offense Subtype_Kidnap                                   0.000000
Race - Ethnicity_Asian or Pacific Islander - Hispanic               0.000000
dtype: float64

From the above, the one-hot columns for the original offense subtype are, for the most part, the most important features. Whether someone committed the same offense type is also quite important compared to other factors like race-ethnicity and age. This makes sense, as we are predicting the new offense subtype. Since ~65% of re-offending Iowan prisoners reoffend with a crime of the same offense type as their previous crime, the new offense subtype is likely to be closely related to (if not the same as) the original offense subtype.
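
Because each original feature is spread across several one-hot columns, it can help to add the importances back up per original feature. The sketch below does this for a handful of values copied from the output above (not the full list, so the totals are only illustrative). The names importances, original_feature, and grouped are assumptions made for this example.

```python
import pandas as pd

# A few importances copied from the output above (not the full list)
importances = pd.Series({
    "Convicting Offense Subtype_Trafficking": 0.207784,
    "Convicting Offense Subtype_Burglary": 0.105836,
    "Same Offense Type": 0.098683,
    "Days to Recidivism": 0.029197,
    "Race - Ethnicity_Black - Non-Hispanic": 0.006152,
    "Age_35-44": 0.003818,
    "Age_25-34": 0.003805,
})

# One-hot columns are named "<original column>_<category>", so the text
# before the first underscore identifies the original feature
original_feature = importances.index.str.split("_").str[0]
grouped = importances.groupby(original_feature).sum().sort_values(ascending=False)
print(grouped)
```

Applied to the full Series built from clf.feature_importances_, the same grouping would give one total per original feature, making the five inputs directly comparable.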

Now, we would like to visualize what predictions match the actual new offense subtype. First, we create a dataframe of the actual new offense subtypes vs. the predicted subtype (from the test set).

df_result = pd.DataFrame({
    "New Conviction Offense Subtype": y_test,
    "Predicted Offense Subtype": clf.predict(X_test)
})

Using code adapted from Worksheet 16 from Math 10, we display the confusion matrix for df_result.

alt.data_transformers.enable('default', max_rows=15000)

c = alt.Chart(df_result).mark_rect().encode(
    x="New Conviction Offense Subtype:N",
    y="Predicted Offense Subtype:N",
    color=alt.Color("count()", scale=alt.Scale(scheme="spectral"))
)

c_text = alt.Chart(df_result).mark_text(color="white").encode(
    x="New Conviction Offense Subtype:N",
    y="Predicted Offense Subtype:N",
    text="count()"
)

(c+c_text).properties(
    height=400,
    width=800
)
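
The counts shown in the chart can also be produced as a plain table. As a sketch on made-up labels (toy_result and counts are illustrative names), pd.crosstab tallies each pair of predicted and true subtype.

```python
import pandas as pd

# Toy true/predicted labels (made-up values)
toy_result = pd.DataFrame({
    "New Conviction Offense Subtype": ["Theft", "Theft", "OWI", "Trafficking", "Trafficking"],
    "Predicted Offense Subtype": ["Theft", "Trafficking", "OWI", "Trafficking", "Trafficking"],
})

# Rows are predictions and columns are true labels, matching the chart's axes
counts = pd.crosstab(
    toy_result["Predicted Offense Subtype"],
    toy_result["New Conviction Offense Subtype"],
)
print(counts)
```

Cells where the row and column labels agree are correct predictions, and every other nonzero cell is a specific kind of mistake.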

We also want to see the number of test-set instances for each new conviction offense subtype, along with the proportion of instances in each subtype that the model predicted correctly.

df_result["New Conviction Offense Subtype"].value_counts()
Trafficking            488
Theft                  193
Burglary               182
Other Criminal         177
Drug Possession        174
Assault                171
OWI                    103
Forgery/Fraud           95
Traffic                 63
Flight/Escape           60
Other Public Order      49
Weapons                 33
Other Violent           21
Alcohol                 19
Sex                     16
Murder/Manslaughter     16
Vandalism               12
Other Drug              12
Arson                   11
Robbery                 10
Kidnap                   8
Other Property           1
Animals                  1
Name: New Conviction Offense Subtype, dtype: int64
df_result["Matching"] = df_result["New Conviction Offense Subtype"] == df_result["Predicted Offense Subtype"]
df_result.groupby("New Conviction Offense Subtype")["Matching"].mean().sort_values(ascending=False)
New Conviction Offense Subtype
Arson                  0.818182
Trafficking            0.813525
Other Violent          0.761905
Burglary               0.642857
Robbery                0.600000
Other Criminal         0.587571
Alcohol                0.578947
Murder/Manslaughter    0.562500
OWI                    0.543689
Theft                  0.538860
Forgery/Fraud          0.484211
Other Public Order     0.408163
Other Drug             0.333333
Assault                0.327485
Weapons                0.272727
Drug Possession        0.252874
Vandalism              0.250000
Traffic                0.174603
Flight/Escape          0.033333
Kidnap                 0.000000
Animals                0.000000
Other Property         0.000000
Sex                    0.000000
Name: Matching, dtype: float64
  • Notice that, according to the confusion matrix, some of the new conviction offense subtypes were never predicted by the model. This makes sense, as those subtypes had very few occurrences in the test data set. For example, “Animals” was never predicted by the model, but it has only 1 occurrence in the test data set.

  • The model did well on the “Trafficking” subtype: it correctly labeled ~81% of the test instances whose true new offense subtype was “Trafficking”.

  • Interestingly, the model mistakenly predicted “Trafficking” for many instances in the “Drug Possession” category (it predicted “Trafficking” even more often than “Drug Possession” for instances belonging to “Drug Possession”!). However, this mistake is not out-of-the-blue, as “Trafficking” belongs to the same offense type as “Drug Possession” (the “Drug” offense type).
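
The per-subtype proportions above are grouped by the true subtype, so they measure how many actual instances of each subtype the model found (often called recall). Grouping by the predicted subtype instead measures how often a given prediction was right (often called precision). The two can differ a lot when one label is over-predicted, as with “Trafficking”. A minimal sketch on made-up labels, where toy, recall, and precision are illustrative names:

```python
import pandas as pd

# Toy true/predicted labels (made-up values)
toy = pd.DataFrame({
    "true": ["Trafficking", "Trafficking", "Drug Possession", "Drug Possession", "Theft"],
    "pred": ["Trafficking", "Trafficking", "Trafficking", "Drug Possession", "Theft"],
})
toy["match"] = toy["true"] == toy["pred"]

# Grouping by the true label: share of each actual subtype that was found
recall = toy.groupby("true")["match"].mean()

# Grouping by the predicted label: share of each prediction that was right
precision = toy.groupby("pred")["match"].mean()

print(recall["Trafficking"], precision["Trafficking"])
```

In this toy data every true “Trafficking” case is found, yet one of the three “Trafficking” predictions is wrong because a “Drug Possession” case was mislabeled.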

Summary

In this project, I attempted to predict whether someone returned to prison within three years of release based on factors such as age, race-ethnicity, and the original crime committed. I also attempted to predict the subtype of the new offense from certain factors, given that someone did reoffend. The predictions (from a K-nearest neighbors classifier and a random forest classifier, respectively) had test accuracies of about 59% and 53%, showing that it is hard to predict outcomes involving recidivism from these features.

References


  • What is the source of your dataset(s)?

The data set is called “recidivism_for_offenders_released_from_prison”. It was uploaded by SLONNADUBE and was obtained from: https://www.kaggle.com/datasets/slonnadube/recidivism-for-offenders-released-from-prison

  • List any other references that you found helpful.

  • Worksheets 7, 16, and 18 from Math 10 S23
