Prediction of Canadian Car Accidents#

Author: Nanako Tatewaki

59793326

Course Project, UC Irvine, Math 10, F22

Introduction#

I would like to use the “Canadian Car Accidents 1994-2014” dataset from Kaggle (a dataset we have not used in this class) to predict characteristics of each accident (gender, weather condition, road condition, etc.). In this project, I will use several methods covered in lecture, such as scikit-learn, Altair charts, and classifiers, for the machine learning portion. For the extra topic, I chose the K-Nearest Neighbors Classifier and Logistic Regression.

Importing Files#

Dataset from the Kaggle#

Import the libraries and the dataset.

# data handling and visualization libraries
import pandas as pd
import numpy as np
import seaborn as sns
import altair as alt
import zipfile
import plotly.express as px
# scikit-learn models and utilities
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, log_loss
from sklearn.preprocessing import OneHotEncoder
from pandas.api.types import is_numeric_dtype
# read the zipped CSV directly into a DataFrame
df = pd.read_csv("caraccidents.zip", compression='zip')
/shared-libs/python3.7/py-core/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3258: DtypeWarning: Columns (1,2,5,12) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Clean Data#

Since the dataset is very large, we randomly sample 100,000 rows and clean them for later use.

df.shape
(5860405, 22)
df_sam = df.sample(100000, random_state = 1).copy()
#drop some unnecessary columns
df_sam = df_sam.drop(columns=['V_ID', 'P_ID'])
df_sam
C_YEAR C_MNTH C_WDAY C_HOUR C_SEV C_VEHS C_CONF C_RCFG C_WTHR C_RSUR C_RALN C_TRAF V_TYPE V_YEAR P_SEX P_AGE P_PSN P_ISEV P_SAFE P_USER
4348044 2010 4 5 16 2 2 35 02 1 1 1 03 01 1997 F 38 11 2 02 1
1937686 2003 9 1 08 2 02 35 03 1 1 1 18 01 1994 F 43 11 2 02 1
4916678 2011 12 6 12 2 01 02 02 1 1 1 01 NN NNNN F 39 99 2 NN 3
2881117 2006 2 2 17 2 02 35 01 4 2 1 18 08 2005 M 38 11 N NN 1
1058840 2001 7 7 17 2 02 21 01 1 1 2 18 17 UUUU M 55 11 1 09 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2352110 2004 9 6 15 2 02 21 01 1 1 1 18 01 1989 F 65 13 1 02 2
5658028 2014 5 2 18 2 2 36 02 1 1 1 03 11 2004 M 58 11 2 02 1
2998412 2006 6 3 06 2 3 21 01 1 1 1 18 01 1996 M 37 11 2 02 1
3659566 2008 3 3 08 2 2 21 02 3 2 3 04 01 2002 M 30 11 1 02 1
5718138 2014 7 5 16 2 1 06 03 1 1 1 18 01 2003 F 24 11 1 NN 1

100000 rows × 20 columns

Now, we will explore the interaction between gender and age. First, we will check which genders appear in the “P_SEX” column and which ages appear in the “P_AGE” column using the value_counts method.

Most of the data looks numeric; however, most columns are actually categorical codes, explained in the original PDF documentation that accompanies the dataset.

The data frame also contains missing values and inconsistent object types.

# U = unknown, N = not applicable (both are dropped below)
df_sam["P_SEX"].value_counts()
M    54190
F    41513
U     4080
N      217
Name: P_SEX, dtype: int64
# C_MNTH contains both the string '12' and the integer 12
df_sam["C_MNTH"].unique()
array([4, 9, '12', 2, 7, 10, 6, 11, '01', 5, 8, 3, 12, 1, 'UU', '02',
       '11'], dtype=object)

Convert all columns except the gender column to numeric values.

cols = [c for c in df_sam.columns if c != "P_SEX"]

Remove the entries that could not be converted, i.e., the missing values.

df_sam[cols] = df_sam[cols].apply(pd.to_numeric, errors='coerce')
df_sam = df_sam.dropna()
#convert to int object
df_sam[cols] = df_sam[cols].astype(int)
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/frame.py:3191: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]

To keep the gender data clean, we drop U (unknown) and N (not applicable).

df_sam = df_sam[(df_sam["P_SEX"]=="M") | (df_sam["P_SEX"]=="F")]
#more descriptive
df_sam = df_sam.replace({"F": "Female", "M": "Male"}).copy()

This data lacks exact date information, which is important for time-based analysis, so we approximate it as best we can using the year and month columns.

df_sam['date'] = pd.DatetimeIndex(df_sam['C_YEAR'].map(str) + '-' + df_sam['C_MNTH'].map(str))
df_sam = df_sam.set_index('date').sort_index().copy()

Now the data is ready to be investigated.

df_sam
C_YEAR C_MNTH C_WDAY C_HOUR C_SEV C_VEHS C_CONF C_RCFG C_WTHR C_RSUR C_RALN C_TRAF V_TYPE V_YEAR P_SEX P_AGE P_PSN P_ISEV P_SAFE P_USER
date
1999-01-01 1999 1 3 16 2 2 22 3 4 2 1 18 6 1991 Male 3 23 1 2 2
1999-01-01 1999 1 4 14 2 2 33 2 1 1 1 1 1 1998 Female 23 11 2 2 1
1999-01-01 1999 1 2 17 2 2 21 2 1 1 1 18 1 1991 Male 31 11 2 2 1
1999-01-01 1999 1 3 19 2 1 3 5 1 2 4 18 6 1975 Female 34 12 2 1 2
1999-01-01 1999 1 1 9 2 2 21 2 1 5 1 1 6 1997 Male 43 21 1 2 2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2014-12-01 2014 12 7 19 2 2 21 1 2 2 1 18 1 2008 Male 28 11 1 2 1
2014-12-01 2014 12 3 17 2 1 4 1 5 2 3 18 1 2003 Female 23 13 2 2 2
2014-12-01 2014 12 6 15 2 3 35 2 1 1 1 1 1 2008 Female 42 13 2 2 2
2014-12-01 2014 12 3 12 2 2 35 2 3 2 1 1 1 2005 Female 39 11 2 2 1
2014-12-01 2014 12 1 15 2 2 36 2 6 2 1 1 1 2007 Female 53 11 2 2 1

56496 rows × 20 columns

Investigation#

Gender, Age, Fatality#

Since our data is larger than Altair's default row limit, we raise the limit to 100,000 rows (possibly not strictly necessary).

alt.data_transformers.enable('default', max_rows=100000)
DataTransformerRegistry.enable('default')

The following histogram shows the number of accidents by age (binned), colored by injury severity and faceted by gender.

df_sam is still large enough that we ran into an error, so we take another, smaller sample; it is still adequate to portray the overall trend.

dfs_alt = df_sam.sample(10000, random_state = 1)
alt.Chart(dfs_alt).mark_bar().encode(
    x= alt.X("P_AGE",bin = True),
    y="count()",
    color = "P_ISEV:N",
).properties(
    height=350,
    width=350
).facet(
    column="P_SEX"
)

From this, we can notice a few points: the 20-30 age group has the largest number of accidents, and men have more accidents than women. In addition, women have a higher ratio of injury than men, while men have a higher rate of fatality.
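
To back up these visual impressions numerically, here is a quick cross-tabulation (a minimal sketch using the full cleaned sample df_sam rather than the 10,000-row plotting subset).

# share of each severity level (1 = no injury, 2 = injury, 3 = fatal) by gender
pd.crosstab(df_sam["P_SEX"], df_sam["P_ISEV"], normalize="index").round(3)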

Accident Reports by Day of Week and Hour of Day#

Here are three charts visualizing accident density by hour and day of the week. The first (center) is a heatmap of the number of records for each day of the week and hour. The second (right) shows the number of records per hour, colored by severity. The third (bottom) shows the severity breakdown by day of the week.

right = alt.Chart(dfs_alt).mark_bar().encode(
    y= "C_HOUR:N",
    x="count()",
    color = alt.Color("P_ISEV:N",scale=alt.Scale(scheme="redpurple" )),
).properties(
    height = 400
)

c_count = alt.Chart(dfs_alt).mark_rect().encode(
    x= "C_WDAY:N",
    y="C_HOUR:N",
    color = alt.Color('count()',scale=alt.Scale(scheme="redpurple" ))
)

c_text = alt.Chart(dfs_alt).mark_text(color="white").encode(
    x="C_WDAY:N",
    y="C_HOUR:N",
    text="count()"
)

center = (c_count+c_text).properties(
    height=400,
    width=300
)

h = alt.hconcat(center,right)

bottom = alt.Chart(dfs_alt).mark_bar().encode(
    x= "C_WDAY:N",
    y="count()",
     color = alt.Color("P_ISEV:N",scale=alt.Scale(scheme="redpurple" )),
).properties(
    width = 300
)

alt.vconcat(h,bottom)

From these charts, we can see that the number of accidents peaks around 3-5 pm each day, and that Friday at 5 pm has the highest count of the whole week.
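
As a numeric check of that reading (a small sketch on the full df_sam rather than the plotted sample), we can count accidents for every day-hour pair and list the busiest combinations.

# number of accidents for each (weekday, hour) pair, largest first
df_sam.groupby(["C_WDAY", "C_HOUR"]).size().sort_values(ascending=False).head()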

Machine Learning#

Again, this data frame consists mostly of categorical values, which makes it challenging to apply machine learning and extract meaningful insights.

The severity of the accident (P_ISEV: 1 = no injury, 2 = injury, 3 = fatal) is a candidate target value to be predicted from the other quantities.
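
Before fitting anything, it is worth checking the class balance of this target (a quick sketch); a heavily skewed target would make raw accuracy a misleading score.

# proportion of each severity level in the cleaned sample
df_sam["P_ISEV"].value_counts(normalize=True)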

Decision Tree Classifier#

We will apply a simple Decision Tree Classifier model to see whether there is any predictive signal.

First, convert the gender category to a binary numeric value.

df_sam["IsFemale"] = df_sam["P_SEX"].map(lambda x: 1 if x == 'Female' else 0)
dclf = DecisionTreeClassifier()
features = [i for i in df_sam.columns if pd.api.types.is_numeric_dtype(df_sam[i]) and i != 'P_ISEV']
X = df_sam[features]
y = df_sam['P_ISEV']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
dclf.fit(X_train, y_train)
dclf.score(X_test, y_test)
0.5679646017699115

Strictly speaking, we should not train the model without properly treating the categorical values. But the point here is to show that the accuracy of our model is quite low.

You can see below that the severity levels overlap almost completely, which makes them hard to classify.

alt.Chart(dfs_alt).mark_circle().encode(
    x="P_AGE",
    y=alt.Y("V_YEAR", scale=alt.Scale(zero=False)),
    color = alt.Color("P_ISEV:N",scale=alt.Scale(scheme="redpurple" ))
)

It does only slightly better than the following model, trained on shuffled labels.

y_shuffle = y_train.sample(frac = 1)
dclf.fit(X_train, y_shuffle)
dclf.score(X_test, y_test)
0.5010619469026548

However, using the feature_importances_ attribute, we can argue that some features did help the model gain those extra percentage points of accuracy.

pd.Series(dclf.feature_importances_, index=dclf.feature_names_in_).sort_values(ascending=False)
P_AGE       0.162755
V_YEAR      0.139973
C_HOUR      0.117778
C_YEAR      0.107271
C_MNTH      0.080865
C_WDAY      0.071389
C_CONF      0.069449
C_VEHS      0.035414
C_RCFG      0.030232
C_TRAF      0.030189
C_RALN      0.030139
C_WTHR      0.029876
IsFemale    0.025789
V_TYPE      0.020826
C_RSUR      0.016619
P_PSN       0.015850
P_SAFE      0.009067
P_USER      0.003356
C_SEV       0.003162
dtype: float64

Let's apply a OneHotEncoder to three of the more important categorical features (C_WDAY, C_MNTH, C_CONF) so that these categorical values can be processed properly.

encoder = OneHotEncoder()
encoder.fit(df_sam[["C_WDAY", "C_MNTH",'C_CONF']])
df_dec = df_sam.copy()
df_dec[list(encoder.get_feature_names_out())] = encoder.transform(df_sam[["C_WDAY", "C_MNTH",'C_CONF']]).toarray()

The other columns are still included as raw codes, but their importance is low enough that we can leave them as they are.

features = [i for i in df_dec.columns if pd.api.types.is_numeric_dtype(df_dec[i]) and i != 'P_ISEV']
X = df_dec[features]
y = df_dec['P_ISEV']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
dclf.fit(X_train, y_train)
dclf.score(X_test, y_test)
0.5669911504424778

The outcome with the one-hot encoding is essentially unchanged compared to the plain decision tree classifier, so we conclude that it is hard to predict the severity of the accidents with a decision tree classifier.
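
As one possible sanity check (a sketch, not part of the original analysis), a cross-validated score on the one-hot encoded features should land close to the single train/test split above if this conclusion is robust.

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy of a fresh decision tree on the encoded features
cross_val_score(DecisionTreeClassifier(), X, y, cv=5).mean()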

K Nearest Neighbor Classifier#

Let us again fit a simple classification model, but this time with a K-Nearest Neighbors classifier on a smaller set of features.

kclf = KNeighborsClassifier(n_neighbors=10)
features = ['P_AGE','V_YEAR','C_YEAR','C_HOUR']
X = df_sam[features]
y = df_sam['P_ISEV']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
kclf.fit(X_train, y_train)
kclf.score(X_test, y_test)
0.5263716814159292

The result does not seem very good either, but we are not sure whether our choice of the number of neighbors was optimal.

Below, we compare the train and test scores of the model for each value of k.

df_scores = pd.DataFrame({"k":range(1,100),"train_score":np.nan,"test_score":np.nan})
#original df_sam takes too much time to run
df_ss = df_sam.sample(1000,random_state=1).copy()
X = df_ss[features]
y = df_ss['P_ISEV']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
kclf.fit(X_train, y_train)
for i in df_scores.index:
    kclf = KNeighborsClassifier(n_neighbors=i+1)
    kclf.fit(X_train, y_train)
    df_scores.loc[i,["train_score","test_score"]] = [kclf.score(X_train, y_train),kclf.score(X_test, y_test)]

df_scores["kinv"] = 1/df_scores.k
ctrain = alt.Chart(df_scores).mark_line(color = 'black').encode(
    x = "kinv",
    y = "train_score"
)
ctest = alt.Chart(df_scores).mark_line(color="#FF007F").encode(
    x = "kinv",
    y = "test_score"
)

ctrain+ctest

We see that while the train score keeps improving (as 1/k grows), the test score remains roughly flat, so around k = 20 is a reasonable choice. Yet the accuracy is still only about 55%, which makes it hard to predict the severity correctly.
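
To confirm that figure on the larger sample (a minimal sketch reusing the same four features), we can refit a single model with the chosen k and score it on a held-out split.

# refit KNN with k = 20 on the full cleaned sample
kclf20 = KNeighborsClassifier(n_neighbors=20)
X = df_sam[features]
y = df_sam['P_ISEV']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=1)
kclf20.fit(X_train, y_train)
kclf20.score(X_test, y_test)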

Let us investigate other target values

kclf = KNeighborsClassifier(n_neighbors = 20)
X = df_sam[features]
y1 = df_sam['V_TYPE']
X_train, X_test, y1_train, y1_test = train_test_split(X, y1, train_size=0.8)
kclf.fit(X_train, y1_train)
kclf.score(X_test, y1_test)
0.9032743362831859

The KNN classifier predicts the vehicle type involved in the accident with about 90% accuracy. But is this really a better prediction than the one for severity?

If we look at the occurrence of accidents by vehicle type using the value_counts method, we see that type 1 (light-duty vehicle) accounts for around 90% of all cases, so simply predicting the majority class would achieve a similar accuracy.

df_sam["V_TYPE"].value_counts()
1     50980
6      1727
14      966
5       708
7       667
8       617
11      497
9       211
17       51
23       25
18       20
10       14
21       13
Name: V_TYPE, dtype: int64
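
To make that argument concrete, we can compare against a majority-class baseline (a sketch using scikit-learn's DummyClassifier and the split from the KNN cell above).

from sklearn.dummy import DummyClassifier

# always predict the most frequent vehicle type; a similar score means the 90% figure is uninformative
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y1_train)
dummy.score(X_test, y1_test)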

Our conclusion for the classification part is that we were unable to extract a strong signal for predicting the severity of the accident, nor for most of the other categorical values.

Linear and Polynomial Regressor#

What we can do instead is predict the number of accident occurrences given some conditions.

Here is the simplified data we will use for our regression model.

df_lin = pd.DataFrame(df_sam.index.value_counts()).sort_index().rename(columns={"date": "count"}).reset_index().copy()
df_lin['month'] = df_lin['index'].dt.month
df_lin['year'] = df_lin['index'].dt.year
df_lin
index count month year
0 1999-01-01 309 1 1999
1 1999-02-01 230 2 1999
2 1999-03-01 259 3 1999
3 1999-04-01 258 4 1999
4 1999-05-01 346 5 1999
... ... ... ... ...
187 2014-08-01 235 8 2014
188 2014-09-01 231 9 2014
189 2014-10-01 280 10 2014
190 2014-11-01 251 11 2014
191 2014-12-01 253 12 2014

192 rows × 4 columns

alt.Chart(df_lin).mark_line(color = 'grey').encode(
    x="index",
    y=alt.Y('count:Q',
        scale=alt.Scale(zero=False)),
).properties(
    height=400,
    width=700
)

From this chart, we can see a clear trend: each year follows a similar monthly cycle, and the average number of cases decreases year over year.
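
We can confirm the downward annual trend with a quick aggregation (a small sketch).

# average monthly accident count for each year
df_lin.groupby("year")["count"].mean()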

Let us visualize the trend in the following way

trend = alt.Chart(df_lin).mark_line().encode(
    x="month:O",
    y=alt.Y('count:Q',
        scale=alt.Scale(zero=False)),
         color = alt.Color("year:O",scale=alt.Scale(scheme="greys" ))
).properties(
    height=400,
    width=700
)
#average trend
mean = alt.Chart(df_lin).mark_line(color='#FF007F').encode(
    x='month:O',
    y='mean(count)'
)
trend+mean

We see that the number of accidents is lowest in April and peaks around fall and winter each year.

This cyclic pattern should be fitted better by a polynomial regression of some degree than by a straight line.

We will visualize the train and test loss for each polynomial degree d.

from sklearn.metrics import mean_squared_error
train_dict={}
test_dict={}

X_train, X_test, y_train, y_test = train_test_split(
    df_lin[['year',"month"]],
    df_lin['count'],
    train_size=0.8,
    random_state=1
)
for n in range(1,20):
    reg = LinearRegression()
    
    for i in range (2,n+1):
        X_train['m'+str(i)] = X_train['month']**i
        X_test['m'+str(i)] = X_test['month']**i
        
    reg.fit(X_train, y_train)
    train_error = mean_squared_error(y_train, reg.predict(X_train))
    train_dict[n] = train_error
    test_error = mean_squared_error(y_test, reg.predict(X_test))
    test_dict[n] = test_error
#creating a dataframe for altair chart
train_ser = pd.Series(train_dict)
test_ser = pd.Series(test_dict)
train_ser.name = "train"
test_ser.name = "test"
df_loss = pd.concat((train_ser, test_ser), axis=1)
df_loss.reset_index(inplace=True)
df_loss.rename({"index": "poly_degree"}, axis=1, inplace=True)
df_melted = df_loss.melt(id_vars="poly_degree", var_name="Type", value_name="Loss")
alt.Chart(df_melted).mark_line().encode(
    x="poly_degree",
    y=alt.Y('Loss',
        scale=alt.Scale(zero=False)),
    color = alt.Color("Type",scale=alt.Scale(scheme="redpurple" ))
)

We see that the loss, and the gap between the train and test losses, is smallest for degrees around 4-14, so we choose d = 4 for efficiency.

cols = ['year']
for i in range (1,5):
        df_lin['m'+str(i)] = df_lin['month']**i
        cols.append('m'+str(i))
reg = LinearRegression()
reg.fit(df_lin[cols],df_lin['count'])
df_lin["Pred"] = reg.predict(df_lin[cols])
base = alt.Chart(df_lin).mark_line(color = 'grey').encode(
    x="index:T",
    y=alt.Y('count:Q',
        scale=alt.Scale(zero=False)),
)

pred = alt.Chart(df_lin).mark_line(color = '#FF007F').encode(
    x="index:T",
    y=alt.Y('Pred:Q',
        scale=alt.Scale(zero=False)),
)

(base+pred).properties(
    height=400,
    width=700
)

Here we can see that our polynomial model predicts the count value accurately.

pd.Series(reg.coef_,reg.feature_names_in_)
year     -6.800735
m1     -167.042539
m2       49.634385
m3       -5.241706
m4        0.183642
dtype: float64

Looking at the coefficients, we see that the m4 coefficient is already very small, so higher-degree terms are not required.
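
To put a number on how well the degree-4 model fits (a quick sketch reusing reg and df_lin from the cells above), we can look at the R² score on the data it was fitted to.

# coefficient of determination of the degree-4 fit
reg.score(df_lin[cols], df_lin["count"])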

K-Means Clustering#

Now we understand the overall, macroscopic trends of the car accidents.

We would like to investigate at a finer time scale; however, this data does not provide the exact date of each accident.

Here we extract all incidents that occurred in 2014.

df_y = df[df['C_YEAR']==2014].copy()
cols = [c for c in df_y.columns if c != "P_SEX"]
df_y[cols] = df_y[cols].apply(pd.to_numeric, errors='coerce')
df_y = df_y.dropna()
df_y[cols] = df_y[cols].astype(int)
df_y = df_y[(df_y["P_SEX"]=="M") | (df_y["P_SEX"]=="F")]
df_y

We take a similar approach to the one at the beginning to clean the data.

To analyze the data on a time scale, we use k-means clustering to artificially assign a day of the month from 1-28 to each incident. The clustering is done on the day of the week and the weather condition, on the reasoning that incidents sharing the same day of the week and the same weather condition have a higher chance of having occurred on the same day.

kmeans = KMeans(n_clusters = 28)
dcol = ['C_WDAY','C_WTHR']
kmeans.fit(df_y[dcol])
df_y["cluster"] = kmeans.predict(df_y[dcol])+1
dfy_alt = df_y.sample(20000, random_state = 1)
alt.Chart(dfy_alt).mark_circle(size = 300).encode(
    x = "C_WDAY:N",
    y = "C_WTHR:O",
    color = "cluster:N"
).properties(
    width=400,
    height = 400
)

Now we can investigate the frequency pattern with our artificial date-time values.

df_y['date'] = pd.DatetimeIndex(df_y['C_YEAR'].map(str) + '-' + df_y['C_MNTH'].map(str) + '-' + df_y['cluster'].map(str)+ ' ' + df_y['C_HOUR'].map(str)+':00')
df_y = df_y.set_index('date').sort_index().copy()
df_y['C_WDAY'] = df_y['C_WDAY'].replace({1: "Monday", 2: "Tuesday",3: "Wednesday", 4: "Thursday",5: "Friday", 6: "Saturday",7: "Sunday"}).copy()
df_y['C_WTHR'] = df_y['C_WTHR'].replace({1: "Sunny", 2: "Cloudy",3: "Raining", 4: "Snowing",5: "Hail", 6: "Fog", 7: "Windy"}).copy()
df_y
C_YEAR C_MNTH C_WDAY C_HOUR C_SEV C_VEHS C_CONF C_RCFG C_WTHR C_RSUR ... V_TYPE V_YEAR P_ID P_SEX P_AGE P_PSN P_ISEV P_SAFE P_USER cluster
date
2014-01-01 00:00:00 2014 1 Sunday 0 2 2 35 2 Cloudy 5 ... 1 2003 1 M 23 11 2 12 1 1
2014-01-01 00:00:00 2014 1 Sunday 0 2 2 21 1 Cloudy 1 ... 1 2011 1 M 42 11 2 2 1 1
2014-01-01 00:00:00 2014 1 Sunday 0 2 2 21 1 Cloudy 1 ... 1 2003 2 M 19 13 2 2 2 1
2014-01-01 00:00:00 2014 1 Sunday 0 2 2 35 2 Cloudy 3 ... 1 2007 1 F 30 11 1 2 1 1
2014-01-01 00:00:00 2014 1 Sunday 0 2 2 35 2 Cloudy 3 ... 1 2007 2 M 39 12 2 2 2 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2014-12-28 22:00:00 2014 12 Tuesday 22 2 1 6 1 Snowing 3 ... 1 2010 1 M 21 11 2 2 1 28
2014-12-28 23:00:00 2014 12 Tuesday 23 2 1 4 1 Snowing 4 ... 1 2005 1 M 18 11 2 2 1 28
2014-12-28 23:00:00 2014 12 Tuesday 23 2 1 6 2 Snowing 5 ... 1 2007 1 M 21 11 2 2 1 28
2014-12-28 23:00:00 2014 12 Tuesday 23 2 2 21 1 Snowing 3 ... 1 2006 1 F 19 11 2 2 1 28
2014-12-28 23:00:00 2014 12 Tuesday 23 2 1 3 1 Hail 5 ... 1 2005 1 F 24 11 2 2 1 28

166317 rows × 23 columns

Though we should not look at individual values too closely, since the dates are artificial, we can clearly see the trends by day of week, by hour, and by month.

sel = alt.selection_multi(fields=["C_WDAY"], empty="none")
base = alt.Chart(df_y.sample(20000).reset_index()).mark_line().encode(
    x="date:T",
    y=alt.Y('count():Q',
        scale=alt.Scale(zero=False)),
        tooltip=["date", "count()", "C_WDAY", "C_WTHR"],
        opacity=alt.condition(sel, alt.value(1.5), alt.value(0.2)),
        color = 'C_WDAY:N'
).properties(
    height=400,
    width=700
).interactive().add_selection(sel)

text = alt.Chart(df_y.sample(20000)).mark_text(y=20, size=20).encode(
    text="C_WDAY",
    opacity=alt.condition(sel, alt.value(1), alt.value(0))
)

c = base+text
c

If we look closely at the interactive chart above, we can see that the clustering separates the days of the week fairly accurately, with each color repeating on a 7-day cycle. In addition, we can see the drop in April and the peak in autumn, as well as the Friday and 5 pm peaks that we saw earlier in this project.
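
One way to check how cleanly the clusters line up with weekdays (a sketch, not part of the original workflow) is a cross-tabulation of the cluster labels against the day of the week.

# each cluster should be dominated by a single weekday if the clustering worked as intended
pd.crosstab(df_y["cluster"], df_y["C_WDAY"]).head(10)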

We used the weather condition to cluster the dates, and it turns out that the weather condition is strongly correlated with the number of incidents.

sel2 = alt.selection_multi(fields=["C_WTHR"], empty="none")
base = alt.Chart(df_y.sample(20000).reset_index()).mark_line().encode(
    x="date:T",
    y=alt.Y('count():Q',
        scale=alt.Scale(zero=False)),
        opacity=alt.condition(sel2, alt.value(1), alt.value(0.5)),
        color = alt.Color('C_WTHR:N', scale = alt.Scale(scheme = 'Category10'))
).properties(
    height=400,
    width=700
).interactive().add_selection(sel2)

text = alt.Chart(df_y.sample(20000)).mark_text(y=20, size=20).encode(
    text="C_WTHR",
    opacity=alt.condition(sel2, alt.value(1), alt.value(0))
)

c = base+text
c

It is obvious that sunny weather is the most common and is spread evenly over the year, and we see that snow occurs in winter and is associated with high accident counts. We also see that the fog peaks are tall and narrow, implying a high number of accidents concentrated on foggy days.

Logistic Regression#

Finally, we will use Logistic Regression to see if we can predict whether or not an accident was fatal.

df_log = df_sam.copy()
df_log['Is_Fatal'] = df_log["P_ISEV"].map(lambda x: 1 if x == 3 else 0)
lclf = LogisticRegression()
features = [i for i in df_log.columns if pd.api.types.is_numeric_dtype(df_log[i]) and i != 'P_ISEV' and i != 'Is_Fatal' and i != 'C_SEV']
X = df_log[features]
y = df_log['Is_Fatal']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)
lclf.fit(X_train, y_train)
lclf.score(X_train, y_train),lclf.score(X_test, y_test)
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:818: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG,
(0.9934064961501018, 0.994070796460177)

At first, it seems like we have finally reached something meaningful. However, fatal accidents make up less than 1% of the total incidents, which tells us that our model does not predict fatality any better than always predicting the majority class.

df_log['Is_Fatal'].value_counts()
0    56131
1      365
Name: Is_Fatal, dtype: int64
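
A confusion matrix makes the problem explicit (a minimal sketch using lclf, X_test, and y_test from the cell above): if the model almost never predicts the fatal class, the high accuracy simply reflects the imbalance.

from sklearn.metrics import confusion_matrix

# rows = actual class (0 = non-fatal, 1 = fatal), columns = predicted class
confusion_matrix(y_test, lclf.predict(X_test))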

Summary#

For this project, we began by cleaning the data and then explored it using pandas methods. For the machine learning, we covered five topics: Decision Tree Classifier, KNN Classifier, Linear and Polynomial Regression, K-Means Clustering, and Logistic Regression. From these results, we conclude that this Kaggle dataset is hard to make predictions from, since it consists mostly of categorical values.

References#


  • What is the source of your dataset(s)?

Kaggle

https://www.kaggle.com/datasets/tbsteal/canadian-car-accidents-19942014?select=NCDB_1999_to_2014.csv

https://www.kaggle.com/code/lastdruid/collision-descriptive-analysis-and-visualization

  • List any other references that you found helpful.

from zipfile to dataframe https://stackoverflow.com/questions/26942476/reading-csv-zipped-files-in-python

https://www.linkedin.com/pulse/change-data-type-columns-pandas-mohit-sharma

https://www.fullstory.com/blog/categorical-vs-quantitative-data/

https://stackoverflow.com/questions/21415661/logical-operators-for-boolean-indexing-in-pandas

Worksheet 10,11,12,14,15 and Lecture notes

Color Scheme https://vega.github.io/vega/docs/schemes/

