Prediction of Canadian Car Accidents#
Author: Nanako Tatewaki
59793326
Course Project, UC Irvine, Math 10, F22
Introduction#
I would like to use the “Canadian Car Accidents 1994-2014” dataset from Kaggle (a dataset we have not used in this class) to predict attributes of an accident (gender, weather condition, road condition, etc.). In this project, I will use several methods covered in lecture, such as scikit-learn, Altair charts, classifiers, and other machine learning tools. For the extra topic, I chose the K-Nearest Neighbors Classifier and Logistic Regression.
Importing Files#
Dataset from the Kaggle#
Import the libraries and the dataset.
import pandas as pd
import numpy as np
import seaborn as sns
import altair as alt
import zipfile
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, log_loss
import plotly.express as px
from sklearn.tree import DecisionTreeClassifier
from pandas.api.types import is_numeric_dtype
from sklearn.preprocessing import OneHotEncoder
from sklearn.cluster import KMeans
df = pd.read_csv("caraccidents.zip",compression='zip')
/shared-libs/python3.7/py-core/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3258: DtypeWarning: Columns (1,2,5,12) have mixed types.Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)
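As the warning suggests, the mixed-type columns could be handled at load time; a minimal sketch of an alternative read (not used below) would be:
#let pandas read the whole file in one pass instead of chunks, so dtypes are not guessed per chunk
df = pd.read_csv("caraccidents.zip", compression='zip', low_memory=False)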
Clean Data#
Since the dataset is very large, we will randomly choose 100,000 rows and clean them to use later on.
df.shape
(5860405, 22)
df_sam = df.sample(100000, random_state = 1).copy()
#drop some unnecessary columns
df_sam = df_sam.drop(columns=['V_ID', 'P_ID'])
df_sam
 | C_YEAR | C_MNTH | C_WDAY | C_HOUR | C_SEV | C_VEHS | C_CONF | C_RCFG | C_WTHR | C_RSUR | C_RALN | C_TRAF | V_TYPE | V_YEAR | P_SEX | P_AGE | P_PSN | P_ISEV | P_SAFE | P_USER |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4348044 | 2010 | 4 | 5 | 16 | 2 | 2 | 35 | 02 | 1 | 1 | 1 | 03 | 01 | 1997 | F | 38 | 11 | 2 | 02 | 1 |
1937686 | 2003 | 9 | 1 | 08 | 2 | 02 | 35 | 03 | 1 | 1 | 1 | 18 | 01 | 1994 | F | 43 | 11 | 2 | 02 | 1 |
4916678 | 2011 | 12 | 6 | 12 | 2 | 01 | 02 | 02 | 1 | 1 | 1 | 01 | NN | NNNN | F | 39 | 99 | 2 | NN | 3 |
2881117 | 2006 | 2 | 2 | 17 | 2 | 02 | 35 | 01 | 4 | 2 | 1 | 18 | 08 | 2005 | M | 38 | 11 | N | NN | 1 |
1058840 | 2001 | 7 | 7 | 17 | 2 | 02 | 21 | 01 | 1 | 1 | 2 | 18 | 17 | UUUU | M | 55 | 11 | 1 | 09 | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2352110 | 2004 | 9 | 6 | 15 | 2 | 02 | 21 | 01 | 1 | 1 | 1 | 18 | 01 | 1989 | F | 65 | 13 | 1 | 02 | 2 |
5658028 | 2014 | 5 | 2 | 18 | 2 | 2 | 36 | 02 | 1 | 1 | 1 | 03 | 11 | 2004 | M | 58 | 11 | 2 | 02 | 1 |
2998412 | 2006 | 6 | 3 | 06 | 2 | 3 | 21 | 01 | 1 | 1 | 1 | 18 | 01 | 1996 | M | 37 | 11 | 2 | 02 | 1 |
3659566 | 2008 | 3 | 3 | 08 | 2 | 2 | 21 | 02 | 3 | 2 | 3 | 04 | 01 | 2002 | M | 30 | 11 | 1 | 02 | 1 |
5718138 | 2014 | 7 | 5 | 16 | 2 | 1 | 06 | 03 | 1 | 1 | 1 | 18 | 01 | 2003 | F | 24 | 11 | 1 | NN | 1 |
100000 rows × 20 columns
Now we will explore the interaction of gender and age. First, we will figure out which gender codes appear in the “P_SEX” column and which ages appear in the “P_AGE” column, using the value_counts method.
Most of the data looks numeric; however, most of the columns are categorical codes, explained in the original PDF documentation of the dataset.
The data frame also contains missing values and inconsistent object types.
#U = unknown, N = not applicable
df_sam["P_SEX"].value_counts()
M 54190
F 41513
U 4080
N 217
Name: P_SEX, dtype: int64
#contains both the string '12' and the integer 12
df_sam["C_MNTH"].unique()
array([4, 9, '12', 2, 7, 10, 6, 11, '01', 5, 8, 3, 12, 1, 'UU', '02',
'11'], dtype=object)
Convert all the columns except gender to numeric values.
cols = [c for c in df_sam.columns if c != "P_SEX"]
Remove the entries that could not be converted, i.e., the missing values.
df_sam[cols] = df_sam[cols].apply(pd.to_numeric, errors='coerce')
df_sam = df_sam.dropna()
#convert to int object
df_sam[cols] = df_sam[cols].astype(int)
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/frame.py:3191: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self[k1] = value[k2]
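This warning is likely harmless here, since df_sam was created with .copy(); still, the assignment could be written with .loc as the warning suggests. A sketch:
#assign through .loc so pandas does not flag a chained assignment
df_sam.loc[:, cols] = df_sam[cols].astype(int)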
In order to keep the gender data unambiguous, we will drop U (unknown) and N (not applicable).
df_sam = df_sam[(df_sam["P_SEX"]=="M") | (df_sam["P_SEX"]=="F")]
#more descriptive
df_sam = df_sam.replace({"F": "Female", "M": "Male"}).copy()
This data lacks full datetime information, which is important for analyzing the data over time, so we do our best to accommodate this by building a year-month date.
df_sam['date'] = pd.DatetimeIndex(df_sam['C_YEAR'].map(str) + '-' + df_sam['C_MNTH'].map(str))
df_sam = df_sam.set_index('date').sort_index().copy()
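An equivalent, slightly more explicit way to build the same year-month timestamps (a sketch of an alternative, not used here) is pd.to_datetime with a format string:
#hypothetical alternative: parse the year-month strings with an explicit format
date = pd.to_datetime(df_sam['C_YEAR'].astype(str) + '-' + df_sam['C_MNTH'].astype(str), format='%Y-%m')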
Now the data is ready to be investigated.
df_sam
date | C_YEAR | C_MNTH | C_WDAY | C_HOUR | C_SEV | C_VEHS | C_CONF | C_RCFG | C_WTHR | C_RSUR | C_RALN | C_TRAF | V_TYPE | V_YEAR | P_SEX | P_AGE | P_PSN | P_ISEV | P_SAFE | P_USER |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1999-01-01 | 1999 | 1 | 3 | 16 | 2 | 2 | 22 | 3 | 4 | 2 | 1 | 18 | 6 | 1991 | Male | 3 | 23 | 1 | 2 | 2 |
1999-01-01 | 1999 | 1 | 4 | 14 | 2 | 2 | 33 | 2 | 1 | 1 | 1 | 1 | 1 | 1998 | Female | 23 | 11 | 2 | 2 | 1 |
1999-01-01 | 1999 | 1 | 2 | 17 | 2 | 2 | 21 | 2 | 1 | 1 | 1 | 18 | 1 | 1991 | Male | 31 | 11 | 2 | 2 | 1 |
1999-01-01 | 1999 | 1 | 3 | 19 | 2 | 1 | 3 | 5 | 1 | 2 | 4 | 18 | 6 | 1975 | Female | 34 | 12 | 2 | 1 | 2 |
1999-01-01 | 1999 | 1 | 1 | 9 | 2 | 2 | 21 | 2 | 1 | 5 | 1 | 1 | 6 | 1997 | Male | 43 | 21 | 1 | 2 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2014-12-01 | 2014 | 12 | 7 | 19 | 2 | 2 | 21 | 1 | 2 | 2 | 1 | 18 | 1 | 2008 | Male | 28 | 11 | 1 | 2 | 1 |
2014-12-01 | 2014 | 12 | 3 | 17 | 2 | 1 | 4 | 1 | 5 | 2 | 3 | 18 | 1 | 2003 | Female | 23 | 13 | 2 | 2 | 2 |
2014-12-01 | 2014 | 12 | 6 | 15 | 2 | 3 | 35 | 2 | 1 | 1 | 1 | 1 | 1 | 2008 | Female | 42 | 13 | 2 | 2 | 2 |
2014-12-01 | 2014 | 12 | 3 | 12 | 2 | 2 | 35 | 2 | 3 | 2 | 1 | 1 | 1 | 2005 | Female | 39 | 11 | 2 | 2 | 1 |
2014-12-01 | 2014 | 12 | 1 | 15 | 2 | 2 | 36 | 2 | 6 | 2 | 1 | 1 | 1 | 2007 | Female | 53 | 11 | 2 | 2 | 1 |
56496 rows × 20 columns
Investigation#
Gender, Age, Fatality#
Since our data is larger than Altair's default row limit (5,000 rows), we raise the limit to 100,000 rows (this may not be strictly necessary).
alt.data_transformers.enable('default', max_rows=100000)
DataTransformerRegistry.enable('default')
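If we ever needed the full sample, the row limit could also be removed entirely (an alternative, at the cost of larger notebook output):
#disable Altair's maximum row check altogether
alt.data_transformers.disable_max_rows()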
The following is a histogram showing the number of accidents by age (binned), colored by injury severity and faceted by gender.
df_sam is still large enough to cause an error in Altair, so we take another, smaller sample; it is adequate to portray the overall trend.
dfs_alt = df_sam.sample(10000, random_state = 1)
alt.Chart(dfs_alt).mark_bar().encode(
x= alt.X("P_AGE",bin = True),
y="count()",
color = "P_ISEV:N",
).properties(
height=350,
width=350
).facet(
column="P_SEX"
)
From this, we can notice a few points: the 20-30 age group has the largest number of accidents, and men appear in more accidents than women. In addition, women have a higher proportion of injuries than men, while men have a higher fatality rate than women.
Accident Reports by Day and Hour of the Day#
Here are three charts that visualize accident density by hour and day of the week. The first (center) chart shows the number of records for each combination of day and hour. The second (right) chart shows severity by hour, with the number of records on the x-axis and the hour on the y-axis. The third (bottom) chart shows the severity breakdown by day of the week.
right = alt.Chart(dfs_alt).mark_bar().encode(
y= "C_HOUR:N",
x="count()",
color = alt.Color("P_ISEV:N",scale=alt.Scale(scheme="redpurple" )),
).properties(
height = 400
)
c_count = alt.Chart(dfs_alt).mark_rect().encode(
x= "C_WDAY:N",
y="C_HOUR:N",
color = alt.Color('count()',scale=alt.Scale(scheme="redpurple" ))
)
c_text = alt.Chart(dfs_alt).mark_text(color="white").encode(
x="C_WDAY:N",
y="C_HOUR:N",
text="count()"
)
center = (c_count+c_text).properties(
height=400,
width=300
)
h = alt.hconcat(center,right)
bottom = alt.Chart(dfs_alt).mark_bar().encode(
x= "C_WDAY:N",
y="count()",
color = alt.Color("P_ISEV:N",scale=alt.Scale(scheme="redpurple" )),
).properties(
width = 300
)
alt.vconcat(h,bottom)
From these charts, we can see that the number of accidents peaks around 3-5 pm each day, and that Friday at 5 pm has the largest number of accidents of the entire week.
Machine Learning#
Again, this data frame consists mostly of categorical values, so it is challenging to apply machine learning and extract meaningful insights.
The severity of the accident (P_ISEV: 1 = no injury, 2 = injury, 3 = fatal) is a candidate target value to be predicted from the other quantities.
Decision Tree Classifier#
We will apply a simple Decision Tree Classifier to see whether there is any correlation.
First, convert the gender category to a discrete value.
df_sam["IsFemale"] = df_sam["P_SEX"].map(lambda x: 1 if x == 'Female' else 0)
dclf = DecisionTreeClassifier()
features = [i for i in df_sam.columns if pd.api.types.is_numeric_dtype(df_sam[i]) and i != 'P_ISEV']
X = df_sam[features]
y = df_sam['P_ISEV']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
dclf.fit(X_train, y_train)
dclf.score(X_test, y_test)
0.5679646017699115
Strictly speaking, we should not train the model without properly treating the categorical values. Our point here, however, is to show that the accuracy of this model is very low.
You can see below that the severity levels overlap almost completely and are hard to classify.
alt.Chart(dfs_alt).mark_circle().encode(
x="P_AGE",
y=alt.Y("V_YEAR", scale=alt.Scale(zero=False)),
color = alt.Color("P_ISEV:N",scale=alt.Scale(scheme="redpurple" ))
)
It does only slightly better than this randomly trained model:
y_shuffle = y_train.sample(frac = 1)
dclf.fit(X_train, y_shuffle)
dclf.score(X_test, y_test)
0.5010619469026548
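A cleaner way to obtain such a chance-level baseline (a sketch using scikit-learn's DummyClassifier, not part of the analysis above) is a classifier that always predicts the most frequent severity class:
from sklearn.dummy import DummyClassifier
#baseline model: always predict the most common class seen in y_train
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
dummy.score(X_test, y_test)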
However, using the feature importance attribute, we can argue that some of the features did help the model gain the extra few percentage points of accuracy.
pd.Series(dclf.feature_importances_, index=dclf.feature_names_in_).sort_values(ascending=False)
P_AGE 0.162755
V_YEAR 0.139973
C_HOUR 0.117778
C_YEAR 0.107271
C_MNTH 0.080865
C_WDAY 0.071389
C_CONF 0.069449
C_VEHS 0.035414
C_RCFG 0.030232
C_TRAF 0.030189
C_RALN 0.030139
C_WTHR 0.029876
IsFemale 0.025789
V_TYPE 0.020826
C_RSUR 0.016619
P_PSN 0.015850
P_SAFE 0.009067
P_USER 0.003356
C_SEV 0.003162
dtype: float64
Let's take three of the more important categorical features (C_MNTH, C_WDAY, C_CONF) and apply OneHotEncoder so that the categorical values can be processed properly.
encoder = OneHotEncoder()
encoder.fit(df_sam[["C_WDAY", "C_MNTH",'C_CONF']])
df_dec = df_sam.copy()
df_dec[list(encoder.get_feature_names_out())] = encoder.transform(df_sam[["C_WDAY", "C_MNTH",'C_CONF']]).toarray()
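A roughly equivalent shortcut (a sketch, not used here) is pandas' get_dummies, which creates the same kind of indicator columns and drops the original ones:
#hypothetical alternative: one-hot encode the same three columns with pandas
df_dummies = pd.get_dummies(df_sam, columns=["C_WDAY", "C_MNTH", "C_CONF"])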
The other columns are still included, but their importance is low enough that we can ignore this here.
features = [i for i in df_dec.columns if pd.api.types.is_numeric_dtype(df_dec[i]) and i != 'P_ISEV']
X = df_dec[features]
y = df_dec['P_ISEV']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
dclf.fit(X_train, y_train)
dclf.score(X_test, y_test)
0.5669911504424778
The outcome with OneHotEncoder seems essentially unchanged compared to the plain decision tree classifier, so we conclude that it is hard to predict the severity of the accidents with a decision tree classifier.
K Nearest Neighbor Classifier#
Let us again fit a simple classification model, but this time using a KNN classifier with a smaller feature set.
kclf = KNeighborsClassifier(n_neighbors=10)
features = ['P_AGE','V_YEAR','C_YEAR','C_HOUR']
X = df_sam[features]
y = df_sam['P_ISEV']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
kclf.fit(X_train, y_train)
kclf.score(X_test, y_test)
0.5263716814159292
The result does not look very good either, but we are not sure whether our choice of the number of neighbors was optimal.
Below we compare the train and test scores of our model for each value of k.
df_scores = pd.DataFrame({"k":range(1,100),"train_score":np.nan,"test_score":np.nan})
#original df_sam takes too much time to run
df_ss = df_sam.sample(1000,random_state=1).copy()
X = df_ss[features]
y = df_ss['P_ISEV']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
kclf.fit(X_train, y_train)
for i in df_scores.index:
kclf = KNeighborsClassifier(n_neighbors=i+1)
kclf.fit(X_train, y_train)
df_scores.loc[i,["train_score","test_score"]] = [kclf.score(X_train, y_train),kclf.score(X_test, y_test)]
df_scores["kinv"] = 1/df_scores.k
ctrain = alt.Chart(df_scores).mark_line(color = 'black').encode(
x = "kinv",
y = "train_score"
)
ctest = alt.Chart(df_scores).mark_line(color="#FF007F").encode(
x = "kinv",
y = "test_score"
)
ctrain+ctest
We see that while the train score improves, the test score remains flat, so around k = 20 is a reasonable choice; yet the accuracy is still about 55%, so it remains hard to predict the severity correctly.
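Since KNN is distance-based, features on very different scales (for example P_AGE versus C_YEAR) can dominate the distance. One possible refinement (a sketch, not evaluated here) is to standardize the features before fitting:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
#scale each feature to zero mean and unit variance before computing KNN distances
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=20))
knn_pipe.fit(X_train, y_train)
knn_pipe.score(X_test, y_test)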
Let us investigate other target values
kclf = KNeighborsClassifier(n_neighbors = 20)
X = df_sam[features]
y1 = df_sam['V_TYPE']
X_train, X_test, y1_train, y1_test = train_test_split(X, y1, train_size=0.8)
kclf.fit(X_train, y1_train)
kclf.score(X_test, y1_test)
0.9032743362831859
The KNN classifier predicts the vehicle type of an accident with about 90% accuracy, but is this prediction any better than the severity one?
If we look at the distribution of accidents by vehicle type using the value_counts method, we see that type 1 (light duty vehicle) accounts for around 90% of all accident cases, so even a naive prediction of the majority class would score a similar accuracy.
df_sam["V_TYPE"].value_counts()
1 50980
6 1727
14 966
5 708
7 667
8 617
11 497
9 211
17 51
23 25
18 20
10 14
21 13
Name: V_TYPE, dtype: int64
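The majority-class baseline can be computed directly (a sketch): the share of the most common vehicle type is roughly the accuracy that a constant prediction would achieve.
#fraction of rows belonging to the most frequent vehicle type (about 0.9)
df_sam["V_TYPE"].value_counts(normalize=True).iloc[0]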
Our conclusion for the classifier part is that we were unable to extract a strong correlation for predicting the severity of the accident, nor for most of the other categorical values.
Linear and Polynomial Regressor#
What we can do instead is predict the number of accident occurrences under given conditions.
Here is the simplified data we will use for our regression model.
df_lin = pd.DataFrame(df_sam.index.value_counts()).sort_index().rename(columns={"date": "count"}).reset_index().copy()
df_lin['month'] = df_lin['index'].dt.month
df_lin['year'] = df_lin['index'].dt.year
df_lin
 | index | count | month | year |
---|---|---|---|---|
0 | 1999-01-01 | 309 | 1 | 1999 |
1 | 1999-02-01 | 230 | 2 | 1999 |
2 | 1999-03-01 | 259 | 3 | 1999 |
3 | 1999-04-01 | 258 | 4 | 1999 |
4 | 1999-05-01 | 346 | 5 | 1999 |
... | ... | ... | ... | ... |
187 | 2014-08-01 | 235 | 8 | 2014 |
188 | 2014-09-01 | 231 | 9 | 2014 |
189 | 2014-10-01 | 280 | 10 | 2014 |
190 | 2014-11-01 | 251 | 11 | 2014 |
191 | 2014-12-01 | 253 | 12 | 2014 |
192 rows × 4 columns
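The same monthly counts could also be obtained by resampling the datetime index of df_sam directly; a sketch:
#count accidents per month-start period straight from the datetime index
monthly_counts = df_sam.resample('MS').size()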
alt.Chart(df_lin).mark_line(color = 'grey').encode(
x="index",
y=alt.Y('count:Q',
scale=alt.Scale(zero=False)),
).properties(
height=400,
width=700
)
From this chart, we can see a clear trend: each year follows a similar monthly cycle, and the average number of cases decreases from year to year.
Let us visualize this trend in the following way.
trend = alt.Chart(df_lin).mark_line().encode(
x="month:O",
y=alt.Y('count:Q',
scale=alt.Scale(zero=False)),
color = alt.Color("year:O",scale=alt.Scale(scheme="greys" ))
).properties(
height=400,
width=700
)
#average trend
mean = alt.Chart(df_lin).mark_line(color='#FF007F').encode(
x='month:O',
y='mean(count)'
)
trend+mean
We see that the number of accidents is lowest in April and peaks around fall-winter every year.
This cyclic pattern should be fitted better by a polynomial regression of some degree than by a straight line.
We will visualize the loss for each degree d of the polynomial regression.
from sklearn.metrics import mean_squared_error
train_dict={}
test_dict={}
X_train, X_test, y_train, y_test = train_test_split(
df_lin[['year',"month"]],
df_lin['count'],
train_size=0.8,
random_state=1
)
for n in range(1,20):
reg = LinearRegression()
for i in range (2,n+1):
X_train['m'+str(i)] = X_train['month']**i
X_test['m'+str(i)] = X_test['month']**i
reg.fit(X_train, y_train)
train_error = mean_squared_error(y_train, reg.predict(X_train))
train_dict[n] = train_error
test_error = mean_squared_error(y_test, reg.predict(X_test))
test_dict[n] = test_error
#creating a dataframe for altair chart
train_ser = pd.Series(train_dict)
test_ser = pd.Series(test_dict)
train_ser.name = "train"
test_ser.name = "test"
df_loss = pd.concat((train_ser, test_ser), axis=1)
df_loss.reset_index(inplace=True)
df_loss.rename({"index": "poly_degree"}, axis=1, inplace=True)
df_melted = df_loss.melt(id_vars="poly_degree", var_name="Type", value_name="Loss")
alt.Chart(df_melted).mark_line().encode(
x="poly_degree",
y=alt.Y('Loss',
scale=alt.Scale(zero=False)),
color = alt.Color("Type",scale=alt.Scale(scheme="redpurple" ))
)
We see that the loss, and the gap between train and test losses, is smallest around d = 4-14, so we choose d = 4 for efficiency.
cols = ['year']
for i in range (1,5):
df_lin['m'+str(i)] = df_lin['month']**i
cols.append('m'+str(i))
reg = LinearRegression()
reg.fit(df_lin[cols],df_lin['count'])
df_lin["Pred"] = reg.predict(df_lin[cols])
base = alt.Chart(df_lin).mark_line(color = 'grey').encode(
x="index:T",
y=alt.Y('count:Q',
scale=alt.Scale(zero=False)),
)
pred = alt.Chart(df_lin).mark_line(color = '#FF007F').encode(
x="index:T",
y=alt.Y('Pred:Q',
scale=alt.Scale(zero=False)),
)
(base+pred).properties(
height=400,
width=700
)
Here we can see that our polynomial model predicts the count values fairly accurately.
pd.Series(reg.coef_,reg.feature_names_in_)
year -6.800735
m1 -167.042539
m2 49.634385
m3 -5.241706
m4 0.183642
dtype: float64
If we look at the coefficients, we see that the m4 coefficient is already very small, so higher-degree terms are not required.
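The manual power columns above could also be generated with scikit-learn's PolynomialFeatures inside a pipeline. Note that this is only a related sketch, not identical to the fit above, because PolynomialFeatures would also add powers of year and year-month cross terms:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
#degree-4 polynomial in (year, month); include_bias=False avoids a redundant constant column
poly_reg = make_pipeline(PolynomialFeatures(degree=4, include_bias=False), LinearRegression())
poly_reg.fit(df_lin[['year', 'month']], df_lin['count'])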
K-Means Clustering#
Now we understand the overall, macroscopic trends of the car accidents.
We would like to investigate a smaller scope of view; however, this data does not provide the exact date of an accident.
Here we extract all incidents that occurred in 2014.
df_y = df[df['C_YEAR']==2014].copy()
cols = [c for c in df_y.columns if c != "P_SEX"]
df_y[cols] = df_y[cols].apply(pd.to_numeric, errors='coerce')
df_y = df_y.dropna()
df_y[cols] = df_y[cols].astype(int)
df_y = df_y[(df_y["P_SEX"]=="M") | (df_y["P_SEX"]=="F")]
df_y
We take a similar approach as at the beginning to clean the data.
To analyze the data on a finer time scale, we use k-means clustering to artificially assign a day of the month (1-28) to each incident. The clustering is done on the day of the week and the weather condition, because incidents with the same day of the week and the same weather condition have a higher chance of having happened on the same day.
kmeans = KMeans(n_clusters = 28)
dcol = ['C_WDAY','C_WTHR']
kmeans.fit(df_y[dcol])
df_y["cluster"] = kmeans.predict(df_y[dcol])+1
dfy_alt = df_y.sample(20000, random_state = 1)
alt.Chart(dfy_alt).mark_circle(size = 300).encode(
x = "C_WDAY:N",
y = "C_WTHR:O",
color = "cluster:N"
).properties(
width=400,
height = 400
)
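As a quick sanity check (a sketch), we can look at how many incidents land in each artificial day to make sure the 28 clusters are all reasonably populated:
#number of incidents assigned to each artificial day 1-28
df_y["cluster"].value_counts().sort_index()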
Now we can investigate the frequency pattern with our artificial date-time values.
df_y['date'] = pd.DatetimeIndex(df_y['C_YEAR'].map(str) + '-' + df_y['C_MNTH'].map(str) + '-' + df_y['cluster'].map(str)+ ' ' + df_y['C_HOUR'].map(str)+':00')
df_y = df_y.set_index('date').sort_index().copy()
df_y['C_WDAY'] = df_y['C_WDAY'].replace({1: "Monday", 2: "Tuesday",3: "Wednesday", 4: "Thursday",5: "Friday", 6: "Saturday",7: "Sunday"}).copy()
df_y['C_WTHR'] = df_y['C_WTHR'].replace({1: "Sunny", 2: "Cloudy",3: "Raining", 4: "Snowing",5: "Hail", 6: "Fog", 7: "Windy"}).copy()
df_y
date | C_YEAR | C_MNTH | C_WDAY | C_HOUR | C_SEV | C_VEHS | C_CONF | C_RCFG | C_WTHR | C_RSUR | ... | V_TYPE | V_YEAR | P_ID | P_SEX | P_AGE | P_PSN | P_ISEV | P_SAFE | P_USER | cluster |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2014-01-01 00:00:00 | 2014 | 1 | Sunday | 0 | 2 | 2 | 35 | 2 | Cloudy | 5 | ... | 1 | 2003 | 1 | M | 23 | 11 | 2 | 12 | 1 | 1 |
2014-01-01 00:00:00 | 2014 | 1 | Sunday | 0 | 2 | 2 | 21 | 1 | Cloudy | 1 | ... | 1 | 2011 | 1 | M | 42 | 11 | 2 | 2 | 1 | 1 |
2014-01-01 00:00:00 | 2014 | 1 | Sunday | 0 | 2 | 2 | 21 | 1 | Cloudy | 1 | ... | 1 | 2003 | 2 | M | 19 | 13 | 2 | 2 | 2 | 1 |
2014-01-01 00:00:00 | 2014 | 1 | Sunday | 0 | 2 | 2 | 35 | 2 | Cloudy | 3 | ... | 1 | 2007 | 1 | F | 30 | 11 | 1 | 2 | 1 | 1 |
2014-01-01 00:00:00 | 2014 | 1 | Sunday | 0 | 2 | 2 | 35 | 2 | Cloudy | 3 | ... | 1 | 2007 | 2 | M | 39 | 12 | 2 | 2 | 2 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2014-12-28 22:00:00 | 2014 | 12 | Tuesday | 22 | 2 | 1 | 6 | 1 | Snowing | 3 | ... | 1 | 2010 | 1 | M | 21 | 11 | 2 | 2 | 1 | 28 |
2014-12-28 23:00:00 | 2014 | 12 | Tuesday | 23 | 2 | 1 | 4 | 1 | Snowing | 4 | ... | 1 | 2005 | 1 | M | 18 | 11 | 2 | 2 | 1 | 28 |
2014-12-28 23:00:00 | 2014 | 12 | Tuesday | 23 | 2 | 1 | 6 | 2 | Snowing | 5 | ... | 1 | 2007 | 1 | M | 21 | 11 | 2 | 2 | 1 | 28 |
2014-12-28 23:00:00 | 2014 | 12 | Tuesday | 23 | 2 | 2 | 21 | 1 | Snowing | 3 | ... | 1 | 2006 | 1 | F | 19 | 11 | 2 | 2 | 1 | 28 |
2014-12-28 23:00:00 | 2014 | 12 | Tuesday | 23 | 2 | 1 | 3 | 1 | Hail | 5 | ... | 1 | 2005 | 1 | F | 24 | 11 | 2 | 2 | 1 | 28 |
166317 rows × 23 columns
Though we should not look at individual values too closely, since the dates are artificial, we can clearly see the trends by day of the week, hour, and month.
sel = alt.selection_multi(fields=["C_WDAY"], empty="none")
base = alt.Chart(df_y.sample(20000).reset_index()).mark_line().encode(
x="date:T",
y=alt.Y('count():Q',
scale=alt.Scale(zero=False)),
tooltip=["date", "count()", "C_WDAY", "C_WTHR"],
opacity=alt.condition(sel, alt.value(1.5), alt.value(0.2)),
color = 'C_WDAY:N'
).properties(
height=400,
width=700
).interactive().add_selection(sel)
text = alt.Chart(df_y.sample(20000)).mark_text(y=20, size=20).encode(
text="C_WDAY",
opacity=alt.condition(sel, alt.value(1), alt.value(0))
)
c = base+text
c
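Note that selection_multi and add_selection are deprecated in newer Altair versions (5 and later); a sketch of the equivalent selection setup under that assumption would be:
#Altair 5 style: point selections replace selection_multi, add_params replaces add_selection
sel = alt.selection_point(fields=["C_WDAY"], empty=False)
base = base.add_params(sel)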