Airline Satisfaction Analysis

Author: Mitra Rezvany

Course Project, UC Irvine, Math 10, S22

Introduction

In this project, I will be using the airline satisfaction dataset imported from Kaggle. The dataset contains customer satisfaction scores from 120,000+ airline passengers, including details about the passenger, their type of travel, and their evaluation of different factors like cleanliness, comfort, and overall experience. The accompanying data_dictionary dataset defines each of these variables.

We will start by cleaning the dataset using pandas and exploring its variables using matplotlib and seaborn charts. Then, we will explore whether a passenger’s age and the distance of their flight can predict their satisfaction, using scikit-learn’s LogisticRegression. Furthermore, we will use the K-Nearest Neighbors classifier to predict the satisfaction of passengers from the flight delay columns in the dataset. We’ll display these findings using interactive Altair charts.

Main portion of the project

import pandas as pd
import seaborn as sns
import numpy as np
import altair as alt
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

Import the Data

df = pd.read_csv("airline_satisfaction.csv")
data_dict = pd.read_csv("data_dictionary.csv")
df
ID Gender Age Customer Type Type of Travel Class Flight Distance Departure Delay Arrival Delay Departure and Arrival Time Convenience ... On-board Service Seat Comfort Leg Room Service Cleanliness Food and Drink In-flight Service In-flight Wifi Service In-flight Entertainment Baggage Handling Satisfaction
0 1 Male 48 First-time Business Business 821 2 5.0 3 ... 3 5 2 5 5 5 3 5 5 Neutral or Dissatisfied
1 2 Female 35 Returning Business Business 821 26 39.0 2 ... 5 4 5 5 3 5 2 5 5 Satisfied
2 3 Male 41 Returning Business Business 853 0 0.0 4 ... 3 5 3 5 5 3 4 3 3 Satisfied
3 4 Male 50 Returning Business Business 1905 0 0.0 2 ... 5 5 5 4 4 5 2 5 5 Satisfied
4 5 Female 49 Returning Business Business 3470 0 1.0 3 ... 3 4 4 5 4 3 3 3 3 Satisfied
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
129875 129876 Male 28 Returning Personal Economy Plus 447 2 3.0 4 ... 5 1 4 4 4 5 4 4 4 Neutral or Dissatisfied
129876 129877 Male 41 Returning Personal Economy Plus 308 0 0.0 5 ... 5 2 5 2 2 4 3 2 5 Neutral or Dissatisfied
129877 129878 Male 42 Returning Personal Economy Plus 337 6 14.0 5 ... 3 3 4 3 3 4 2 3 5 Neutral or Dissatisfied
129878 129879 Male 50 Returning Personal Economy Plus 337 31 22.0 4 ... 4 4 5 3 3 4 5 3 5 Satisfied
129879 129880 Female 20 Returning Personal Economy Plus 337 0 0.0 1 ... 4 2 4 2 2 2 3 2 1 Neutral or Dissatisfied

129880 rows × 24 columns

data_dict
     Field                                   Description
 0   ID                                      Unique passenger identifier
 1   Gender                                  Gender of the passenger (Female/Male)
 2   Age                                     Age of the passenger
 3   Customer Type                           Type of airline customer (First-time/Returning)
 4   Type of Travel                          Purpose of the flight (Business/Personal)
 5   Class                                   Travel class in the airplane for the passenger...
 6   Flight Distance                         Flight distance in miles
 7   Departure Delay                         Flight departure delay in minutes
 8   Arrival Delay                           Flight arrival delay in minutes
 9   Departure and Arrival Time Convenience  Satisfaction level with the convenience of the...
 10  Ease of Online Booking                  Satisfaction level with the online booking exp...
 11  Check-in Service                        Satisfaction level with the check-in service f...
 12  Online Boarding                         Satisfaction level with the online boarding ex...
 13  Gate Location                           Satisfaction level with the gate location in t...
 14  On-board Service                        Satisfaction level with the on-boarding servic...
 15  Seat Comfort                            Satisfaction level with the comfort of the air...
 16  Leg Room Service                        Satisfaction level with the leg room of the ai...
 17  Cleanliness                             Satisfaction level with the cleanliness of the...
 18  Food and Drink                          Satisfaction level with the food and drinks on...
 19  In-flight Service                       Satisfaction level with the in-flight service ...
 20  In-flight Wifi Service                  Satisfaction level with the in-flight Wifi ser...
 21  In-flight Entertainment                 Satisfaction level with the in-flight entertai...
 22  Baggage Handling                        Satisfaction level with the baggage handling f...
 23  Satisfaction                            Overall satisfaction level with the airline (S...
# info on datatypes and null entries
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129880 entries, 0 to 129879
Data columns (total 24 columns):
 #   Column                                  Non-Null Count   Dtype  
---  ------                                  --------------   -----  
 0   ID                                      129880 non-null  int64  
 1   Gender                                  129880 non-null  object 
 2   Age                                     129880 non-null  int64  
 3   Customer Type                           129880 non-null  object 
 4   Type of Travel                          129880 non-null  object 
 5   Class                                   129880 non-null  object 
 6   Flight Distance                         129880 non-null  int64  
 7   Departure Delay                         129880 non-null  int64  
 8   Arrival Delay                           129487 non-null  float64
 9   Departure and Arrival Time Convenience  129880 non-null  int64  
 10  Ease of Online Booking                  129880 non-null  int64  
 11  Check-in Service                        129880 non-null  int64  
 12  Online Boarding                         129880 non-null  int64  
 13  Gate Location                           129880 non-null  int64  
 14  On-board Service                        129880 non-null  int64  
 15  Seat Comfort                            129880 non-null  int64  
 16  Leg Room Service                        129880 non-null  int64  
 17  Cleanliness                             129880 non-null  int64  
 18  Food and Drink                          129880 non-null  int64  
 19  In-flight Service                       129880 non-null  int64  
 20  In-flight Wifi Service                  129880 non-null  int64  
 21  In-flight Entertainment                 129880 non-null  int64  
 22  Baggage Handling                        129880 non-null  int64  
 23  Satisfaction                            129880 non-null  object 
dtypes: float64(1), int64(18), object(5)
memory usage: 23.8+ MB
df.isna().any()
ID                                        False
Gender                                    False
Age                                       False
Customer Type                             False
Type of Travel                            False
Class                                     False
Flight Distance                           False
Departure Delay                           False
Arrival Delay                              True
Departure and Arrival Time Convenience    False
Ease of Online Booking                    False
Check-in Service                          False
Online Boarding                           False
Gate Location                             False
On-board Service                          False
Seat Comfort                              False
Leg Room Service                          False
Cleanliness                               False
Food and Drink                            False
In-flight Service                         False
In-flight Wifi Service                    False
In-flight Entertainment                   False
Baggage Handling                          False
Satisfaction                              False
dtype: bool

Although this dataset is mostly clean, df.isna().any() shows that the “Arrival Delay” column contains missing values (df.info() lists 129487 non-null entries out of 129880), so we’ll drop those rows from df before exploring the dataset further. We use df.shape before and after the drop to confirm the change.

df.shape
(129880, 24)
df.dropna(inplace=True)
print(f"The new number of rows in this dataset is {df.shape[0]}")
The new number of rows in this dataset is 129487
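The missing-value check above can be made a little more specific. The sketch below uses a small made-up DataFrame (the name toy and its values are assumptions for illustration, not rows from the airline data) to count missing entries per column and to drop rows based on one column only. On the real data, the count for “Arrival Delay” should be 393, the difference between 129880 and 129487.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the airline data; the values are made up for illustration.
toy = pd.DataFrame({
    "Departure Delay": [2, 26, 0, 31],
    "Arrival Delay": [5.0, np.nan, 0.0, 22.0],
    "Satisfaction": ["Satisfied", "Satisfied",
                     "Neutral or Dissatisfied", "Satisfied"],
})

# Count the missing entries in each column instead of only flagging them.
missing_counts = toy.isna().sum()
print(missing_counts)

# Drop only the rows where "Arrival Delay" is missing.
cleaned = toy.dropna(subset=["Arrival Delay"])
print(cleaned.shape)
```

Passing subset makes the intent explicit: rows are removed only because of the column we know is incomplete.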

The “ID” column in df is only a row identifier and carries no information about the passenger or the flight, so we’ll remove it.

df = df.drop("ID", axis=1)
df.head()
Gender Age Customer Type Type of Travel Class Flight Distance Departure Delay Arrival Delay Departure and Arrival Time Convenience Ease of Online Booking ... On-board Service Seat Comfort Leg Room Service Cleanliness Food and Drink In-flight Service In-flight Wifi Service In-flight Entertainment Baggage Handling Satisfaction
0 Male 48 First-time Business Business 821 2 5.0 3 3 ... 3 5 2 5 5 5 3 5 5 Neutral or Dissatisfied
1 Female 35 Returning Business Business 821 26 39.0 2 2 ... 5 4 5 5 3 5 2 5 5 Satisfied
2 Male 41 Returning Business Business 853 0 0.0 4 4 ... 3 5 3 5 5 3 4 3 3 Satisfied
3 Male 50 Returning Business Business 1905 0 0.0 2 2 ... 5 5 5 4 4 5 2 5 5 Satisfied
4 Female 49 Returning Business Business 3470 0 1.0 3 3 ... 3 4 4 5 4 3 3 3 3 Satisfied

5 rows × 23 columns

Visualization of Variables

Numerical Variables: Matplotlib and Seaborn Graphics

# When displaying the count of ages of passengers, the ticks on the x-axis overlap. 
# To avoid this, we change the figure size to 20in wide by 4in high
plt.figure(figsize=(20,4))
sns.countplot(x="Age", data=df)
plt.show()
[Figure: count plot of passenger ages]
# Statistics of Numerical Variables
df.describe()
Age Flight Distance Departure Delay Arrival Delay Departure and Arrival Time Convenience Ease of Online Booking Check-in Service Online Boarding Gate Location On-board Service Seat Comfort Leg Room Service Cleanliness Food and Drink In-flight Service In-flight Wifi Service In-flight Entertainment Baggage Handling
count 129487.000000 129487.000000 129487.000000 129487.000000 129487.000000 129487.000000 129487.000000 129487.000000 129487.000000 129487.000000 129487.000000 129487.000000 129487.000000 129487.000000 129487.000000 129487.000000 129487.000000 129487.000000
mean 39.428761 1190.210662 14.643385 15.091129 3.057349 2.756786 3.306239 3.252720 2.976909 3.383204 3.441589 3.351078 3.286222 3.204685 3.642373 2.728544 3.358067 3.631886
std 15.117597 997.560954 37.932867 38.465650 1.526787 1.401662 1.266146 1.350651 1.278506 1.287032 1.319168 1.316132 1.313624 1.329905 1.176614 1.329235 1.334149 1.180082
min 7.000000 31.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
25% 27.000000 414.000000 0.000000 0.000000 2.000000 2.000000 3.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 3.000000 2.000000 2.000000 3.000000
50% 40.000000 844.000000 0.000000 0.000000 3.000000 3.000000 3.000000 3.000000 3.000000 4.000000 4.000000 4.000000 3.000000 3.000000 4.000000 3.000000 4.000000 4.000000
75% 51.000000 1744.000000 12.000000 13.000000 4.000000 4.000000 4.000000 4.000000 4.000000 4.000000 5.000000 4.000000 4.000000 4.000000 5.000000 4.000000 4.000000 5.000000
max 85.000000 4983.000000 1592.000000 1584.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000

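The peak of the count plot above suggests that the most common age of the passengers is about 39. This is close to the center of the distribution reported by df.describe(): the mean of the “Age” column is 39.4 and the median is 40. Note that the mean and the most common value (the mode) are different statistics, so they only agree approximately here.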
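To read the most common age directly rather than from the plot, pandas provides Series.mode(). The following is a minimal sketch on a made-up list of ages (the values are assumptions chosen for illustration); on the real data the same calls would be applied to df["Age"].

```python
import pandas as pd

# Made-up ages for illustration; replace with df["Age"] on the real data.
ages = pd.Series([25, 39, 39, 39, 40, 52, 61], name="Age")

# mode() returns every value tied for most frequent; take the first one.
most_common_age = ages.mode().iloc[0]
mean_age = ages.mean()
median_age = ages.median()

print(f"mode={most_common_age}, mean={mean_age:.2f}, median={median_age}")
```

In this toy example the mode and the median coincide while the mean is pulled upward by the older passengers, which shows why the three statistics should not be used interchangeably.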

Categorical Variables

chart = sns.countplot(x="Gender", data=df)
[Figure: count plot of Gender]

The genders of the passengers are pretty evenly distributed.

chart = sns.countplot(x="Satisfaction", data=df)
[Figure: count plot of Satisfaction]

More passengers report being Neutral or Dissatisfied than Satisfied; however, the discrepancy is not very large.

sns.set(rc={'figure.figsize':(13, 8)})
fig, ax = plt.subplots(1,3)
sns.histplot(x="Customer Type", data=df, stat="percent", color="cornflowerblue", ax=ax[0])
sns.histplot(x="Type of Travel", data=df, stat="percent", color="navy", ax=ax[1])
sns.histplot(x="Class", data=df, stat="percent", color="indigo", ax=ax[2])
fig.show()
[Figure: side-by-side percentage histograms of Customer Type, Type of Travel, and Class]

These histograms show that the majority of passengers are returning customers of this airline and are traveling for business purposes, and that most passengers are split between the Business and Economy classes.

Logistic Regression

Use scikit-learn’s LogisticRegression to attempt to predict Satisfaction using Age and Flight Distance.

As Altair by default raises an error for datasets with more than 5000 rows, we will take a random sample of 5000 rows from the 129487 rows in df and store it in a new DataFrame called df1. We will then use this sample while performing machine learning techniques.

df1 = df.sample(5000)
df1
Gender Age Customer Type Type of Travel Class Flight Distance Departure Delay Arrival Delay Departure and Arrival Time Convenience Ease of Online Booking ... On-board Service Seat Comfort Leg Room Service Cleanliness Food and Drink In-flight Service In-flight Wifi Service In-flight Entertainment Baggage Handling Satisfaction
26841 Female 61 Returning Business Economy 224 0 4.0 3 3 ... 3 4 2 1 1 3 2 3 3 Neutral or Dissatisfied
126240 Male 23 First-time Business Economy 590 2 19.0 4 3 ... 1 3 1 3 3 4 3 3 4 Neutral or Dissatisfied
35492 Female 67 Returning Personal Economy 950 0 0.0 1 2 ... 3 4 2 3 1 3 2 3 3 Neutral or Dissatisfied
53512 Female 18 Returning Personal Economy 2358 9 0.0 5 3 ... 3 2 5 2 2 4 3 2 3 Neutral or Dissatisfied
95259 Male 41 Returning Personal Economy 239 14 41.0 5 0 ... 4 4 4 4 4 3 4 4 1 Neutral or Dissatisfied
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10577 Female 40 Returning Business Business 458 0 5.0 5 5 ... 5 5 5 5 3 5 5 5 5 Satisfied
74543 Female 8 Returning Business Business 1188 6 0.0 2 2 ... 1 4 1 4 4 4 4 4 3 Neutral or Dissatisfied
12651 Male 45 First-time Business Economy Plus 844 4 0.0 3 3 ... 1 2 4 2 2 1 3 2 4 Neutral or Dissatisfied
9821 Male 21 First-time Business Business 451 22 38.0 5 4 ... 5 5 2 5 5 4 4 5 5 Satisfied
96877 Female 44 Returning Personal Economy 849 45 44.0 4 2 ... 1 4 2 5 5 1 2 1 1 Neutral or Dissatisfied

5000 rows × 23 columns
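Because the sample is drawn at random, the rows (and every later result) change each time the notebook is run. One possible way to make the sample repeatable is to pass random_state to DataFrame.sample. The sketch below demonstrates this on a synthetic DataFrame; the name big, its size, and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df; the values are randomly generated, not real data.
rng = np.random.default_rng(0)
big = pd.DataFrame({
    "Age": rng.integers(7, 86, size=20_000),
    "Flight Distance": rng.integers(31, 4984, size=20_000),
})

# Sampling twice with the same random_state selects the same 5000 rows.
sample_a = big.sample(n=5000, random_state=0)
sample_b = big.sample(n=5000, random_state=0)

print(len(sample_a), sample_a.equals(sample_b))
```

The particular seed value is arbitrary; what matters is that it stays fixed between runs.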

alt.Chart(df1).mark_circle().encode(
    x = "Age",
    y = "Flight Distance",
    color = "Satisfaction",
    tooltip = ["Age", "Flight Distance"]
)
no_corr = df[["Age", "Flight Distance"]]
no_corr.corr()
                      Age  Flight Distance
Age              1.000000         0.099863
Flight Distance  0.099863         1.000000

The graph above shows that there is essentially no correlation between Age and Flight Distance, and the table confirms this: their correlation is extremely low (about 0.1). This tells us that the two variables carry different information, but it does not by itself tell us how well they predict a passenger’s general satisfaction. To check that, we split the data into training and test sets, fit a logistic regression on these two variables, and measure its accuracy.

cols = ["Age", "Flight Distance"]
X_train, X_test, y_train, y_test = train_test_split(df1[cols], df1["Satisfaction"], test_size=0.2, random_state=0)
print(f"The shape of this dataset using Train is {X_train.shape}")
The shape of this dataset using Train is (4000, 2)
clf = LogisticRegression()
clf.fit(X_train, y_train)
LogisticRegression()
df1["pred"] = clf.predict(df1[cols])
clf.score(X_test, y_test)
0.678
clf.score(X_train, y_train)
0.67

We were correct about 68% of the time on the test set (and 67% on the training set). Since the accuracy is this low, these are not the best variables to use when predicting the general satisfaction of passengers, and we should look into other factors from the dataset instead.
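An accuracy figure is easier to judge next to a baseline: the accuracy obtained by always predicting the most common class. Below is a minimal sketch of that comparison using a made-up label column (the 60/40 split is an assumption for illustration, not the real class balance); on the real data the same calls would be applied to y_test.

```python
import pandas as pd

# Made-up labels for illustration; replace with y_test on the real data.
labels = pd.Series(["Neutral or Dissatisfied"] * 6 + ["Satisfied"] * 4)

# Share of each class; the largest share is the accuracy of a model
# that always predicts the majority class.
class_shares = labels.value_counts(normalize=True)
majority_class = class_shares.idxmax()
baseline_accuracy = class_shares.max()

print(f"Always predicting '{majority_class}' scores {baseline_accuracy:.2f}")
```

If the classifier's test accuracy is only slightly above this baseline, the chosen features add little beyond knowing which class is more common.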

KNeighborsClassifier

Use KNeighborsClassifier to predict Satisfaction using Departure Delay and Arrival Delay.

check_corr = df[["Age", "Flight Distance", "Departure Delay", "Arrival Delay"]]
check_corr.corr()
                      Age  Flight Distance  Departure Delay  Arrival Delay
Age              1.000000         0.099863        -0.009263      -0.011248
Flight Distance  0.099863         1.000000         0.001992      -0.001935
Departure Delay -0.009263         0.001992         1.000000       0.965291
Arrival Delay   -0.011248        -0.001935         0.965291       1.000000

The table above confirms that “Departure Delay” and “Arrival Delay” are the two numerical columns with the strongest correlation (about 0.97). Because they are so strongly correlated, the two columns carry largely the same information; we will use both of them together to predict the general satisfaction of passengers.

k_clf = KNeighborsClassifier(n_neighbors=30)
delay = ["Departure Delay", "Arrival Delay"]
X_train2, X_test2, y_train2, y_test2 = train_test_split(df1[delay], df1["Satisfaction"], test_size=0.4, random_state=0)
k_clf.fit(X_train2,y_train2)
KNeighborsClassifier(n_neighbors=30)
df1["Prediction"] = k_clf.predict(df1[delay])
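Before plotting the predictions, it can help to score the classifier on the held-out test split, as was done for logistic regression. The following sketch shows the idea on synthetic data: the delay values, the labels, and the variable names (dep, arr, knn, and so on) are all assumptions generated for illustration, so the printed accuracies say nothing about the airline data itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for two strongly correlated delay columns (minutes).
dep = rng.exponential(scale=15.0, size=n)
arr = np.clip(dep + rng.normal(0.0, 5.0, size=n), 0.0, None)
X = np.column_stack([dep, arr])

# Synthetic labels: short delays are made more likely to be "Satisfied".
p_satisfied = np.where(dep < 10, 0.6, 0.3)
y = np.where(rng.random(n) < p_satisfied,
             "Satisfied", "Neutral or Dissatisfied")

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.4, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=30)
knn.fit(X_tr, y_tr)

# Compare accuracy on data the model has seen and data it has not.
train_acc = knn.score(X_tr, y_tr)
test_acc = knn.score(X_te, y_te)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

In the notebook itself the equivalent calls would be k_clf.score(X_train2, y_train2) and k_clf.score(X_test2, y_test2).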

Altair

Use Altair to compare the actual Satisfaction labels with the predictions made by KNeighborsClassifier.

c1 = alt.Chart(df1).mark_circle().encode( 
    x=alt.X("Departure Delay", scale=alt.Scale(zero=False)), 
    y=alt.Y("Arrival Delay", scale=alt.Scale(zero=False)), 
    color="Satisfaction",
    tooltip=["Departure Delay", "Arrival Delay"]
)


c2 = alt.Chart(df1).mark_circle().encode( 
    x=alt.X("Departure Delay", scale=alt.Scale(zero=False)), 
    y=alt.Y("Arrival Delay", scale=alt.Scale(zero=False)), 
    color="Prediction",
    tooltip=["Departure Delay", "Arrival Delay"]
)
c1|c2

The graph on the left shows the actual data for the two types of delays, while the graph on the right shows the predicted data. Both scatterplots show the positive correlation between departure delay and arrival delay. The predicted data scatterplot is almost entirely made up of points signifying passengers being neutral or dissatisfied, with only a small portion of satisfied passengers clustered where the x and y axes cross, essentially where the arrival and departure delays are lowest. This makes sense, as people are more pleased with the service of an airline if there are little to no delays for their flight.

Finally, we will use a for loop and log_loss to find the number of neighbors for KNeighborsClassifier that gives the best fit.

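# Try each k and report the log loss on the test set (lower is better).
# Note: X_train/X_test here are the Age and Flight Distance split from the
# logistic regression section; the delay split is X_train2/X_test2.
for k in range(6,50):
    k_clf = KNeighborsClassifier(n_neighbors=k)
    k_clf.fit(X_train, y_train)
    loss = log_loss(y_test, k_clf.predict_proba(X_test))
    print(k, loss)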
6 1.4111251863307583
7 0.899857480556819
8 0.8333009710295111
9 0.8082307400551902
10 0.7286232491501456
11 0.7390070110585765
12 0.8284365394849427
13 0.7237233569791663
14 0.730086975826891
15 0.7361681146213711
16 0.7197551478181212
17 0.7227131448999714
18 0.7085926216493472
19 0.7132651754887945
20 0.7206286495101923
21 0.725462268515407
22 0.7138777971257559
23 0.705742401354357
24 0.6983860085109066
25 0.692945073258819
26 0.6964982325491654
27 0.6913277554801867
28 0.6878890119506471
29 0.689062148796085
30 0.6870316735759631
31 0.6903532411680618
32 0.6869070903665746
33 0.6875744312757628
34 0.6891576502112906
35 0.6878724176558395
36 0.6844915334619025
37 0.6836763884426055
38 0.6851704868072505
39 0.6829297815214472
40 0.6818537677111327
41 0.6815409729458989
42 0.6804247640209738
43 0.6812981984794128
44 0.6827787650141771
45 0.6819311980858049
46 0.6807964717982479
47 0.6809733398618737
48 0.6812360389881454
49 0.6813934053853075

Log loss is the negative average of the log of the predicted probability assigned to the true class of each instance, so lower values are better. The values above show that log_loss generally becomes smaller as k grows larger and changes little for k above about 35. Overfitting is also more likely to occur with smaller values of n_neighbors. Thus, we get the best fit when we set k to 42, where log_loss is smallest at 0.6804. (This may change if the notebook is run again, since df1 is a random sample, but k should be between 34 and 49 for the best fit.)
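To make the log loss definition concrete, here is a small hand-written version for the two-class case. The function name binary_log_loss, the labels, and the probabilities are assumptions made up for illustration; the notebook itself uses scikit-learn's log_loss.

```python
import math

def binary_log_loss(y_true, p_positive):
    """Average negative log of the probability given to the true class."""
    total = 0.0
    for y, p in zip(y_true, p_positive):
        # Probability the model assigned to the class that actually occurred.
        p_true = p if y == 1 else 1.0 - p
        total += -math.log(p_true)
    return total / len(y_true)

y_true = [1, 0, 1, 0]

# Confident and correct predictions give a small loss.
confident = binary_log_loss(y_true, [0.9, 0.1, 0.8, 0.2])

# Predicting 0.5 for everyone gives ln(2), about 0.693.
coin_flip = binary_log_loss(y_true, [0.5, 0.5, 0.5, 0.5])

print(f"confident: {confident:.4f}, coin flip: {coin_flip:.4f}")
```

The coin-flip value is a useful reference point: for two classes, a log loss near 0.69 means the predicted probabilities are only slightly more informative than always answering 50/50.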

Summary

After performing both logistic regression and K-Nearest Neighbors classification, our results showed that the numerical variables we examined (specifically Age and Flight Distance) were not able to predict general “Satisfaction” with great accuracy. However, the K-Nearest Neighbors predictions suggest that delays do have an impact on the satisfaction of passengers, as the model predicted more dissatisfied passengers when the delays were longer.

If we want to continue working with this dataset, we can try to convert the relevant categorical factors, such as Type of Travel and Class, into Boolean values and then check to see if they can better predict the general satisfaction of passengers for this airline. Another interesting machine learning idea is to try using linear regression and the delay times to predict other numerical data like flight distance.
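As a starting point for that follow-up, pandas can convert categorical columns into Boolean indicator columns with pd.get_dummies. The sketch below runs on a three-row made-up DataFrame (the name toy and its rows are assumptions for illustration); on the real data the same call would be applied to df.

```python
import pandas as pd

# Made-up rows for illustration; the category names match the dataset.
toy = pd.DataFrame({
    "Type of Travel": ["Business", "Personal", "Business"],
    "Class": ["Business", "Economy", "Economy Plus"],
})

# One Boolean column per category value, named "<column>_<value>".
encoded = pd.get_dummies(toy, columns=["Type of Travel", "Class"], dtype=bool)

print(encoded.columns.tolist())
print(encoded)
```

The resulting Boolean columns can be passed to LogisticRegression or KNeighborsClassifier in the same way as the numerical columns used above.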

References

Where the dataset was found: Kaggle

The configuration of the side-by-side histograms in the visualization of variables section was adapted from Airline Kaggle Notebook

K-Nearest Neighbors Guide: Winter Quarter Course Notes

I decided to use log_loss instead of test error with KNeighborsClassifier through McDonald’s Menu Analysis
