Airline Satisfaction Analysis

Author: Mitra Rezvany

Course Project, UC Irvine, Math 10, S22

Introduction

In this project, I will be using the airline satisfaction dataset imported from Kaggle. The dataset contains customer satisfaction scores from 120,000+ airline passengers, including details about the passenger, their type of travel, and their evaluation of different factors like cleanliness, comfort, and overall experience. The accompanying data_dictionary dataset defines each of these variables.

We will start by cleaning the dataset using pandas and exploring its variables using seaborn and matplotlib charts. Then, we will explore whether the passenger’s age and the distance their flight travels can predict their satisfaction by using scikit-learn’s Logistic Regression. Furthermore, we will use the K-Nearest Neighbors Classifier to predict the satisfaction of passengers using the flight delay columns from the dataset. We’ll display these findings using interactive Altair charts.

Main portion of the project

import pandas as pd
import seaborn as sns
import numpy as np
import altair as alt
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

Import the Data

df = pd.read_csv("airline_satisfaction.csv")
data_dict = pd.read_csv("data_dictionary.csv")
df

	ID	Gender	Age	Customer Type	Type of Travel	Class	Flight Distance	Departure Delay	Arrival Delay	Departure and Arrival Time Convenience	...	On-board Service	Seat Comfort	Leg Room Service	Cleanliness	Food and Drink	In-flight Service	In-flight Wifi Service	In-flight Entertainment	Baggage Handling	Satisfaction
0	1	Male	48	First-time	Business	Business	821	2	5.0	3	...	3	5	2	5	5	5	3	5	5	Neutral or Dissatisfied
1	2	Female	35	Returning	Business	Business	821	26	39.0	2	...	5	4	5	5	3	5	2	5	5	Satisfied
2	3	Male	41	Returning	Business	Business	853	0	0.0	4	...	3	5	3	5	5	3	4	3	3	Satisfied
3	4	Male	50	Returning	Business	Business	1905	0	0.0	2	...	5	5	5	4	4	5	2	5	5	Satisfied
4	5	Female	49	Returning	Business	Business	3470	0	1.0	3	...	3	4	4	5	4	3	3	3	3	Satisfied
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
129875	129876	Male	28	Returning	Personal	Economy Plus	447	2	3.0	4	...	5	1	4	4	4	5	4	4	4	Neutral or Dissatisfied
129876	129877	Male	41	Returning	Personal	Economy Plus	308	0	0.0	5	...	5	2	5	2	2	4	3	2	5	Neutral or Dissatisfied
129877	129878	Male	42	Returning	Personal	Economy Plus	337	6	14.0	5	...	3	3	4	3	3	4	2	3	5	Neutral or Dissatisfied
129878	129879	Male	50	Returning	Personal	Economy Plus	337	31	22.0	4	...	4	4	5	3	3	4	5	3	5	Satisfied
129879	129880	Female	20	Returning	Personal	Economy Plus	337	0	0.0	1	...	4	2	4	2	2	2	3	2	1	Neutral or Dissatisfied

129880 rows × 24 columns

data_dict

	Field	Description
0	ID	Unique passenger identifier
1	Gender	Gender of the passenger (Female/Male)
2	Age	Age of the passenger
3	Customer Type	Type of airline customer (First-time/Returning)
4	Type of Travel	Purpose of the flight (Business/Personal)
5	Class	Travel class in the airplane for the passenger...
6	Flight Distance	Flight distance in miles
7	Departure Delay	Flight departure delay in minutes
8	Arrival Delay	Flight arrival delay in minutes
9	Departure and Arrival Time Convenience	Satisfaction level with the convenience of the...
10	Ease of Online Booking	Satisfaction level with the online booking exp...
11	Check-in Service	Satisfaction level with the check-in service f...
12	Online Boarding	Satisfaction level with the online boarding ex...
13	Gate Location	Satisfaction level with the gate location in t...
14	On-board Service	Satisfaction level with the on-boarding servic...
15	Seat Comfort	Satisfaction level with the comfort of the air...
16	Leg Room Service	Satisfaction level with the leg room of the ai...
17	Cleanliness	Satisfaction level with the cleanliness of the...
18	Food and Drink	Satisfaction level with the food and drinks on...
19	In-flight Service	Satisfaction level with the in-flight service ...
20	In-flight Wifi Service	Satisfaction level with the in-flight Wifi ser...
21	In-flight Entertainment	Satisfaction level with the in-flight entertai...
22	Baggage Handling	Satisfaction level with the baggage handling f...
23	Satisfaction	Overall satisfaction level with the airline (S...

# info on datatypes and null entries
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129880 entries, 0 to 129879
Data columns (total 24 columns):
 #   Column                                  Non-Null Count   Dtype  
---  ------                                  --------------   -----  
 0   ID                                      129880 non-null  int64  
 1   Gender                                  129880 non-null  object 
 2   Age                                     129880 non-null  int64  
 3   Customer Type                           129880 non-null  object 
 4   Type of Travel                          129880 non-null  object 
 5   Class                                   129880 non-null  object 
 6   Flight Distance                         129880 non-null  int64  
 7   Departure Delay                         129880 non-null  int64  
 8   Arrival Delay                           129487 non-null  float64
 9   Departure and Arrival Time Convenience  129880 non-null  int64  
 10  Ease of Online Booking                  129880 non-null  int64  
 11  Check-in Service                        129880 non-null  int64  
 12  Online Boarding                         129880 non-null  int64  
 13  Gate Location                           129880 non-null  int64  
 14  On-board Service                        129880 non-null  int64  
 15  Seat Comfort                            129880 non-null  int64  
 16  Leg Room Service                        129880 non-null  int64  
 17  Cleanliness                             129880 non-null  int64  
 18  Food and Drink                          129880 non-null  int64  
 19  In-flight Service                       129880 non-null  int64  
 20  In-flight Wifi Service                  129880 non-null  int64  
 21  In-flight Entertainment                 129880 non-null  int64  
 22  Baggage Handling                        129880 non-null  int64  
 23  Satisfaction                            129880 non-null  object 
dtypes: float64(1), int64(18), object(5)
memory usage: 23.8+ MB

df.isna().any()

ID                                        False
Gender                                    False
Age                                       False
Customer Type                             False
Type of Travel                            False
Class                                     False
Flight Distance                           False
Departure Delay                           False
Arrival Delay                              True
Departure and Arrival Time Convenience    False
Ease of Online Booking                    False
Check-in Service                          False
Online Boarding                           False
Gate Location                             False
On-board Service                          False
Seat Comfort                              False
Leg Room Service                          False
Cleanliness                               False
Food and Drink                            False
In-flight Service                         False
In-flight Wifi Service                    False
In-flight Entertainment                   False
Baggage Handling                          False
Satisfaction                              False
dtype: bool

Although this dataset is mostly clean, df.isna().any() shows that the “Arrival Delay” column contains missing values. df.info() reports 129487 non-null entries out of 129880 for that column, so 393 values are missing. We drop those rows from df before exploring the dataset further, and use df.shape to confirm the change.

df.shape

(129880, 24)

df.dropna(inplace=True)

print(f"The new number of rows in this dataset is {df.shape[0]}")

The new number of rows in this dataset is 129487
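To see how many values are missing in each column, rather than only whether any are missing, isna().sum() can be used. The sketch below is only an illustration on a small made-up frame (toy and its values are hypothetical, not the airline data), and it uses the non-in-place form of dropna, which returns a new DataFrame instead of modifying the original.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df; only the column names match.
toy = pd.DataFrame({
    "Departure Delay": [2, 26, 0, 0],
    "Arrival Delay": [5.0, np.nan, 0.0, np.nan],
})

# Number of missing entries in each column.
missing_counts = toy.isna().sum()
print(missing_counts)

# Drop rows with any missing value and keep the result in a new variable.
toy_clean = toy.dropna()
print(f"Rows before: {len(toy)}, rows after: {len(toy_clean)}")
```

Keeping the cleaned frame in a separate variable makes it easy to compare row counts before and after the drop.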

The “ID” column in df is only a unique passenger identifier and carries no information about the passenger or the flight, so we’ll remove it.

df = df.drop("ID", axis=1)

df.head()

	Gender	Age	Customer Type	Type of Travel	Class	Flight Distance	Departure Delay	Arrival Delay	Departure and Arrival Time Convenience	Ease of Online Booking	...	On-board Service	Seat Comfort	Leg Room Service	Cleanliness	Food and Drink	In-flight Service	In-flight Wifi Service	In-flight Entertainment	Baggage Handling	Satisfaction
0	Male	48	First-time	Business	Business	821	2	5.0	3	3	...	3	5	2	5	5	5	3	5	5	Neutral or Dissatisfied
1	Female	35	Returning	Business	Business	821	26	39.0	2	2	...	5	4	5	5	3	5	2	5	5	Satisfied
2	Male	41	Returning	Business	Business	853	0	0.0	4	4	...	3	5	3	5	5	3	4	3	3	Satisfied
3	Male	50	Returning	Business	Business	1905	0	0.0	2	2	...	5	5	5	4	4	5	2	5	5	Satisfied
4	Female	49	Returning	Business	Business	3470	0	1.0	3	3	...	3	4	4	5	4	3	3	3	3	Satisfied

5 rows × 23 columns

Visualization of Variables

Numerical Variables: Matplotlib and Seaborn Graphics

# When displaying the count of ages of passengers, the ticks on the x-axis overlap. 
# To avoid this, we change the figure size to 20in wide by 4in high
plt.figure(figsize=(20,4))
sns.countplot(x="Age", data=df)
plt.show()

# Statistics of Numerical Variables
df.describe()

	Age	Flight Distance	Departure Delay	Arrival Delay	Departure and Arrival Time Convenience	Ease of Online Booking	Check-in Service	Online Boarding	Gate Location	On-board Service	Seat Comfort	Leg Room Service	Cleanliness	Food and Drink	In-flight Service	In-flight Wifi Service	In-flight Entertainment	Baggage Handling
count	129487.000000	129487.000000	129487.000000	129487.000000	129487.000000	129487.000000	129487.000000	129487.000000	129487.000000	129487.000000	129487.000000	129487.000000	129487.000000	129487.000000	129487.000000	129487.000000	129487.000000	129487.000000
mean	39.428761	1190.210662	14.643385	15.091129	3.057349	2.756786	3.306239	3.252720	2.976909	3.383204	3.441589	3.351078	3.286222	3.204685	3.642373	2.728544	3.358067	3.631886
std	15.117597	997.560954	37.932867	38.465650	1.526787	1.401662	1.266146	1.350651	1.278506	1.287032	1.319168	1.316132	1.313624	1.329905	1.176614	1.329235	1.334149	1.180082
min	7.000000	31.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000
25%	27.000000	414.000000	0.000000	0.000000	2.000000	2.000000	3.000000	2.000000	2.000000	2.000000	2.000000	2.000000	2.000000	2.000000	3.000000	2.000000	2.000000	3.000000
50%	40.000000	844.000000	0.000000	0.000000	3.000000	3.000000	3.000000	3.000000	3.000000	4.000000	4.000000	4.000000	3.000000	3.000000	4.000000	3.000000	4.000000	4.000000
75%	51.000000	1744.000000	12.000000	13.000000	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	5.000000	4.000000	4.000000	4.000000	5.000000	4.000000	4.000000	5.000000
max	85.000000	4983.000000	1592.000000	1584.000000	5.000000	5.000000	5.000000	5.000000	5.000000	5.000000	5.000000	5.000000	5.000000	5.000000	5.000000	5.000000	5.000000	5.000000

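The most common age of the passengers is 39, as shown by the peak of the graph above. df.describe() tells us that the mean of the “Age” column is 39.4 and the median is 40, so the center of the distribution sits close to that peak. Note that the mean is an average, so on its own it does not identify the most frequent value.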
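The most frequent value of a column can be computed directly with mode(), which is a different statistic from the mean or the median. The following is a minimal sketch on made-up ages (the Series ages is hypothetical and not taken from the airline data).

```python
import pandas as pd

# Hypothetical ages, not taken from the airline data.
ages = pd.Series([25, 39, 39, 39, 40, 52, 60], name="Age")

most_common_age = ages.mode().iloc[0]  # most frequent value
mean_age = ages.mean()                 # arithmetic average
median_age = ages.median()             # middle value when sorted

print(most_common_age, mean_age, median_age)
```

On the real data, the same call would be df["Age"].mode(), assuming df is the cleaned DataFrame from above.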

Categorical Variables

chart = sns.countplot(x="Gender", data=df)

The genders of the passengers are pretty evenly distributed.

chart = sns.countplot(x="Satisfaction", data=df)

More passengers report being Neutral or Dissatisfied than Satisfied, although the gap between the two groups is not very large.

sns.set(rc={'figure.figsize':(13, 8)})
fig, ax = plt.subplots(1,3)
sns.histplot(x="Customer Type", data=df, stat="percent", color="cornflowerblue", ax=ax[0])
sns.histplot(x="Type of Travel", data=df, stat="percent", color="navy", ax=ax[1])
sns.histplot(x="Class", data=df, stat="percent", color="indigo", ax=ax[2])
fig.show()

These histograms show that the majority of passengers are returning customers of the airline and are traveling for business purposes, and that most passengers are split between the Business and Economy classes.

Logistic Regression

Use scikit-learn’s LogisticRegression to attempt to predict Satisfaction from Age and Flight Distance.

By default, Altair refuses to chart a DataFrame with more than 5000 rows, so we take a random sample of 5000 rows from the 129487 rows in df and store it in a new DataFrame called df1. We will then use this sample while performing the machine learning techniques.

df1 = df.sample(5000)
df1

	Gender	Age	Customer Type	Type of Travel	Class	Flight Distance	Departure Delay	Arrival Delay	Departure and Arrival Time Convenience	Ease of Online Booking	...	On-board Service	Seat Comfort	Leg Room Service	Cleanliness	Food and Drink	In-flight Service	In-flight Wifi Service	In-flight Entertainment	Baggage Handling	Satisfaction
26841	Female	61	Returning	Business	Economy	224	0	4.0	3	3	...	3	4	2	1	1	3	2	3	3	Neutral or Dissatisfied
126240	Male	23	First-time	Business	Economy	590	2	19.0	4	3	...	1	3	1	3	3	4	3	3	4	Neutral or Dissatisfied
35492	Female	67	Returning	Personal	Economy	950	0	0.0	1	2	...	3	4	2	3	1	3	2	3	3	Neutral or Dissatisfied
53512	Female	18	Returning	Personal	Economy	2358	9	0.0	5	3	...	3	2	5	2	2	4	3	2	3	Neutral or Dissatisfied
95259	Male	41	Returning	Personal	Economy	239	14	41.0	5	0	...	4	4	4	4	4	3	4	4	1	Neutral or Dissatisfied
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
10577	Female	40	Returning	Business	Business	458	0	5.0	5	5	...	5	5	5	5	3	5	5	5	5	Satisfied
74543	Female	8	Returning	Business	Business	1188	6	0.0	2	2	...	1	4	1	4	4	4	4	4	3	Neutral or Dissatisfied
12651	Male	45	First-time	Business	Economy Plus	844	4	0.0	3	3	...	1	2	4	2	2	1	3	2	4	Neutral or Dissatisfied
9821	Male	21	First-time	Business	Business	451	22	38.0	5	4	...	5	5	2	5	5	4	4	5	5	Satisfied
96877	Female	44	Returning	Personal	Economy	849	45	44.0	4	2	...	1	4	2	5	5	1	2	1	1	Neutral or Dissatisfied

5000 rows × 23 columns
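Because df.sample(5000) draws a different sample on every run, the numbers later in the notebook can change between runs. One possible way to make the sample repeatable is to pass random_state, as in this sketch (the frame big and the seed value 0 are assumptions for illustration).

```python
import pandas as pd

# Hypothetical frame with more rows than we want to plot.
big = pd.DataFrame({"Age": range(10_000)})

# Fixing random_state makes the sample identical on every run.
sample_a = big.sample(5000, random_state=0)
sample_b = big.sample(5000, random_state=0)

print(sample_a.shape)
print(sample_a.index.equals(sample_b.index))
```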

alt.Chart(df1).mark_circle().encode(
    x = "Age",
    y = "Flight Distance",
    color = "Satisfaction",
    tooltip = ["Age", "Flight Distance"]
)

no_corr = df[["Age", "Flight Distance"]]
no_corr.corr()

	Age	Flight Distance
Age	1.000000	0.099863
Flight Distance	0.099863	1.000000

The graph above shows that there is almost no linear relationship between Age and Flight Distance, and the table confirms it: their correlation is extremely low (about 0.1). A weak correlation between two predictors only tells us that they carry largely separate information; it does not by itself tell us how well they predict a passenger’s general satisfaction. To measure that, we split the data, fit a logistic regression on these two columns, and check the accuracy of its predictions.
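As a side check on how the correlation between two features relates to their usefulness as predictors, the sketch below uses purely synthetic data (x1, x2, and the rule that generates y are all assumptions, not the airline columns). The two features are generated independently, yet together they determine the label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two independent synthetic features (hypothetical, not the airline columns).
x1 = rng.normal(size=2000)
x2 = rng.normal(size=2000)
X = np.column_stack([x1, x2])

# The label depends on both features, even though they are unrelated to each other.
y = (x1 + x2 > 0).astype(int)

feature_corr = np.corrcoef(x1, x2)[0, 1]
accuracy = LogisticRegression().fit(X, y).score(X, y)

print(f"correlation between features: {feature_corr:.3f}")
print(f"training accuracy: {accuracy:.3f}")
```

In this toy setup the features are nearly uncorrelated with each other but still predict the label well, which is why predictive value has to be checked against the target rather than read off the correlation between predictors.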

cols = ["Age", "Flight Distance"]

X_train, X_test, y_train, y_test = train_test_split(df1[cols], df1["Satisfaction"], test_size=0.2, random_state=0)

print(f"The shape of this dataset using Train is {X_train.shape}")

The shape of this dataset using Train is (4000, 2)

clf = LogisticRegression()

clf.fit(X_train, y_train)

LogisticRegression()

df1["pred"] = clf.predict(df1[cols])

clf.score(X_test, y_test)

0.678

clf.score(X_train, y_train)

0.67

The model was correct about 68% of the time on the test set (0.678) and 67% of the time on the training set (0.67). Since the accuracy is this low, it is a sign that Age and Flight Distance are not the best variables to use when predicting the general satisfaction of passengers, and that we should be looking into other factors from the dataset instead.
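An accuracy figure is easier to judge next to a baseline, such as the accuracy of always predicting the most frequent class. The sketch below computes that baseline on made-up labels (the 6-to-4 split is hypothetical; the real proportions in df1 may differ).

```python
import pandas as pd

# Hypothetical labels; the real proportions in df1 may differ.
labels = pd.Series(
    ["Neutral or Dissatisfied"] * 6 + ["Satisfied"] * 4,
    name="Satisfaction",
)

# Accuracy obtained by always predicting the most frequent class.
baseline_accuracy = labels.value_counts(normalize=True).max()
print(f"Majority-class baseline: {baseline_accuracy:.2f}")
```

On the notebook's data, the same expression applied to y_test would give the number to compare against clf.score(X_test, y_test).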

KNeighborsClassifier

Use KNeighborsClassifier to predict Satisfaction using Departure Delay and Arrival Delay.

check_corr = df[["Age", "Flight Distance", "Departure Delay", "Arrival Delay"]]
check_corr.corr()

	Age	Flight Distance	Departure Delay	Arrival Delay
Age	1.000000	0.099863	-0.009263	-0.011248
Flight Distance	0.099863	1.000000	0.001992	-0.001935
Departure Delay	-0.009263	0.001992	1.000000	0.965291
Arrival Delay	-0.011248	-0.001935	0.965291	1.000000

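The table above confirms that “Departure Delay” and “Arrival Delay” are the two numerical columns with the strongest correlation (about 0.97). Because they are so strongly correlated with each other, the two columns carry largely the same information. We use them together here to see whether flight delays can predict the general satisfaction of passengers.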

k_clf = KNeighborsClassifier(n_neighbors=30)

delay = ["Departure Delay", "Arrival Delay"]

X_train2, X_test2, y_train2, y_test2 = train_test_split(df1[delay], df1["Satisfaction"], test_size=0.4, random_state=0)

k_clf.fit(X_train2,y_train2)

KNeighborsClassifier(n_neighbors=30)

df1["Prediction"] = k_clf.predict(df1[delay])
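The notebook does not report an accuracy for the K-Nearest Neighbors model. A minimal sketch of how training and test accuracy could be compared is shown below, on synthetic stand-ins for the delay columns (the generated delays, the labelling rule, and the names X_tr, X_te, knn are all assumptions for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two delay columns (hypothetical values).
departure = rng.exponential(scale=15, size=1000)
arrival = departure + rng.normal(scale=5, size=1000)
X = np.column_stack([departure, arrival])

# Synthetic labels: short delays are labelled "Satisfied" in this toy setup.
y = np.where(departure < 10, "Satisfied", "Neutral or Dissatisfied")

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)

knn = KNeighborsClassifier(n_neighbors=30).fit(X_tr, y_tr)

# Accuracy on data the model has seen versus data it has not.
train_acc = knn.score(X_tr, y_tr)
test_acc = knn.score(X_te, y_te)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

With the notebook's own variables, the equivalent calls would be k_clf.score(X_train2, y_train2) and k_clf.score(X_test2, y_test2).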

Altair

Use Altair to compare the actual Satisfaction labels with the predictions made by KNeighborsClassifier.

c1 = alt.Chart(df1).mark_circle().encode( 
    x=alt.X("Departure Delay", scale=alt.Scale(zero=False)), 
    y=alt.Y("Arrival Delay", scale=alt.Scale(zero=False)), 
    color="Satisfaction",
    tooltip=["Departure Delay", "Arrival Delay"]
)


c2 = alt.Chart(df1).mark_circle().encode( 
    x=alt.X("Departure Delay", scale=alt.Scale(zero=False)), 
    y=alt.Y("Arrival Delay", scale=alt.Scale(zero=False)), 
    color="Prediction",
    tooltip=["Departure Delay", "Arrival Delay"]
)

c1|c2

The graph on the left shows the actual data for the two types of delays, while the graph on the right shows the predicted data. Both scatterplots show the positive correlation between departure delay and arrival delay. The predicted data scatterplot is almost entirely made up of points signifying passengers being neutral or dissatisfied, with only a small portion of satisfied passengers clustered in the lower-left corner, where both the arrival and departure delays are lowest. This is plausible, since passengers are likely to be more pleased with the service of an airline when their flight has little to no delay.

Finally, we will use a for loop and log_loss to find the number of neighbors for KNeighborsClassifier that gives the best fit.

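# Try each k from 6 to 49 and print the log loss on the test set.
# Note: X_train/X_test come from the Age and Flight Distance split above;
# substitute X_train2/X_test2 (and y_train2/y_test2) to tune the delay-based model.
for k in range(6,50):
    k_clf = KNeighborsClassifier(n_neighbors=k)
    k_clf.fit(X_train, y_train)
    loss = log_loss(y_test, k_clf.predict_proba(X_test))
    print(k, loss)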

6 1.4111251863307583
7 0.899857480556819
8 0.8333009710295111
9 0.8082307400551902
10 0.7286232491501456
11 0.7390070110585765
12 0.8284365394849427
13 0.7237233569791663
14 0.730086975826891
15 0.7361681146213711
16 0.7197551478181212
17 0.7227131448999714
18 0.7085926216493472
19 0.7132651754887945
20 0.7206286495101923
21 0.725462268515407
22 0.7138777971257559
23 0.705742401354357
24 0.6983860085109066
25 0.692945073258819
26 0.6964982325491654
27 0.6913277554801867
28 0.6878890119506471
29 0.689062148796085
30 0.6870316735759631
31 0.6903532411680618
32 0.6869070903665746
33 0.6875744312757628
34 0.6891576502112906
35 0.6878724176558395
36 0.6844915334619025
37 0.6836763884426055
38 0.6851704868072505
39 0.6829297815214472
40 0.6818537677111327
41 0.6815409729458989
42 0.6804247640209738
43 0.6812981984794128
44 0.6827787650141771
45 0.6819311980858049
46 0.6807964717982479
47 0.6809733398618737
48 0.6812360389881454
49 0.6813934053853075

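Log loss is the negative average of the log of the probability that the model assigned to the true class of each instance. The values above show that log_loss generally becomes smaller as k grows larger and then levels off for k above roughly 30. Also, overfitting is more likely to occur with smaller values of n_neighbors. Thus, the best fit in this run is at k = 42, where log_loss is the smallest at 0.6804. (This may change if the notebook is run again, but k should be between 34 and 49 for the best fit.) Even the best value stays close to ln 2 ≈ 0.693, the log loss of a model that assigns probability 0.5 to every passenger, so the gain over an uninformative model is small.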
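The definition of log loss can be checked by computing it by hand and comparing with scikit-learn's log_loss. The labels and probabilities below are made up for illustration (y_true and p_satisfied are hypothetical, with 1 standing for Satisfied).

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical true labels (1 = Satisfied) and predicted P(Satisfied).
y_true = np.array([1, 0, 1, 0])
p_satisfied = np.array([0.9, 0.2, 0.6, 0.4])

# Probability the model assigned to the class that actually occurred.
p_correct = np.where(y_true == 1, p_satisfied, 1 - p_satisfied)

# Negative average of the log of those probabilities.
manual_loss = -np.mean(np.log(p_correct))
sklearn_loss = log_loss(y_true, p_satisfied)

# A model that always predicts 0.5 scores ln(2).
coin_flip_loss = log_loss(y_true, np.full(4, 0.5))

print(manual_loss, sklearn_loss, coin_flip_loss)
```

Lower values mean the model put more probability on the classes that actually occurred.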

Summary

After performing both logistic regression and K-Nearest Neighbors classification, our results showed that the numerical variables we examined (specifically Age and Flight Distance) were not able to predict general “Satisfaction” with great accuracy. However, the K-Nearest Neighbors predictions suggest that delays are related to the satisfaction of passengers, as the model predicted satisfied passengers only where the delays were smallest.

If we want to continue working with this dataset, we can try to convert the relevant categorical factors, such as Type of Travel and Class, into Boolean values and then check to see if they can better predict the general satisfaction of passengers for this airline. Another interesting machine learning idea is to try using linear regression and the delay times to predict other numerical data like flight distance.
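One possible way to turn categorical columns such as Type of Travel and Class into Boolean indicator columns is pd.get_dummies. This is only a sketch on a made-up three-row frame (toy is hypothetical), and using drop_first=True is a design choice assumed here, not something the project specifies.

```python
import pandas as pd

# Hypothetical rows using category names from the dataset.
toy = pd.DataFrame({
    "Type of Travel": ["Business", "Personal", "Business"],
    "Class": ["Business", "Economy", "Economy Plus"],
})

# One indicator column per category; drop_first removes one redundant
# column per original variable.
encoded = pd.get_dummies(toy, columns=["Type of Travel", "Class"], drop_first=True)
print(encoded.columns.tolist())
```

The resulting columns could then be passed to a classifier alongside the numerical columns.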

References

Where the dataset was found: Kaggle

The configuration of the side-by-side histograms in the visualization of relevant variables section was adapted from Airline Kaggle Notebook

K-Nearest Neighbors Guide: Winter Quarter Course Notes

The idea of using log_loss instead of test error with KNeighborsClassifier came from McDonald’s Menu Analysis

Created in Deepnote
