# Wine Quality

Author:  Paula Mendez-Lagunas

Course Project, UC Irvine, Math 10, S23

## Introduction

The dataset my project focuses on is about wine. It contains columns that describe chemical properties of each wine and a column that assigns it a quality score. After refining the original dataframe, I decided to apply classification machine learning models to the data. The two models I chose are DecisionTreeClassifier and KNeighborsClassifier.

### Importing 

In [1]:
import pandas as pd
import altair as alt

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

## Section 1: Preparing the Data
In this section I use pandas to gain information about the original dataframe and create a new column named Class indicating whether the wine is good or bad based on the quality score it received. Finally I create a new dataframe with balanced values of good and bad wines.

In [2]:
df_temp = pd.read_csv("winequality-red.csv")

In [3]:
# Checking if there are any columns with missing values; notice all columns are numerical
df_temp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


I want to treat this as a classification problem, so I add a new column that indicates whether the wine's quality is good or bad. I do this using a lambda function and map: wines with a quality score above 6 are labeled "good" and all others are labeled "bad".

In [4]:
df_temp["quality"].max()

8

In [5]:
df_temp["class"] = df_temp["quality"].map(lambda x: "good" if x>6 else "bad")
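As a small sketch of what this labeling rule does, the example below applies the same lambda to a toy Series and compares it with a vectorized alternative using `np.where`. The toy values and the names `quality`, `labels_map`, and `labels_where` are assumptions made up for illustration; they are not part of the project's data.

```python
import numpy as np
import pandas as pd

# Toy quality scores (hypothetical values, not taken from the wine data)
quality = pd.Series([3, 5, 6, 7, 8], name="quality")

# Same rule as in the cell above: scores strictly greater than 6 are "good"
labels_map = quality.map(lambda x: "good" if x > 6 else "bad")

# A vectorized way to express the same rule
labels_where = pd.Series(np.where(quality > 6, "good", "bad"), index=quality.index)

print(labels_map.tolist())
print(labels_where.tolist())
```

Both approaches should give the same labels; a score of exactly 6 falls in the "bad" group because the comparison is strict.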

In [6]:
# This helps visualize the proportion of good wine to bad wine in the original dataframe
alt.Chart(df_temp).mark_bar().encode(
    x = "class",
    y = "count()",
    tooltip = ["count()"]
)

To create a more balanced dataframe, I take 250 random rows whose class is bad and concatenate them with a dataframe that has all the rows whose class is good.

In [7]:
# I used a random state to get reproducible results
df_good = df_temp[df_temp["class"] == "good"]
df_bad = df_temp[df_temp["class"] == "bad"].sample(250, random_state= 97)

In [8]:
# I use axis = 0 since I want to join them along their rows
df = pd.concat((df_good, df_bad), axis= 0)

In [9]:
df.shape

(467, 13)

Therefore my final dataframe is named df and we can see that it now contains 467 rows (of which 250 are labeled "bad" wine) and 13 columns (the 12 original and one we added named "class").
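One way to double-check the balance after sampling and concatenating is `value_counts` on the class column. The sketch below shows the idea on a toy dataframe; the values and the names `toy`, `toy_good`, `toy_bad`, and `toy_balanced` are hypothetical. Applied to df, the same call should report 250 "bad" rows and 467 - 250 = 217 "good" rows.

```python
import pandas as pd

# Toy frame (hypothetical values): 2 "good" rows and 6 "bad" rows
toy = pd.DataFrame({
    "alcohol": [12.0, 11.5, 9.4, 9.8, 10.0, 9.5, 10.2, 9.9],
    "class": ["good", "good", "bad", "bad", "bad", "bad", "bad", "bad"],
})

toy_good = toy[toy["class"] == "good"]
# Keep only 3 of the "bad" rows; random_state makes the sample reproducible
toy_bad = toy[toy["class"] == "bad"].sample(3, random_state=0)

# Stack the two pieces along the rows, as in the cells above
toy_balanced = pd.concat((toy_good, toy_bad), axis=0)

# Count how many rows of each class remain
counts = toy_balanced["class"].value_counts()
print(counts)
```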

## Section 2: Visualizing Data Relations
In this section I mainly use altair to create charts for specific data relations that I want to see. To avoid rewriting the same code, I also write a function called 'make_chart'.

In [10]:
# This helps me understand the distribution of the data in each column
df.describe()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 467.0 | 467.0 | 467.0 | 467.0 | 467.0 | 467.0 | 467.0 | 467.0 | 467.0 | 467.0 | 467.0 | 467.0 |
| mean | 8.438972 | 0.486895 | 0.297559 | 2.588544 | 0.083615 | 14.706638 | 41.557816 | 0.996456 | 3.307323 | 0.686574 | 10.852819 | 6.179872 |
| std | 1.778647 | 0.177284 | 0.203943 | 1.415962 | 0.042843 | 9.873047 | 32.468504 | 0.002044 | 0.147025 | 0.158632 | 1.171713 | 0.977088 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 7.0 | 0.99007 | 2.88 | 0.37 | 8.7 | 3.0 |
| 25% | 7.2 | 0.35 | 0.1 | 1.9 | 0.066 | 7.0 | 20.0 | 0.99516 | 3.21 | 0.57 | 9.8 | 5.0 |
| 50% | 8.1 | 0.46 | 0.32 | 2.2 | 0.077 | 12.0 | 31.0 | 0.99643 | 3.3 | 0.67 | 10.8 | 6.0 |
| 75% | 9.55 | 0.605 | 0.45 | 2.6 | 0.089 | 19.0 | 52.0 | 0.9977 | 3.39 | 0.77 | 11.7 | 7.0 |
| max | 15.6 | 1.33 | 0.79 | 15.4 | 0.467 | 55.0 | 289.0 | 1.00369 | 3.9 | 1.56 | 14.0 | 8.0 |


From the information above I can see that some columns have larger scales (for example total sulfur dioxide) while some columns have smaller scales (such as density). Also, by comparing the mean, standard deviation, and max value of each column it seems that most columns have outliers. 
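To put a number on the impression that a column has outliers, one common rule of thumb flags values more than 1.5 times the interquartile range beyond the quartiles. The sketch below is only an illustration of that rule: the helper `count_iqr_outliers` and the toy values are hypothetical and are not part of the original analysis.

```python
import pandas as pd

def count_iqr_outliers(frame):
    """Count, per numeric column, the values lying more than 1.5 * IQR beyond the quartiles."""
    numeric = frame.select_dtypes("number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Boolean frame marking values below the lower fence or above the upper fence
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return outside.sum()

# Toy data (hypothetical values): one column with an extreme value, one without
toy = pd.DataFrame({
    "residual sugar": [1.9, 2.0, 2.2, 2.4, 2.6, 15.4],
    "pH": [3.2, 3.3, 3.3, 3.4, 3.4, 3.5],
})

outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

Calling `count_iqr_outliers(df[cols])` on the project's dataframe would give a per-column count to compare against the visual impression from the charts.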

In [11]:
# The following code is adapted from code used in Worksheet 7
def make_chart(col):
    return alt.Chart(df).mark_circle().encode(
        x= alt.X("quality", scale= alt.Scale(zero= False)),
        y= alt.Y(col, scale= alt.Scale(zero= False)),
        color= "class",
    )

Using the code above, I want to make a chart comparing each input feature column to the quality. Since we saw above that each column has a different scale, I include 'zero= False' so that each chart's axes fit the range of its own data instead of starting at zero.

In [12]:
# I only want the first 11 columns because those are the input features; the other 2 columns (quality and class) are outputs
cols = list(df.columns[:11])

In [13]:
chart_list = [make_chart(col) for col in cols]

In [14]:
# This code was also taken from Worksheet 7
total_chart = alt.vconcat(*chart_list)
total_chart

Based on the charts above, almost every column has at least one outlier, although this is most noticeable for residual sugar, chlorides, and total sulfur dioxide. Furthermore, it would be hard to classify the wine using only one input feature, since there are no clear patterns in these charts.

## Section 3: Machine Learning Models
In this final section I apply the DecisionTreeClassifier and the KNeighborsClassifier models and compare their accuracy. I also use train_test_split to test the DecisionTreeClassifier model for overfitting. For the KNeighborsClassifier model I create a confusion matrix to see its prediction results.

In [15]:
# Instantiate; I decided to use 15 max leaf nodes because there are 11 input variables and 15>11
clf = DecisionTreeClassifier(max_leaf_nodes= 15, random_state= 126)

In [16]:
#Fit; I am using all 11 feature columns and want to predict whether the wine is "good" or "bad"
clf.fit(df[cols], df["class"])

DecisionTreeClassifier(max_leaf_nodes=15, random_state=126)

In [17]:
# This reveals that the most influencing feature is Alcohol
# The 3 columns I said had most noticeable outliers are at the bottom of this chart, I wonder why
pd.Series(clf.feature_importances_ , clf.feature_names_in_).sort_values(ascending= False)

alcohol                 0.507042
volatile acidity        0.171973
sulphates               0.124870
fixed acidity           0.065348
citric acid             0.034490
total sulfur dioxide    0.031346
density                 0.022035
pH                      0.021488
chlorides               0.021406
residual sugar          0.000000
free sulfur dioxide     0.000000
dtype: float64

In [18]:
clf.score(df[cols], df["class"])

0.8758029978586723

The score for this classifier is significantly higher than random guessing would give, but it was computed on the same data the model was fit on, which makes me wonder if the model is overfitting the data. To test this classifier for overfitting, I'll divide the data into a training set and a test set using train_test_split.

In [19]:
# The training set has 60% of the data
X_train, X_test, y_train, y_test = train_test_split(
    df[cols], df["class"], train_size= 0.6, random_state= 0
)

In [20]:
# This time I only want to fit using the X_train and y_train data
clf.fit(X_train, y_train)

DecisionTreeClassifier(max_leaf_nodes=15, random_state=126)

In [21]:
# This describes its accuracy for the training data
clf.score(X_train, y_train)

0.9214285714285714

In [22]:
# This describes its accuracy for the testing data, which it has never seen
clf.score(X_test, y_test)

0.7433155080213903

Comparing the classifier's scores, the training accuracy (about 0.92) is noticeably higher than the test accuracy (about 0.74). A gap of roughly 18 percentage points suggests that the tree is overfitting the training data to some degree, although the test score is still well above random guessing.
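One way to look further into overfitting is to compare training and test accuracy for several values of max_leaf_nodes. The sketch below does this on synthetic data from `make_classification`, used here only as a stand-in with the same number of rows and features as df[cols]; the candidate leaf counts are arbitrary assumptions, and the numbers it prints say nothing about the wine data itself.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for df[cols] and df["class"] (not the wine data)
X, y = make_classification(
    n_samples=467, n_features=11, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0
)

# Map each candidate leaf count to (training accuracy, test accuracy)
results = {}
for n_leaves in [2, 5, 15, 50]:
    clf = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=126)
    clf.fit(X_train, y_train)
    results[n_leaves] = (clf.score(X_train, y_train), clf.score(X_test, y_test))

for n_leaves, (train_acc, test_acc) in results.items():
    print(n_leaves, round(train_acc, 3), round(test_acc, 3))
```

If the training accuracy keeps rising with more leaves while the test accuracy stalls or drops, that pattern would point toward overfitting at the larger sizes.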

Next I want to try a different Machine Learning model called K-Nearest Neighbors.

In [23]:
#Instantiate; 16 seems like a good number to try
# I actually initially tried 18 but the score was lower
knc = KNeighborsClassifier(n_neighbors= 16)

In [24]:
#Fit; I use the same input and output features as before
knc.fit(df[cols], df["class"])

KNeighborsClassifier(n_neighbors=16)

In [25]:
knc.score(df[cols], df["class"])

0.721627408993576

This model's score is still much higher than random guessing, but it is not as good as the DecisionTreeClassifier model.
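Since the score above is computed on the same rows the model was fit on, it can help to compare a few values of n_neighbors on held-out data as well. The sketch below is a rough illustration on synthetic data from `make_classification` (a stand-in, not the wine data); the list of candidate values is an assumption, with 16 and 18 included because those are the values tried above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for df[cols] and df["class"] (not the wine data)
X, y = make_classification(
    n_samples=467, n_features=11, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0
)

# Map each candidate k to (training accuracy, test accuracy)
knn_scores = {}
for k in [1, 5, 16, 18]:
    knc = KNeighborsClassifier(n_neighbors=k)
    knc.fit(X_train, y_train)
    knn_scores[k] = (knc.score(X_train, y_train), knc.score(X_test, y_test))

for k, (train_acc, test_acc) in knn_scores.items():
    print(k, round(train_acc, 3), round(test_acc, 3))
```

With k = 1 each training point is its own nearest neighbor, so the training accuracy is perfect here regardless of how well the model generalizes. That is one reason the test score is the more informative number when choosing k.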

Next I want to see which wines were misclassified and how many.

In [26]:
df["pred_knc"] = knc.predict(df[cols])

In [27]:
knc.classes_

array(['bad', 'good'], dtype=object)

In [28]:
# This shows the model's confidence in its predictions; note that several rows are 50/50
arr = knc.predict_proba(df[cols])
arr

array([[0.5   , 0.5   ],
       [0.4375, 0.5625],
       [0.75  , 0.25  ],
       [0.4375, 0.5625],
       [0.875 , 0.125 ],
       [0.5625, 0.4375],
       [0.5625, 0.4375],
       [0.6875, 0.3125],
       [0.25  , 0.75  ],
       [0.25  , 0.75  ],
       [0.375 , 0.625 ],
       [0.8125, 0.1875],
       [0.1875, 0.8125],
       [0.1875, 0.8125],
       [0.5   , 0.5   ],
       [0.25  , 0.75  ],
       [0.5   , 0.5   ],
       [0.25  , 0.75  ],
       [0.625 , 0.375 ],
       [0.375 , 0.625 ],
       [0.625 , 0.375 ],
       [0.5625, 0.4375],
       [0.5625, 0.4375],
       [0.75  , 0.25  ],
       [0.75  , 0.25  ],
       [0.25  , 0.75  ],
       [0.625 , 0.375 ],
       [0.375 , 0.625 ],
       [0.1875, 0.8125],
       [0.5625, 0.4375],
       [0.25  , 0.75  ],
       [0.1875, 0.8125],
       [0.4375, 0.5625],
       [0.4375, 0.5625],
       [0.25  , 0.75  ],
       [0.1875, 0.8125],
       [0.25  , 0.75  ],
       [0.5625, 0.4375],
       [0.625 , 0.375 ],
       [0.5625, 0.4375],


In [29]:
# Here I add the probabilities into df as new columns
df["knc_bad_proba"] = arr[:,0]
df["knc_good_proba"] = arr[:,1]

In [30]:
# Here I make a smaller dataframe containing all the rows where the class probability was 50/50
df2 = df[df["knc_bad_proba"] == 0.5]
df2

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | class | pred_knc | knc_bad_proba | knc_good_proba |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 7.3 | 0.65 | 0.0 | 1.2 | 0.065 | 15.0 | 21.0 | 0.9946 | 3.39 | 0.47 | 10.0 | 7 | good | bad | 0.5 | 0.5 |
| 259 | 10.0 | 0.31 | 0.47 | 2.6 | 0.085 | 14.0 | 33.0 | 0.99965 | 3.36 | 0.8 | 10.5 | 7 | good | bad | 0.5 | 0.5 |
| 267 | 7.9 | 0.35 | 0.46 | 3.6 | 0.078 | 15.0 | 37.0 | 0.9973 | 3.35 | 0.86 | 12.8 | 8 | good | bad | 0.5 | 0.5 |
| 440 | 12.6 | 0.31 | 0.72 | 2.2 | 0.072 | 6.0 | 29.0 | 0.9987 | 2.88 | 0.82 | 9.8 | 8 | good | bad | 0.5 | 0.5 |
| 501 | 10.4 | 0.44 | 0.73 | 6.55 | 0.074 | 38.0 | 76.0 | 0.999 | 3.17 | 0.85 | 12.0 | 7 | good | bad | 0.5 | 0.5 |
| 502 | 10.4 | 0.44 | 0.73 | 6.55 | 0.074 | 38.0 | 76.0 | 0.999 | 3.17 | 0.85 | 12.0 | 7 | good | bad | 0.5 | 0.5 |
| 538 | 12.9 | 0.35 | 0.49 | 5.8 | 0.066 | 5.0 | 35.0 | 1.0014 | 3.2 | 0.66 | 12.0 | 7 | good | bad | 0.5 | 0.5 |
| 586 | 11.1 | 0.31 | 0.49 | 2.7 | 0.094 | 16.0 | 47.0 | 0.9986 | 3.12 | 1.02 | 10.6 | 7 | good | bad | 0.5 | 0.5 |
| 588 | 5.0 | 0.42 | 0.24 | 2.0 | 0.06 | 19.0 | 50.0 | 0.9917 | 3.72 | 0.74 | 14.0 | 8 | good | bad | 0.5 | 0.5 |
| 589 | 10.2 | 0.29 | 0.49 | 2.6 | 0.059 | 5.0 | 13.0 | 0.9976 | 3.05 | 0.74 | 10.5 | 7 | good | bad | 0.5 | 0.5 |


This smaller dataframe shows that 31 rows had, according to this model, a 50/50 chance of being classified as "bad" or "good". Since "bad" is the first class listed in knc.classes_, the model predicted "bad" for each of these wines. However, more than half of these rows actually have class "good", so at least 15 wines were misclassified as "bad".
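The tie behavior described above can be reproduced on a tiny example. In the sketch below, the one-dimensional points and the names `X_toy`, `y_toy`, and `knc_toy` are hypothetical; with n_neighbors=2 the query point has one "bad" and one "good" neighbor, so the vote is tied.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D data (hypothetical values): two "bad" and two "good" points
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
y_toy = np.array(["bad", "good", "bad", "good"])

# An even number of neighbors makes tied votes possible
knc_toy = KNeighborsClassifier(n_neighbors=2)
knc_toy.fit(X_toy, y_toy)

# The two nearest neighbors of 0.5 are 0.0 ("bad") and 1.0 ("good")
query = np.array([[0.5]])
proba = knc_toy.predict_proba(query)
pred = knc_toy.predict(query)

print(knc_toy.classes_)  # class order used for the probability columns
print(proba)             # tied vote between the two classes
print(pred)              # the tie goes to the first class in classes_
```

Choosing an odd n_neighbors would avoid exact ties in a two-class problem like this one.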

In order to see how many of each class the model misclassified I want to create a confusion matrix. Based on the above, it seems that the model should have misclassified more "good" wines.

In [31]:
c1 = alt.Chart(df).mark_rect().encode(
    x= "class",
    y= "pred_knc",
    color= alt.Color("count()", scale= alt.Scale(scheme= "redblue"))
)

c2 = alt.Chart(df).mark_text(color= "white").encode(
    x= "class",
    y= "pred_knc",
    text= "count()"
)

(c1 + c2).properties(
    height= 250,
    width= 250
)

Sure enough, the model misclassified more "good" wines. df did contain more rows whose class is "bad" than rows whose class is "good", so I feel like that might have influenced the model to classify more wines as "bad".
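To read the exact counts behind a chart like this, the same table can be computed with `pd.crosstab`. The sketch below uses a toy dataframe with hypothetical labels, arranged like the class and pred_knc columns of df.

```python
import pandas as pd

# Toy labels (hypothetical values) laid out like df["class"] and df["pred_knc"]
toy = pd.DataFrame({
    "class":    ["good", "good", "good", "bad", "bad", "bad"],
    "pred_knc": ["good", "bad",  "bad",  "bad", "bad", "good"],
})

# Rows are predicted labels, columns are true labels
confusion = pd.crosstab(toy["pred_knc"], toy["class"])
print(confusion)

# Number of truly "good" wines that were predicted "bad"
print(confusion.loc["bad", "good"])
```

On the project's data, `pd.crosstab(df["pred_knc"], df["class"])` should show the same four counts as the text labels in the chart.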

## Summary

Overall, I used the wine dataset for classification. The models I focused on were DecisionTreeClassifier and KNeighborsClassifier, and DecisionTreeClassifier turned out to be more accurate. Judging by the predictions the KNeighborsClassifier model made, and given that df had unequal quantities of good and bad wine, it was not a very accurate model.

## References


* What is the source of your dataset(s)?

[Wine Dataset](https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009)
I found this dataset on Kaggle

* List any other references that you found helpful.

[K-Nearest Neighbors](https://christopherdavisuci.github.io/UCI-Math-10-W22/Week6/Week6-Wednesday.html)
Here is where I learned how to use KNeighborsClassifier

[Worksheet 7](https://christopherdavisuci.github.io/UCI-Math-10-S23/Week4/Worksheet7.html)
This worksheet helped me with plotting various charts all together

[red_wine_classification](https://www.kaggle.com/code/maxzen/red-wine-classification)
This is a notebook from Kaggle that gave me a few ideas for analyzing the dataframe
