Analysis of Alcohol Drinking among UCI Students
Contents
Analysis of Alcohol Drinking among UCI Students¶
Author: Jingxi.Feng (30294115)
Course Project, UC Irvine, Math 10, S22
Introduction¶
The dataset I am using is real data from another class I am currently taking: Statistics 7. In that class, we learn different statistical testing methods and how to draw conclusions about a hypothesis from the test results. The overlap between the statistics class and Math 10 made me wonder whether I could apply my Math 10 knowledge to Statistics 7. The dataset comes from my Statistics 7 project, in which I created a survey and computed test statistics; its purpose was to find out the drinking motivations and drinking frequency of students at UCI. With the help of the professor, I collected information from 314 students and prepared to test my hypotheses using Math 10 skills.
At the end, I will also introduce an extra dataset from Duke University for comparison.
Main portion of the project¶
Section 1: General Look of the Dataset
Section 2: Drinking Level
Section 3: Descriptive Statistics in Drinking Level
Section 4: Drinking Frequency
Section 5: Inferential Statistics in Drinking Frequency
Section 6: Predicting Data for Another College
Section 1: General Look of the Dataset
Here, I exported the dataset from JMP into Excel, and use the following code to read the Excel file into pandas.
!pip install openpyxl
import pandas as pd
import altair as alt
df=pd.read_excel("Stastic Data for Project.xlsx",na_values="nan")
df
Do you have a clear understanding of what is alcoholic drinking? | Have you experienced alcoholic drinking? | Please select all the factors that applied to your reasons for alcolchol drinking. (if the you can not find your reason below, please fill out the blank in Other) | If you responded "Other" to the previous question, put your answer here. Otherwise, type NA. | What is your usual drinking level? | What's your drinking frequency (per week)? | What gender do you most identify as? | How old will you be at December 31, 2022 (After your birthday this year)? | What is your current year at UCI? | GPA | Mood | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Confident | No | NaN | NaN | NaN | Never | Female | 19.0 | First year | 3.28 | NaN |
1 | Confident | Yes | Addiction to Acholoc | NaN | Moderate (tiny dizzy) | 1-2 times | Male | 23.0 | Second year | 3.10 | NaN |
2 | Very confident | Yes | Release stress from school works,Socailizing w... | NaN | Mild | Never | Female | 20.0 | Second year | 3.40 | NaN |
3 | Confident | Yes | Socailizing with peers | NaN | Mild | 1-2 times | Male | 20.0 | First year | 3.68 | NaN |
4 | Confident | Yes | Having fun | na | Mild | 1-2 times | Female | 20.0 | Fourth year | 3.60 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
336 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6.0 |
337 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.0 |
338 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 |
339 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
340 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 |
341 rows × 11 columns
We see that there are some missing and bad values caused by non-response bias and response bias. We will not drop every missing value, since the sample is small.
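A quick per-column count of missing values (a small sketch using the dataframe loaded above) shows where the non-response is concentrated:
# Count the missing values in each column to see where non-response is concentrated.
df.isna().sum().sort_values(ascending=False)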
Also, to make the dataset easier to work with, we "normalize" the dataframe by shortening the column names (each column name corresponds to a survey question).
df.rename({"Do you have a clear understanding of what is alcoholic drinking?":"understanding"},axis=1,inplace=True)
df.rename({"Have you experienced alcoholic drinking?":"experience"},axis=1,inplace=True)
df.rename({"Please select all the factors that applied to your reasons for alcolchol drinking. (if the you can not find your reason below, please fill out the blank in Other)":"reason"},axis=1,inplace=True)
df.rename({"If you responded 'Other' to the previous question, put your answer here. Otherwise, type NA.":"Otherreason"},axis=1,inplace=True)
df.rename({"What is your usual drinking level?":"level"},axis=1,inplace=True)
df.rename({"What's your drinking frequency (per week)?":"Freq"},axis=1,inplace=True)
df.rename({"What gender do you most identify as?":"gender"},axis=1,inplace=True)
df.rename({"How old will you be at December 31, 2022 (After your birthday this year)?":"Year"},axis=1,inplace=True)
df.rename({"What is your current year at UCI?":"grade"},axis=1,inplace=True)
However, I do want to drop some of the NaN values. The "Other" column contains many NaN values (since most participants do not have another reason for drinking alcohol), so before dropping rows with NaN values, I need to drop this column first.
df.drop('If you responded "Other" to the previous question, put your answer here. Otherwise, type NA.',axis=1,inplace=True)
df=df.dropna()
Section 2: Drinking Level
Here, I want to see the relation between drinking level and a student's grade. Before that, I need to unify the participants' answers into the same grading scheme. For example, some participants answered "First year" while others answered "Freshman", so I change every "Freshman" answer into "First year", and do the same for the other class levels.
df["grade"].value_counts()
First year 82
Freshman 54
Second year 48
Sophomore 28
Third year 18
Fourth year 10
Junior 7
Fifth year or later 2
Graduate Student 1
Senior 1
Name: grade, dtype: int64
for i in df["grade"]:
if i=="Freshman":
df["grade"].replace({i:"First year"},inplace=True)
if i=="Sophomore":
df["grade"].replace({i:"Second year"},inplace=True)
if i=="Junior":
df["grade"].replace({i:"Third year"},inplace=True)
if i=="Senior":
df["grade"].replace({i:"Third year"},inplace=True)
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/series.py:4515: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
method=method,
I set df2 as a copy of df, so that I can perform more pandas operations without changing the original dataframe.
df2=df.copy()
I would like to see the counts of each drinking level, so I create indicator columns "Moderate", "Mild", and "Over drinking" to record each case.
df2["Moderate"]=0
df2["Mild"]=0
df2["Over drinking"]=0
df2.loc[df["level"]=="Moderate (tiny dizzy)","Moderate"]=1
df2.loc[df["level"]=="Mild","Mild"]=1
df2.loc[df["level"]=="Over drinking (e.g. Losing consciousness)","Over drinking"]=1
Here, I want to see how drinking level relates to grade. I show and compare the mean of each indicator column (which is the proportion of each grade in each case), because the grade distribution in the survey is not equal, so comparing proportions is more meaningful than comparing raw sums. In the visualization, I also set color="count()" to show that first-year and second-year students are the majority groups in my data, since the color is deeper where there are more counts.
cols=["Mild","Moderate","Over drinking"]
chart_list=[]
for i in cols:
    c=alt.Chart(df2).mark_bar().encode(
        x="grade",
        y=f"mean({i})",
        color="count()"
    )
    chart_list.append(c)
alt.hconcat(*chart_list)
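As a numeric cross-check of what these charts encode, the same proportions can be computed with a groupby (a small sketch using the columns defined above):
# The mean of each 0/1 level column within a grade is the proportion of that
# grade reporting that drinking level, matching the bar heights above.
df2.groupby("grade")[["Mild","Moderate","Over drinking"]].mean().round(3)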
I also want to see whether there are confounding variables (variables other than the ones we are interested in), so I present bar graphs of gender and level. (In other words, I am concerned that gender might affect the conclusion we draw.) In the graphs below, we find that there are more female students in each case, but that simply reflects the larger number of female students in my sample.
chart_list2=[]
for i in cols:
    c=alt.Chart(df2).mark_bar().encode(
        x="gender",
        y=f"sum({i})",
        color="count()"
    )
    chart_list2.append(c)
alt.hconcat(*chart_list2)
df2
understanding | experience | reason | level | Freq | gender | Year | grade | GPA | Mood | Moderate | Mild | Over drinking | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
27 | Very confident | Yes | Release stress from school works,Socailizing w... | Mild | Never | Female | 19.0 | First year | 2.500 | 7.0 | 0 | 1 | 0 |
28 | Confident | No | Other | Mild | Never | Female | 20.0 | Second year | 3.400 | 5.0 | 0 | 1 | 0 |
29 | Confident | No | Release stress from school works,Release emoti... | Mild | Never | Female | 20.0 | Second year | 3.600 | 6.0 | 0 | 1 | 0 |
30 | Confident | No | Release emotion from relationship | Moderate (tiny dizzy) | 1-2 times | Male | 21.0 | Second year | 3.000 | 5.0 | 1 | 0 | 0 |
31 | Very confident | Yes | Socailizing with peers,Having fun | Mild | 1-2 times | Female | 20.0 | Second year | 3.339 | 10.0 | 0 | 1 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
307 | Very confident | Yes | Socailizing with peers,Having fun | Moderate (tiny dizzy) | 1-2 times | Female | 19.0 | First year | 3.914 | 5.0 | 1 | 0 | 0 |
308 | Confident | No | Other | Mild | Never | Female | 20.0 | Second year | 3.694 | 1.0 | 0 | 1 | 0 |
309 | Very confident | Yes | Release stress from school works,Socailizing w... | Moderate (tiny dizzy) | Never | Female | 20.0 | Second year | 3.800 | 10.0 | 1 | 0 | 0 |
311 | Very confident | Yes | Socailizing with peers,Having fun | Moderate (tiny dizzy) | 1-2 times | Male | 21.0 | First year | 3.000 | 8.0 | 1 | 0 | 0 |
313 | Very confident | No | Other | Mild | 3-4 times | Female | 19.0 | First year | 3.400 | 5.0 | 0 | 1 | 0 |
251 rows × 13 columns
After seeing the graphs, I am interested in using grade and gender to predict a student's drinking level. Here, I set up numerical (0/1) indicator columns for each variable, which prepares the data for the logistic regression machine learning model.
for i in df2["gender"].unique():
df2[i]=0
df2.loc[df2["gender"]==i,i]=1
for i in df2["grade"].unique():
df2[i]=0
df2.loc[df2["grade"]==i,i]=1
genderandgrade=[i for i in df2["gender"].unique()]+[i for i in df2["grade"].unique()]
I also split the data into training and test groups to see whether the fitted model predicts the test group accurately. The clf.score comes out around 0.53 to 0.57 (it varies between runs because the train/test split is random), so I conclude the model does not predict the test group very accurately.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
x_train,x_test,y_train,y_test=train_test_split(df2[genderandgrade],df2["level"],train_size=0.4)
clf=LogisticRegression()
clf.fit(x_train,y_train)
clf.score(x_test,y_test)
0.5695364238410596
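For context, this score can be compared against a simple baseline that always predicts the most common drinking level in the training data (a minimal sketch using the split defined above):
# Baseline accuracy on the same test set: always predict the most frequent
# drinking level seen in the training data.
most_common = y_train.value_counts().idxmax()
baseline = (y_test == most_common).mean()
print(f"baseline accuracy: {baseline:.3f}, logistic regression accuracy: {clf.score(x_test, y_test):.3f}")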
I set up a DataFrame containing all of my features. I set the columns for my own gender ("Male") and year to 1 and everything else to 0, representing myself. The output below is the predicted drinking level, which surprised me because it matches my actual drinking level.
## Provide an example of myself to try the model on
JingxiFeng=pd.DataFrame({"Male":[1],"Female":[0],"Prefer not to say":[0],"Non-binary/Non-conforming":[0],"Genderqueer":[0],"Second year":[0],"First year":[1],"Fourth year":[0],"Fifth year or later":[0],"Third year":[0],"Graduate Student":[0]})
a=clf.predict(JingxiFeng[genderandgrade])
print(a)
['Mild']
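Besides the single predicted label, the class probabilities can also be inspected (a sketch on the same example row):
# Predicted probability of each drinking level for the example row above.
probs = clf.predict_proba(JingxiFeng[genderandgrade])[0]
for level_name, p in zip(clf.classes_, probs):
    print(f"{level_name}: {p:.3f}")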
Section 3: Descriptive Statistics in Drinking Level
Having seen both the graphs and the machine learning results, I am interested in setting up a hypothesis that grade is associated with drinking level. Hence, I build a chi-square table in order to check the association between Moderate drinking and grade.
import numpy as np
from scipy.stats import chi2_contingency
import seaborn as sns
import matplotlib.pyplot as plt
Below is the DataFrame version of the chi-square table.
Chi_square_table= pd.crosstab(df2['grade'], df2['Moderate'])
Chi_square_table.drop(index=["Fifth year or later","Graduate Student"],inplace=True)
Chi_square_table.rename({0:"Not Moderate",1:"Is Moderate"},axis=1)
Moderate | Not Moderate | Is Moderate |
---|---|---|
grade | ||
First year | 85 | 51 |
Fourth year | 5 | 5 |
Second year | 41 | 35 |
Third year | 15 | 11 |
Below is a visual version of the chi-square table. Looking at the data, my general guess is that grade does not have a strong association with the Moderate level of drinking, because each cell is close to its expected value (the value we would expect if there were no relationship).
plt.figure(figsize=(15,8))
sns.heatmap(Chi_square_table, annot=True, cmap="YlGnBu")
<AxesSubplot:xlabel='Moderate', ylabel='grade'>
Below are the test statistic and p-value. I set a significance level of 0.05, and since the p-value is far above this level, I conclude that there is not sufficient evidence that grade is strongly associated with drinking level in this Statistics 7 class sample.
test_statistic, pvalue, dof, expected = chi2_contingency(Chi_square_table)
test_statistic=test_statistic.round(3)
pvalue=pvalue.round(3)
f"the test stastistic is {test_statistic}, and the P-value is {pvalue} which is higher than the significant level of 0.05"
'the test stastistic is 1.841, and the P-value is 0.606 which is higher than the significant level of 0.05'
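Since the guess above compares each cell with its expected value, the expected counts returned by chi2_contingency can also be displayed next to the observed table (a small sketch):
# Expected counts under the assumption of no association, labeled with the
# same rows and columns as the observed chi-square table.
pd.DataFrame(expected, index=Chi_square_table.index, columns=Chi_square_table.columns).round(1)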
Section 4: Drinking Frequency
In this section, I am interested in drinking frequency and drinking reasons. I first create an indicator column for each individual frequency value in order to count each case, and then rename these columns to shorter names.
for i in df2.Freq.unique():
    df2[i]=0
    df2.loc[df2["Freq"]==i,i]=1
df2.rename({"Never":"Zerotime","1-2 times":"1to2","3-4 times":"3to4","More than 4 times":"4more"},axis=1,inplace=True)
To visualize how drinking frequency relates to GPA, I add an interactive selection to the Altair chart so each drinking group can be clicked on individually. I also set the tooltip to the mean GPA so each group's GPA can be read clearly. We can notice that students who drink more than 4 times per week tend to have a higher GPA (I am surprised that the sample from my statistics class gives this answer).
multi = alt.selection_multi()
alt.Chart(df2).mark_bar().encode(
x="Freq",
y=alt.Y("mean(GPA)",scale=alt.Scale(domain=(3,4))),
color=alt.condition(multi, 'Freq', alt.value('lightgray')),
tooltip=["mean(GPA)"]
).add_selection(multi)
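The tooltip values can be confirmed numerically (a sketch):
# Mean GPA within each drinking-frequency group, the same values shown in the tooltip.
df2.groupby("Freq")["GPA"].mean().round(3)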
After seeing the previous graph, I decide to record whether "Socailizing with peers" appears among each student's drinking reasons.
# Collapse the free-text reasons into two groups: "social" if the answer
# includes "Socailizing with peers", otherwise "no social".
df2["reason"] = df2["reason"].apply(
    lambda r: "social" if "Socailizing with peers" in r else "no social"
)
I then create a chart list to see how the social and no-social reasons relate to drinking frequency.
cols1=["Zerotime","1to2","3to4","4more"]
chart_list3=[]
for i in cols1:
    single = alt.selection_single()
    c=alt.Chart(df2).mark_bar().encode(
        x="reason",
        y=f"mean({i})",
        color=alt.condition(single, 'reason', alt.value('lightgray')),
        tooltip="reason"
    ).add_selection(single)
    chart_list3.append(c)
alt.hconcat(*chart_list3)
To look further at the relationship between GPA and drinking, I also create a boxplot to compare the social-drinking and no-social-drinking groups.
We see that the no-social group has a higher median GPA, but the social group has a higher maximum.
alt.Chart(df2).mark_boxplot().encode(
x=alt.X("GPA",scale=alt.Scale(zero=False)),
y="reason",
)
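The medians and maxima shown in the boxplot can be checked directly (a sketch):
# Median and maximum GPA for the social and no-social groups, matching the boxplot.
df2.groupby("reason")["GPA"].agg(["median","max"])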
As a next step, I plan to use machine learning to find the major influences on drinking frequency. I set max_depth=3 and max_leaf_nodes=4 (there are 4 drinking-frequency cases) because I do not want the classifier to go too deep and become too complex (which would result in overfitting).
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import log_loss
clf1=DecisionTreeClassifier(max_depth=3,max_leaf_nodes=4)
drinkingreason=["GPA","Mood"]+cols1
from sklearn import tree
Surprisingly, based on the graph, this model shows that the machine decides the class mostly based on gender rather than grade. This seems reasonable because, in my statistics class sample, there is not much variability in grade (since almost everyone is in the first or second year).
clf1.fit(df2[drinkingreason],df2["Freq"])
plt.figure(figsize=(10,10))
tree.plot_tree(
    clf1,
    feature_names=drinkingreason,
    class_names=clf1.classes_,
    filled=True
);
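To see numerically which features the fitted tree relies on most, the feature importances can be listed (a sketch using the feature list defined above):
# Feature importances of the fitted decision tree, sorted from most to least
# influential; the values sum to 1.
pd.Series(clf1.feature_importances_, index=drinkingreason).sort_values(ascending=False).round(3)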
Section 5: Inferential Statistics in Drinking Frequency
After seeing the graphs and machine learning results above, I decide to conduct a one-proportion z-test and construct a confidence interval.
!pip install statsmodels==0.13.2
from statsmodels.stats.proportion import proportions_ztest
In this test, I want to explore the proportion of all UCI students who drink 1 to 2 times per week. I set the number of successes equal to the number of 1-to-2-times cases, the number of trials equal to the whole sample, and the hypothesized value to about 0.37. The null hypothesis is that the population proportion equals this value; the alternative hypothesis is that it does not.
## This shows that approximately 37 percent of the Statistics 7 class sample drinks 1 to 2 times per week.
## I will test whether this proportion also holds for the whole UCI population.
null_value=(df2["1to2"].sum()/len(df2["1to2"])).round(3)
print(null_value)
0.375
number_of_successes=df2["1to2"].sum()
number_of_trials=len(df2["1to2"])
hypothesized_proportion=null_value
z_test_statistic,pvalue=proportions_ztest(count=number_of_successes,nobs=number_of_trials,value=hypothesized_proportion)
Here, I set a significance level of 0.05.
z_test_statistic=z_test_statistic.round(3)
pvalue=pvalue.round(5)
print(f"the statistic for the proportion ztest is {z_test_statistic},and the P-value is {pvalue}")
if pvalue>=0.05:
print("the pvalue is equal and greater than 0.05 significant level,and hence fail to reject null hypothesis")
print("Hence it shows there would likely to have 37 percent of UCI student drink 1 to 2 time per weeks")
elif pvalue<0.05:
print("the pvalue is lower than 0.05 significant level,and hence reject null hypothesis")
print("Hence it shows there would not likely to have 37 percent of UCI student drink 1 to 2 time per weeks")
the statistic for the proportion ztest is -0.016,and the P-value is 0.98699
the pvalue is equal and greater than 0.05 significant level,and hence fail to reject null hypothesis
Hence it shows there would likely to have 37 percent of UCI student drink 1 to 2 time per weeks
I also construct a 95 percent confidence interval for the test in order to estimate the interval in which the true population proportion (for all UCI students) lies.
from statsmodels.stats.proportion import proportion_confint
lower,upper=proportion_confint(count=number_of_successes,nobs=number_of_trials,alpha=(1-0.95))
lower=lower.round(3)
upper=upper.round(3)
print(f"we are 95% confident that the proportion for all UCI students who drink 1 to 2 times per weeks would between {lower} and {upper}")
we are 95% confident that the proportion for all UCI students who drink 1 to 2 times per weeks would between 0.315 and 0.434
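For reference, the same interval can be computed by hand with the normal-approximation formula phat ± 1.96*sqrt(phat*(1-phat)/n); this sketch should agree with the statsmodels output above:
import numpy as np
# Normal-approximation confidence interval computed by hand; it should match
# the proportion_confint result above.
phat = number_of_successes / number_of_trials
margin = 1.96 * np.sqrt(phat * (1 - phat) / number_of_trials)
print(f"by hand: ({round(phat - margin, 3)}, {round(phat + margin, 3)})")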
Section 6: Predicting Data for Another College
In this case, I want to use GPA and gender as inputs to predict what drinking level and frequency would look like at Duke University.
Duke=pd.read_csv("gpa.csv",na_values=" ")
Duke.dropna()
gpa | studyweek | sleepnight | out | gender | |
---|---|---|---|---|---|
0 | 3.890 | 50 | 6.0 | 3.0 | female |
1 | 3.900 | 15 | 6.0 | 1.0 | female |
2 | 3.750 | 15 | 7.0 | 1.0 | female |
3 | 3.600 | 10 | 6.0 | 4.0 | male |
4 | 4.000 | 25 | 7.0 | 3.0 | female |
5 | 3.150 | 20 | 7.0 | 3.0 | male |
6 | 3.250 | 15 | 6.0 | 1.0 | female |
7 | 3.925 | 10 | 8.0 | 3.0 | female |
8 | 3.428 | 12 | 8.0 | 2.0 | female |
9 | 3.800 | 2 | 8.0 | 4.0 | male |
10 | 3.900 | 10 | 8.0 | 1.0 | female |
11 | 2.900 | 30 | 6.0 | 2.0 | female |
12 | 3.925 | 30 | 7.0 | 2.0 | female |
13 | 3.650 | 21 | 9.0 | 3.0 | female |
14 | 3.750 | 10 | 8.5 | 3.5 | female |
15 | 4.670 | 14 | 6.5 | 3.0 | male |
16 | 3.100 | 12 | 7.5 | 3.5 | male |
17 | 3.800 | 12 | 8.0 | 1.0 | female |
18 | 3.400 | 4 | 9.0 | 3.0 | female |
19 | 3.575 | 45 | 6.5 | 1.5 | female |
20 | 3.850 | 6 | 7.0 | 2.5 | female |
21 | 3.400 | 10 | 7.0 | 3.0 | female |
22 | 3.500 | 12 | 8.0 | 2.0 | male |
23 | 3.600 | 13 | 6.0 | 3.5 | female |
24 | 3.825 | 35 | 8.0 | 4.0 | female |
25 | 3.925 | 10 | 8.0 | 3.0 | female |
26 | 4.000 | 40 | 8.0 | 3.0 | female |
27 | 3.425 | 14 | 9.0 | 3.0 | female |
28 | 3.750 | 30 | 6.0 | 0.0 | female |
29 | 3.150 | 8 | 6.0 | 0.0 | female |
30 | 3.400 | 8 | 6.5 | 2.0 | female |
31 | 3.700 | 20 | 7.0 | 1.0 | female |
32 | 3.360 | 40 | 7.0 | 1.0 | female |
33 | 3.700 | 15 | 7.0 | 1.5 | male |
34 | 3.700 | 25 | 5.0 | 1.0 | female |
35 | 3.600 | 10 | 7.0 | 2.0 | female |
36 | 3.825 | 18 | 7.0 | 1.5 | female |
37 | 3.200 | 15 | 6.0 | 1.0 | female |
38 | 3.500 | 30 | 8.0 | 3.0 | male |
39 | 3.500 | 11 | 7.0 | 1.5 | female |
40 | 3.000 | 28 | 6.0 | 1.5 | female |
41 | 3.980 | 4 | 7.0 | 1.5 | female |
42 | 3.700 | 4 | 5.0 | 1.0 | male |
43 | 3.810 | 25 | 7.5 | 2.5 | female |
44 | 4.000 | 42 | 5.0 | 1.0 | female |
45 | 3.100 | 3 | 7.0 | 2.0 | male |
46 | 3.400 | 42 | 9.0 | 2.0 | male |
47 | 3.500 | 25 | 8.0 | 2.0 | male |
48 | 3.650 | 20 | 6.0 | 2.0 | female |
49 | 3.700 | 7 | 8.0 | 2.0 | female |
50 | 3.100 | 6 | 8.0 | 1.0 | female |
51 | 4.000 | 20 | 7.0 | 3.0 | female |
52 | 3.350 | 45 | 6.0 | 2.0 | female |
53 | 3.541 | 30 | 7.5 | 1.5 | female |
54 | 2.900 | 20 | 6.0 | 3.0 | female |
for i in Duke.gender.unique():
    Duke[i]=0
    Duke.loc[Duke["gender"]==i,i]=1
Duke.rename({"gpa":"GPA","female":"Female","male":"Male"},axis=1,inplace=True)
In the following code, I find that all 55 samples in the Duke University data are predicted not to drink at all per week. (Notice that this might be caused by the GPA feature: Duke students tend to have higher GPAs, which makes the model more likely to output never drinking per week.) Another major reason could be that there is still some response-bias data in the original dataset, which might distort the input.
clf_freq=LogisticRegression()
clf_freq.fit(df2[["Female","Male","GPA"]],df2["Freq"])
Duke["Freq"]=clf_freq.predict(Duke[["Female","Male","GPA"]])
print(Duke["Freq"].value_counts())
Never 55
Name: Freq, dtype: int64
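Since every Duke row is predicted as Never, the predicted class probabilities give a sense of how close the model was to any other frequency (a sketch using the clf_freq model fitted above):
# Average predicted probability of each frequency class across the Duke rows.
freq_probs = clf_freq.predict_proba(Duke[["Female","Male","GPA"]])
pd.DataFrame(freq_probs, columns=clf_freq.classes_).mean().round(3)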
clf_level=LogisticRegression()
clf_level.fit(df2[["Female","Male","GPA"]],df2["level"])
Duke["level"]=clf_level.predict(Duke[["Female","Male","GPA"]])
print(Duke["level"].value_counts())
Mild 51
Moderate (tiny dizzy) 4
Name: level, dtype: int64
Here, since there is no meaningful comparison between UCI and Duke for drinking frequency (every Duke prediction is Never), I decide to compare the drinking levels instead.
# Relabel the Moderate category so that it matches the UCI indicator column name.
Duke.loc[Duke["level"]=="Moderate (tiny dizzy)","level"]="Moderate"
In conclusion, we predict that Duke University might have fewer Moderate-drinking students and more Mild-drinking students than UCI.
for i in Duke["level"].unique():
a=(Duke["level"]==i).sum()/len(Duke["level"])-(df2[i].sum()/len(df2[i]))
a=(a*100).round(2)
if a>0:
print(f"Duke University has {a} percent higher of {i} level of drinking student than UCI")
elif a<0:
print(f"Duke University has {a} percent lower of {i} level drinking student than UCI")
elif a==0:
print(f"Duke University has the same percent of {i} level drinking student as UCI")
Duke University has 37.35 percent higher of Mild level of drinking student than UCI
Duke University has -34.56 percent lower of Moderate level drinking student than UCI
Summary¶
Overall, I used Altair charts and machine learning to find that, based on my survey dataset, UCI students are likely to have a moderate level of drinking, and they tend to drink 1 to 2 times per week, mostly for socializing with peers. The chi-square test tells me that there is no strong association between grade and drinking level among the UCI Statistics 7 class students. Meanwhile, both the one-proportion z-test and the confidence interval suggest that about 37 percent of UCI students drink 1 to 2 times per week. Last, by using machine learning to compare Duke University and UCI, the model predicts that Duke University might tend to have a lower drinking frequency, but a higher share of Mild-level drinking.
References¶
The extra dataset (Duke University's GPA data) comes from OpenIntro: https://www.openintro.org/data/index.php?data=gpa. It is provided by researchers from that organization.
The main dataset ("Stastic Data for Project.xlsx") was created by me from my own survey.
Portions of the code and ideas were adapted from the following sources:
https://medium.com/swlh/how-to-run-chi-square-test-in-python-4e9f5d10249d (chi-square test: this coding tutorial provided an example of how to construct and run a chi-square test.)
https://altair-viz.github.io/altair-tutorial/notebooks/04-Compound-charts.html (interactivity in Altair: this tutorial describes the different interactive functions in Altair, which I browsed through its menu.)
https://www.statology.org/one-proportion-z-test-python/ (one-proportion z-test: this website shows how to run a one-proportion z-test in Python and the related pandas usage.)
https://akshay-a.medium.com/confidence-interval-for-population-proportion-basic-understanding-in-python-56b8cc5f8320 (confidence interval: this website shows how to build a confidence interval for a population proportion; its code inspired my own 95 percent confidence interval.)