Week 1 Tuesday Discussion
Meeting Times TuTh: 14:00-14:50 in ALP 3600
Office Hours: TBA but will be over Zoom
TA: Yasmeen Baki (ybaki@uci.edu)
Plan for Today:
Introductions and time to find people to work with ~10 minutes
Overview of discussion policies ~5 minutes
Getting comfortable with Deepnote (e.g. markdown versus code cells)
Practice with uploading data, pandas, and getting started on Homework 1
Overview of Our Discussion Sections
Purpose: Discussion sections are a time for you to reinforce the material that you have been learning in lecture throughout the week. My general plan is for us to try some exercises and go through some homework all together as a class, but also to leave time for individual work where Chupeng, Yufei, and I can go around answering specific questions you may have.
Quizzes: Quizzes will typically be during the last 20 minutes of our Tuesday discussions. I will give a review of the quiz material during the first 30 minutes of our discussion on these days.
Office Hours: Office hours (and Ed Discussion!) are some of the best places to get fast help. Please do not ask detailed questions about your code right before or after our discussion; this can cause serious delays for our class and for the classes that meet right afterwards.
Email Policy: Email should be reserved for personal/private concerns (e.g. illness, family emergency, etc.), and not for homework or lecture related questions (that is what Ed Discussion, office hours, and discussion are for). Further, please be patient and allow me about 24 hours to get back to your email; in particular, do not send me the same email multiple times.
General Advice and Style Guidelines:
Be as organized as possible when saving files on your computer; it helps to have a folder dedicated to this class. Don’t save everything in your Downloads folder!
Use descriptive names for variables and files.
Use comments to make your code more readable to yourself and others.
Start early, start often
Ask for help!
Getting Comfortable with Deepnote
All of your work in Deepnote will be done in cells. This is an example of a markdown cell. Markdown cells are used for displaying text, and in our class are an important part of answering homework questions each week.
To create a markdown cell below this cell, we can first use the shortcut ⌘ + j on Mac, or ctrl + j on PC, to create a new code cell. Then, we can convert this new cell to a markdown cell by using the command ⌘+shift+m, or ctrl+shift+m.
Exercise 1: Using only keyboard shortcuts, create a new markdown cell below this one. Write a short self-introduction. Using the code from this exercise, change the font color of your self-introduction to blue.
Remember: Markdown is subtly different on different sites, so what might work in Jupyter or GitHub, for example, might not work in Deepnote.
It is worth taking a look at this list of keyboard shortcuts for working in Deepnote. Spending the time to learn at least a few of these shortcuts now will make your life much easier going forward.
Exercise 2: Use the link above to learn the keyboard shortcut for deleting a cell. Using only keyboard shortcuts, create a new cell and then delete it.
#This is an example of a comment inside of a code cell
#Comments can be used to help people reading your code understand it better...
#they can also be used to remove portions of code from being evaluated (think debugging!)
2**3
8
Exercise 3: Create a new code cell and evaluate 2^3. Is this different from what you would expect?
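As a hint for Exercise 3, here is a minimal sketch comparing the two operators. In Python, `**` is exponentiation, while `^` is bitwise XOR on integers. The variable names `power` and `xor` are only for illustration.

```python
# ** raises the left operand to the power of the right operand.
power = 2 ** 3

# ^ compares the binary digits of both integers:
# 2 is 0b10 and 3 is 0b11, so XOR gives 0b01.
xor = 2 ^ 3

print(power)  # 8
print(xor)    # 1
```

If you are used to writing powers with `^` in other tools, this is an easy mistake to make, since `2^3` runs without any error.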
Uploading files, pandas, and getting started on Homework 1
Exercise 4: Import pandas. Practice uploading a dataset by downloading the csv file found at this link. This is a good time to practice giving your csv file a descriptive name. Load it into this notebook using df = pd.read_csv(...). Explore what df.head(), df.columns, and df.shape return.
import pandas as pd
df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
df.head()
| | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 8 | 2021-07-23--2021-07-30 | Beggin' | 48,633,449 | Måneskin | 3377762.0 | 3Wrjm47oTz2sjIgck11l5e | ['indie rock italiano', 'italian pop'] | ... | 0.714 | 0.800 | -4.808 | 0.0504 | 0.1270 | 0.3590 | 134.002 | 211560.0 | 0.589 | B |
1 | 2 | 2 | 3 | 2021-07-23--2021-07-30 | STAY (with Justin Bieber) | 47,248,719 | The Kid LAROI | 2230022.0 | 5HCyWlXZPP0y6Gqq8TgA20 | ['australian hip hop'] | ... | 0.591 | 0.764 | -5.484 | 0.0483 | 0.0383 | 0.1030 | 169.928 | 141806.0 | 0.478 | C#/Db |
2 | 3 | 1 | 11 | 2021-06-25--2021-07-02 | good 4 u | 40,162,559 | Olivia Rodrigo | 6266514.0 | 4ZtFanR9U6ndgddUvNcjcG | ['pop'] | ... | 0.563 | 0.664 | -5.044 | 0.1540 | 0.3350 | 0.0849 | 166.928 | 178147.0 | 0.688 | A |
3 | 4 | 3 | 5 | 2021-07-02--2021-07-09 | Bad Habits | 37,799,456 | Ed Sheeran | 83293380.0 | 6PQ88X9TkUIAUIZJHW2upE | ['pop', 'uk pop'] | ... | 0.808 | 0.897 | -3.712 | 0.0348 | 0.0469 | 0.3640 | 126.026 | 231041.0 | 0.591 | B |
4 | 5 | 5 | 1 | 2021-07-23--2021-07-30 | INDUSTRY BABY (feat. Jack Harlow) | 33,948,454 | Lil Nas X | 5473565.0 | 27NovPIUIRrOZoCHxABJwK | ['lgbtq+ hip hop', 'pop rap'] | ... | 0.736 | 0.704 | -7.409 | 0.0615 | 0.0203 | 0.0501 | 149.995 | 212000.0 | 0.894 | D#/Eb |
5 rows × 23 columns
df.columns
Index(['Index', 'Highest Charting Position', 'Number of Times Charted',
'Week of Highest Charting', 'Song Name', 'Streams', 'Artist',
'Artist Followers', 'Song ID', 'Genre', 'Release Date', 'Weeks Charted',
'Popularity', 'Danceability', 'Energy', 'Loudness', 'Speechiness',
'Acousticness', 'Liveness', 'Tempo', 'Duration (ms)', 'Valence',
'Chord'],
dtype='object')
df.shape
(1556, 23)
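The `na_values=" "` argument in the `read_csv` call tells pandas to treat a cell containing a single space as missing. The sketch below illustrates this on a made-up three-row CSV held in memory. The column names `song` and `tempo` and the variable names are hypothetical and not taken from the Spotify file.

```python
import io

import pandas as pd

# Hypothetical CSV text; the middle row stores a single space instead of a number.
csv_text = "song,tempo\nA,120.0\nB, \nC,98.5\n"

# Default settings: the lone space is not recognized as a missing value.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values=" ", the lone space is read as NaN.
clean = pd.read_csv(io.StringIO(csv_text), na_values=" ")

print(raw["tempo"].dtype)
print(clean["tempo"].dtype)
print(clean["tempo"].isna().sum())
```

Comparing the two dtypes is a quick way to check whether a placeholder value is keeping a column from being read as numbers.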
Exercise 5: Use info() to see which columns are stored numerically; then use describe() to find the average number of times a song in the dataset has charted. Write your answers to these questions in a markdown cell.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1556 entries, 0 to 1555
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Index 1556 non-null int64
1 Highest Charting Position 1556 non-null int64
2 Number of Times Charted 1556 non-null int64
3 Week of Highest Charting 1556 non-null object
4 Song Name 1556 non-null object
5 Streams 1556 non-null object
6 Artist 1556 non-null object
7 Artist Followers 1545 non-null float64
8 Song ID 1545 non-null object
9 Genre 1545 non-null object
10 Release Date 1545 non-null object
11 Weeks Charted 1556 non-null object
12 Popularity 1545 non-null float64
13 Danceability 1545 non-null float64
14 Energy 1545 non-null float64
15 Loudness 1545 non-null float64
16 Speechiness 1545 non-null float64
17 Acousticness 1545 non-null float64
18 Liveness 1545 non-null float64
19 Tempo 1545 non-null float64
20 Duration (ms) 1545 non-null float64
21 Valence 1545 non-null float64
22 Chord 1545 non-null object
dtypes: float64(11), int64(3), object(9)
memory usage: 279.7+ KB
df.describe()
| | Index | Highest Charting Position | Number of Times Charted | Artist Followers | Popularity | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1556.000000 | 1556.000000 | 1556.000000 | 1.545000e+03 | 1545.000000 | 1545.000000 | 1545.000000 | 1545.000000 | 1545.000000 | 1545.000000 | 1545.000000 | 1545.000000 | 1545.000000 | 1545.000000 |
mean | 778.500000 | 87.744216 | 10.668380 | 1.471690e+07 | 70.089320 | 0.689997 | 0.633495 | -6.348474 | 0.123656 | 0.248695 | 0.181202 | 122.811023 | 197940.816828 | 0.514704 |
std | 449.322824 | 58.147225 | 16.360546 | 1.667579e+07 | 15.824034 | 0.142444 | 0.161577 | 2.509281 | 0.110383 | 0.250326 | 0.144071 | 29.591088 | 47148.930420 | 0.227326 |
min | 1.000000 | 1.000000 | 1.000000 | 4.883000e+03 | 0.000000 | 0.150000 | 0.054000 | -25.166000 | 0.023200 | 0.000025 | 0.019700 | 46.718000 | 30133.000000 | 0.032000 |
25% | 389.750000 | 37.000000 | 1.000000 | 2.123734e+06 | 65.000000 | 0.599000 | 0.532000 | -7.491000 | 0.045600 | 0.048500 | 0.096600 | 97.960000 | 169266.000000 | 0.343000 |
50% | 778.500000 | 80.000000 | 4.000000 | 6.852509e+06 | 73.000000 | 0.707000 | 0.642000 | -5.990000 | 0.076500 | 0.161000 | 0.124000 | 122.012000 | 193591.000000 | 0.512000 |
75% | 1167.250000 | 137.000000 | 12.000000 | 2.269875e+07 | 80.000000 | 0.796000 | 0.752000 | -4.711000 | 0.165000 | 0.388000 | 0.217000 | 143.860000 | 218902.000000 | 0.691000 |
max | 1556.000000 | 200.000000 | 142.000000 | 8.333778e+07 | 100.000000 | 0.980000 | 0.970000 | 1.509000 | 0.884000 | 0.994000 | 0.962000 | 205.272000 | 588139.000000 | 0.979000 |
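In the info() output, Streams has dtype object, and it does not appear in the describe() table, which summarizes only the numeric columns by default. One likely reason is the thousands separators (e.g. 48,633,449), which make pandas read the values as text. The sketch below uses a hypothetical three-row miniature of the dataset, named `small`, to show how to pull one statistic out of describe() and one possible way to convert such strings to numbers.

```python
import pandas as pd

# Hypothetical miniature of the dataset, using the first three rows shown by df.head().
small = pd.DataFrame({
    "Number of Times Charted": [8, 3, 11],
    "Streams": ["48,633,449", "47,248,719", "40,162,559"],
})

# Streams is stored as text (object dtype), so describe() skips it.
print(small.dtypes)

# describe() returns a DataFrame, so one statistic can be looked up by row and column label.
mean_charted = small.describe().loc["mean", "Number of Times Charted"]
print(mean_charted)

# Removing the commas lets pandas convert the strings to integers.
streams = pd.to_numeric(small["Streams"].str.replace(",", "", regex=False))
print(streams.dtype)
```

The same lookup, `df.describe().loc["mean", "Number of Times Charted"]`, should work on the full dataframe loaded earlier.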
Exercise 6: Using slicing techniques from Monday’s lecture, create a new dataframe which has just the “Song Name” column from the original dataframe.
df2 = df.loc[:, "Song Name"]
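One detail worth checking here: selecting a single column label with `.loc` returns a pandas Series, while passing a list of labels returns a DataFrame. A minimal sketch on a hypothetical two-row dataframe named `frame`:

```python
import pandas as pd

# Hypothetical two-row dataframe for illustration.
frame = pd.DataFrame({
    "Song Name": ["willow", "ivy"],
    "Artist": ["Taylor Swift", "Taylor Swift"],
})

as_series = frame.loc[:, "Song Name"]    # single label: a Series
as_frame = frame.loc[:, ["Song Name"]]   # list of labels: a one-column DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__, as_frame.shape)
```

If the exercise requires an actual dataframe, the list form `df.loc[:, ["Song Name"]]` is one way to get it.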
Exercise 7: Using value_counts(), determine how many times each artist appears in the dataset. Then pick an artist and use boolean indexing to find all songs by that artist in the original dataframe.
df["Artist"].value_counts()
Taylor Swift 52
Lil Uzi Vert 32
Justin Bieber 32
Juice WRLD 30
Pop Smoke 29
..
Chris Brown, Young Thug 1
Rauw Alejandro, J Balvin 1
347aidan 1
Migrantes, Alico 1
Dadá Boladão, Tati Zaqui, OIK 1
Name: Artist, Length: 716, dtype: int64
df3 = df.loc[df["Artist"] == "Taylor Swift", "Song Name"]
df3
398 Mr. Perfectly Fine (Taylor’s Version) (From Th...
421 Love Story (Taylor’s Version)
424 willow
428 You Belong With Me (Taylor’s Version)
429 Fearless (Taylor’s Version)
431 Fifteen (Taylor’s Version)
432 The Way I Loved You (Taylor’s Version)
433 You All Over Me (feat. Maren Morris) (Taylor’s...
434 Hey Stephen (Taylor’s Version)
435 White Horse (Taylor’s Version)
436 Forever & Always (Taylor’s Version)
437 Breathe (feat. Colbie Caillat) (Taylor’s Version)
439 That’s When (feat. Keith Urban) (Taylor’s Vers...
440 Tell Me Why (Taylor’s Version)
441 You’re Not Sorry (Taylor’s Version)
444 Don’t You (Taylor’s Version) (From The Vault)
445 We Were Happy (Taylor’s Version) (From The Vault)
585 champagne problems
608 no body, no crime (feat. HAIM)
667 ‘tis the damn season
671 gold rush
688 Christmas Tree Farm
691 tolerate it
694 happiness
695 ivy
696 dorothea
697 coney island (feat. The National)
698 evermore (feat. Bon Iver)
699 long story short
700 cowboy like me
701 marjorie
702 closure
713 cardigan
889 exile (feat. Bon Iver)
921 the 1
942 august
948 the last great american dynasty
950 my tears ricochet
960 invisible string
965 mirrorball
966 seven
967 this is me trying
968 betty
970 illicit affairs
976 mad woman
977 epiphany
978 peace
983 hoax
1374 You Need To Calm Down
1425 Only The Young - Featured in Miss Americana
1466 ME! (feat. Brendon Urie of Panic! At The Disco)
1555 Lover (Remix) [feat. Shawn Mendes]
Name: Song Name, dtype: object
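To see what boolean indexing does step by step, here is a minimal sketch on a hypothetical three-row dataframe named `songs`. The comparison produces a Series of True/False values, one per row, and that Series is then used to keep only the matching rows.

```python
import pandas as pd

# Hypothetical three-row dataframe for illustration.
songs = pd.DataFrame({
    "Song Name": ["willow", "Beggin'", "cardigan"],
    "Artist": ["Taylor Swift", "Måneskin", "Taylor Swift"],
})

# Count how many rows each artist has.
counts = songs["Artist"].value_counts()

# Boolean Series: True where the artist matches, False otherwise.
mask = songs["Artist"] == "Taylor Swift"

by_artist = songs[mask]                 # all columns, matching rows only
names = songs.loc[mask, "Song Name"]    # matching rows, one column

print(counts)
print(mask.tolist())
print(names.tolist())
```

Storing the mask in its own variable is optional, but it makes it easier to inspect the True/False values while debugging.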
Getting Started on Homework 1
Remember that you can work in groups of 2-3 students on the homework, and you all can submit the same work. Just remember to include the names of your collaborators. Let’s quickly see how to add collaborators to a project.
On Thursday we will work on Homework 1 together. It helps if you come to discussion having already found a dataset you would like to use from Kaggle (you will need to create an account). When picking a dataset, here are a few things to keep in mind:
Find a dataset that interests you, but spend the majority of your time working on the homework questions. It can be easy to waste time trying to find the perfect dataset.
The data you use for this homework should be relatively “clean” already (I will show you an example of a dataset that would be a bad choice to use for this homework). We will have opportunities later in the quarter to work on data cleaning.