Week 10 Wednesday
Announcements
Please fill out a course evaluation if you haven’t already! (I think that link will work; let me know if it doesn’t.)
I have office hours at 1pm today, in my office, RH 440J.
If you’re behind on the video quizzes, try to catch up today. I plan to convert them to “practice quizzes” and post the total video quiz score Thursday morning.
General plan for lectures this week: About 20 minutes lecture, then time to work on the course project.
I plan to have one Zoom office hour during our scheduled final exam time, 10:30am-11:30am on Monday, December 5th.
Some random project advice
Where to get ideas?
Browsing Kaggle is the most fun, but it might also be overwhelming. Browsing our class worksheets would be an equally good option. I’ll be very happy if your project shows that you understood material from our worksheets.
What is realistic?
A rule of thumb is that, if a human expert could not do something, then you shouldn’t expect a Machine Learning algorithm to be able to do it. For example, a human expert can predict the price of a house quite accurately. A human expert probably cannot predict the zip code of a house.
That rule of thumb is for advanced machine learning models. You shouldn’t expect something written in a short period of time to match a human expert. Maybe more realistic is to try to do better than random guessing or some other simple baseline algorithm (like always predicting the median value).
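As a rough illustration of the "always predict the median" baseline, here is a minimal sketch. The prices are made up for illustration, and the helper name mean_absolute_error is just a hypothetical hand-written function, not something from our worksheets.

```python
from statistics import median

# Hypothetical house prices (in thousands of dollars), made up for illustration
train_prices = [310, 450, 275, 520, 390]
test_prices = [300, 480, 410]

# The baseline ignores all input features and always predicts the training median
baseline_prediction = median(train_prices)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Score the baseline on the held-out prices
baseline_mae = mean_absolute_error(
    test_prices, [baseline_prediction] * len(test_prices)
)
print(baseline_prediction, baseline_mae)
```

If your model's error on the same test data is clearly smaller than the baseline's error, that is some evidence the model has learned something.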
What if my project is too short?
One option is to just load a different dataset and do something else. (Don’t do the same thing twice… that’s not a good use of time.)
What should I do to get a good grade?
Explain what you’re doing clearly (in markdown cells, not Python comments) and show me what you learned in Math 10. My favorite projects are the ones that clearly use the Math 10 material.
Can you say more about references?
You don’t have to reference my lecture notes, but basically everything else should be referenced (even if you make changes to it). Provide a precise link when possible using this markdown syntax: [text to display](http://www.uci.edu), which will result in a clickable link that reads “text to display”. Ask on Ed Discussion if you’re unsure about anything.
More practice with the Spotify dataset
We also used this dataset on Monday.
import pandas as pd
import altair as alt
df = pd.read_csv("spotify_dataset.csv", na_values=" ").dropna(axis=0).copy()
alt.Chart(df).mark_circle().encode(
    x="Energy",
    y="Danceability",
    color=alt.Color("Valence", scale=alt.Scale(scheme="spectral", reverse=True)),
    tooltip=["Artist", "Song Name"],
)
Load the data from this page on GitHub and name the result df2.
There are lots of ways to get this data (for example, you could probably copy and paste it into Excel, and then save the Excel file as a csv file). We’ll see a surprisingly easy way.
If you follow the above link, you will notice a button that says raw near the table. If you click that button, you will get the contents of the csv file, without any formatting. Also notice that the resulting url ends in csv. We save that url as a string here.
# URL for the raw data
url = "https://gist.githubusercontent.com/mbejda/9912f7a366c62c1f296c/raw/dd94a25492b3062f4ca0dc2bb2cdf23fec0896ea/10000-MTV-Music-Artists-page-1.csv"
We can now load the data directly from that website. Notice how we do not need to download the csv file to our computer first.
df2 = pd.read_csv(url)
df2.head()
 | name | facebook | twitter | website | genre | mtv |
---|---|---|---|---|---|---|
0 | Adele | http://www.facebook.com/9770929278 | http://www.twitter.com/officialadele | NaN | Pop | http://www.mtv.com/artists/adele/biography |
1 | Joey + Rory | http://www.facebook.com/15044507815 | http://www.twitter.com/joeyandrory | NaN | Country | http://www.cmt.com/artists/joey-rory/biography |
2 | Draaco Aventura | http://www.facebook.com/856796091053581 | http://www.twitter.com/DraacoAventura | http://www.bandpage.com/draacoaventura | Pop Latino | http://www.mtv.com/artists/draaco-aventura/bio... |
3 | Justin Bieber | http://www.facebook.com/309570926875 | http://www.twitter.com/justinbieber | http://www.justinbiebermusic.com | Pop | http://www.mtv.com/artists/justin-bieber/biogr... |
4 | Peer van Mladen | http://www.facebook.com/264487966 | http://www.twitter.com/Predrag_Jugovic | http://pejaintergroup.eu/Peer_van_Mladen.html | House | http://www.mtv.com/artists/peer-van-mladen/bio... |
Save that dataset as a csv file, in case it later disappears from GitHub.
When using this to_csv method, I almost always use index=False, because most of the datasets I work with do not contain any interesting information in the index.
df2.to_csv("from_github.csv", index=False)
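To see what index=False changes, here is a small sketch. It uses a made-up two-row DataFrame in place of df2, and it writes to an in-memory string (via io.StringIO) instead of a real file so that it can be run anywhere.

```python
import io

import pandas as pd

# Small made-up DataFrame standing in for df2
small = pd.DataFrame({"name": ["Adele", "Garou"], "genre": ["Pop", "World/Reggae"]})

# With the default index=True, the index is written as an extra unnamed column,
# which shows up as "Unnamed: 0" when the csv is read back in
with_index = pd.read_csv(io.StringIO(small.to_csv()))

# With index=False, only the real columns are written
without_index = pd.read_csv(io.StringIO(small.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```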
Merge the result into the Spotify dataset.
Remember this merge method in case you find yourself wanting to combine multiple datasets.
df.sample(3)
 | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
528 | 529 | 49 | 22 | 2020-11-13--2020-11-20 | Sofia | 4,982,892 | Clairo | 2722638.0 | 7B3z0ySL9Rr0XvZEAjWZzM | ['bedroom pop', 'indie pop', 'pop'] | ... | 0.744 | 0.619 | -9.805 | 0.039 | 0.59800 | 0.2310 | 112.997 | 188387.0 | 0.641 | C |
1536 | 1537 | 125 | 2 | 2019-12-27--2020-01-03 | Writing on the Wall (feat. Post Malone, Cardi ... | 5,229,616 | French Montana | 4039037.0 | 7x9nXsowok1JszkVztI5NI | ['gangster rap', 'hip hop', 'pop rap', 'rap', ... | ... | 0.773 | 0.836 | -2.326 | 0.153 | 0.28300 | 0.0828 | 112.010 | 201271.0 | 0.497 | A |
815 | 816 | 14 | 6 | 2020-09-25--2020-10-02 | FRANCHISE (feat. Young Thug & M.I.A.) | 4,821,213 | Travis Scott | 17732077.0 | 4jVBIpuOiMj1crqd8LoCrJ | ['rap', 'slap house'] | ... | 0.835 | 0.699 | -5.405 | 0.277 | 0.00671 | 0.1950 | 154.981 | 202795.0 | 0.547 | G#/Ab |
3 rows × 23 columns
df2.sample(3)
 | name | facebook | twitter | website | genre | mtv |
---|---|---|---|---|---|---|
1111 | Andy Graham | NaN | NaN | NaN | NaN | http://www.mtv.com/artists/andy-graham/biography |
2158 | Garou | http://www.facebook.com/11515854604 | http://www.twitter.com/garou_officiel | NaN | World/Reggae | http://www.mtv.com/artists/garou/biography |
2518 | The Ramones | http://www.facebook.com/12789020378 | NaN | http://officialramones.com/ | Rock | http://www.mtv.com/artists/the-ramones/biography |
Notice how both DataFrames contain the artist name, under the columns named “Artist” and “name”, respectively. We try to merge these together. There is no error, but the resulting DataFrame is empty. (The how="inner" tells pandas to keep only the rows whose key value appears in both DataFrames.)
df.merge(df2, left_on="Artist", right_on="name", how="inner")
 | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Tempo | Duration (ms) | Valence | Chord | name | facebook | twitter | website | genre | mtv |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows × 29 columns
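To make the effect of how="inner" concrete, here is a minimal sketch on two tiny made-up DataFrames (the names songs and artists and their contents are hypothetical, not taken from the real datasets). It compares an inner merge with a left merge.

```python
import pandas as pd

# Made-up stand-ins for the two datasets
songs = pd.DataFrame(
    {"Artist": ["Adele", "Clairo", "Adele"], "Song": ["Song X", "Song Y", "Song Z"]}
)
artists = pd.DataFrame({"name": ["Adele", "Garou"], "genre": ["Pop", "World/Reggae"]})

# Inner merge: keep only the rows whose key appears in both DataFrames
inner = songs.merge(artists, left_on="Artist", right_on="name", how="inner")

# Left merge: keep every row of songs, filling in missing values where no match exists
left = songs.merge(artists, left_on="Artist", right_on="name", how="left")

print(len(inner), len(left))
print(left)
```

Here both Adele rows survive the inner merge, while the Clairo row only survives the left merge (with missing values in the name and genre columns).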
Let’s look more closely at df2. The first “name” that appears is Adele. Does Adele appear in the other DataFrame, df?
df2.head()
 | name | facebook | twitter | website | genre | mtv |
---|---|---|---|---|---|---|
0 | Adele | http://www.facebook.com/9770929278 | http://www.twitter.com/officialadele | NaN | Pop | http://www.mtv.com/artists/adele/biography |
1 | Joey + Rory | http://www.facebook.com/15044507815 | http://www.twitter.com/joeyandrory | NaN | Country | http://www.cmt.com/artists/joey-rory/biography |
2 | Draaco Aventura | http://www.facebook.com/856796091053581 | http://www.twitter.com/DraacoAventura | http://www.bandpage.com/draacoaventura | Pop Latino | http://www.mtv.com/artists/draaco-aventura/bio... |
3 | Justin Bieber | http://www.facebook.com/309570926875 | http://www.twitter.com/justinbieber | http://www.justinbiebermusic.com | Pop | http://www.mtv.com/artists/justin-bieber/biogr... |
4 | Peer van Mladen | http://www.facebook.com/264487966 | http://www.twitter.com/Predrag_Jugovic | http://pejaintergroup.eu/Peer_van_Mladen.html | House | http://www.mtv.com/artists/peer-van-mladen/bio... |
The following is counter-intuitive to me. If you ask if something is in a pandas Series, pandas will check if it occurs in the index of that Series.
"Adele" in df["Artist"]
False
Here is a more explicit way to check the same thing.
"Adele" in df["Artist"].index
False
What we really want is to check if Adele occurs in the values of the pandas Series. (Notice how we do not put parentheses after values, because values is an attribute of a Series. This is different from a Python dictionary, where values is a method and you would write values().)
"Adele" in df["Artist"].values
True
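Here is a small self-contained sketch of the same behavior, using a made-up Series with integer index labels. The last two checks (isin and ==) are alternatives we did not use in lecture; they also look at the values rather than the index.

```python
import pandas as pd

# Made-up Series whose index labels are 10 and 20
s = pd.Series(["Adele", "Clairo"], index=[10, 20])

# "in" on a Series checks the index labels, not the values
print("Adele" in s)
print(10 in s)

# Checking the values explicitly
print("Adele" in s.values)

# Two alternatives that also check the values
print(s.isin(["Adele"]).any())
print((s == "Adele").any())
```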
We’re now back where we started. Adele seems to occur in both DataFrames. Why didn’t our merge work?
Added after lecture. I see I made a mistake here. I should have done df2["name"].values, like we were just discussing above! I didn’t notice the mistake because I was expecting to get False.
"Adele" in df2["name"]
False
Let’s look more closely at the top-left entry. Notice how there are spaces on either side.
df2.loc[0, "name"]
' Adele '
There is a Python string method, strip, that, if you don’t pass any arguments, will remove whitespace from either end of a string.
" chris ".strip()
'chris'
We want to apply that method to each entry. Using map is a good idea, but the following does not work, because strip is a string method, not a standalone Python function, so the bare name strip is not defined.
df2["name"].map(strip)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
/tmp/ipykernel_84/2096989577.py in <module>
----> 1 df2["name"].map(strip)
NameError: name 'strip' is not defined
I expected the following to work, but it didn’t because of missing values.
df2["name"].map(lambda x: x.strip())
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/tmp/ipykernel_84/3552155007.py in <module>
----> 1 df2["name"].map(lambda x: x.strip())
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/series.py in map(self, arg, na_action)
3907 dtype: object
3908 """
-> 3909 new_values = super()._map_values(arg, na_action=na_action)
3910 return self._constructor(new_values, index=self.index).__finalize__(
3911 self, method="map"
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/base.py in _map_values(self, mapper, na_action)
935
936 # mapper is a function
--> 937 new_values = map_f(values, mapper)
938
939 return new_values
pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()
/tmp/ipykernel_84/3552155007.py in <lambda>(x)
----> 1 df2["name"].map(lambda x: x.strip())
AttributeError: 'float' object has no attribute 'strip'
Let’s remove the rows where the “name” value is missing. (I didn’t want to use dropna, because I only care about the “name” column. If something is missing in a different column, I don’t want to remove that row.)
df3 = df2[~df2["name"].isna()]
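As a side note, dropna accepts a subset argument that restricts the check to particular columns, so it could also have been used here. The sketch below, on a made-up three-row DataFrame (the name toy and its contents are hypothetical), suggests that the two approaches keep the same rows.

```python
import numpy as np
import pandas as pd

# Made-up data: the middle row is missing its name, the other rows are missing a website
toy = pd.DataFrame(
    {
        "name": [" Adele ", np.nan, " Garou "],
        "website": [np.nan, "http://example.com", np.nan],
    }
)

# Boolean mask, as in the lecture
by_mask = toy[~toy["name"].isna()]

# dropna restricted to the "name" column; missing websites are ignored
by_subset = toy.dropna(subset=["name"])

print(by_mask.equals(by_subset))
print(list(by_subset.index))
```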
Now we can use the map method.
df3["name"].map(lambda x: x.strip())
0 Adele
1 Joey + Rory
2 Draaco Aventura
3 Justin Bieber
4 Peer van Mladen
...
2994 Crosby, Stills, Nash & Young
2995 CRU
2996 Crystal Waters
2997 Crazy Town
2998 Cynthia Fetty
Name: name, Length: 2992, dtype: object
Let’s make a new DataFrame.
df4 = df3.copy()
Let’s replace the “name” column with the stripped version.
df4["name"] = df3["name"].map(lambda x: x.strip())
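For comparison, here are two other ways to strip the names that tolerate missing values, so the rows with a missing name would not have to be removed first. This is only a sketch on a made-up Series; in the lecture we removed those rows anyway, and an inner merge would not have matched them.

```python
import numpy as np
import pandas as pd

# Made-up Series with padded names and one missing value
names = pd.Series([" Adele ", np.nan, " Justin Bieber "])

# The .str accessor applies the string method and leaves missing values as missing
stripped_str = names.str.strip()

# map with na_action="ignore" skips missing values instead of passing them to the function
stripped_map = names.map(lambda x: x.strip(), na_action="ignore")

print(stripped_str.tolist())
print(stripped_map.tolist())
```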
Now we can finally perform our merge. If you scroll all the way to the right, you will see that the web links have been added to the far right side.
It was some work to get to this stage, and that’s what a lot of data science is like. There are sayings to the effect of, “a data scientist spends 80% of their time cleaning the data”. In this case, the cleaning we did was removing the rows where the “name” value was missing, and also removing the whitespace from either end of the name strings.
df5 = df.merge(df4, left_on="Artist", right_on="name", how="inner")
df5
 | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Tempo | Duration (ms) | Valence | Chord | name | facebook | twitter | website | genre | mtv |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4 | 3 | 5 | 2021-07-02--2021-07-09 | Bad Habits | 37,799,456 | Ed Sheeran | 83293380.0 | 6PQ88X9TkUIAUIZJHW2upE | ['pop', 'uk pop'] | ... | 126.026 | 231041.0 | 0.591 | B | Ed Sheeran | http://www.facebook.com/9189674485 | http://www.twitter.com/edsheeran | http://www.edsheeran.com/ | Singer/Songwriter | http://www.mtv.com/artists/ed-sheeran/biography |
1 | 117 | 80 | 80 | 2019-12-27--2020-01-03 | Shape of You | 6,452,492 | Ed Sheeran | 83293380.0 | 7qiZfU4dY1lWllzX7mPBI3 | ['pop', 'uk pop'] | ... | 95.977 | 233713.0 | 0.931 | C#/Db | Ed Sheeran | http://www.facebook.com/9189674485 | http://www.twitter.com/edsheeran | http://www.edsheeran.com/ | Singer/Songwriter | http://www.mtv.com/artists/ed-sheeran/biography |
2 | 120 | 99 | 83 | 2020-07-24--2020-07-31 | Perfect | 6,278,765 | Ed Sheeran | 83293380.0 | 0tgVpDi06FyKpA1z0VMD4v | ['pop', 'uk pop'] | ... | 95.050 | 263400.0 | 0.168 | G#/Ab | Ed Sheeran | http://www.facebook.com/9189674485 | http://www.twitter.com/edsheeran | http://www.edsheeran.com/ | Singer/Songwriter | http://www.mtv.com/artists/ed-sheeran/biography |
3 | 427 | 18 | 18 | 2021-01-01--2021-01-08 | Afterglow | 4,965,330 | Ed Sheeran | 1250353.0 | 5dA45onYgYACRp8C5xEOS9 | [] | ... | 110.184 | 185487.0 | 0.273 | B | Ed Sheeran | http://www.facebook.com/9189674485 | http://www.twitter.com/edsheeran | http://www.edsheeran.com/ | Singer/Songwriter | http://www.mtv.com/artists/ed-sheeran/biography |
4 | 530 | 176 | 9 | 2021-02-12--2021-02-19 | Photograph | 4,974,880 | Ed Sheeran | 83337783.0 | 1HNkqx9Ahdgi1Ixy2xkKkL | ['pop', 'uk pop'] | ... | 107.989 | 258987.0 | 0.201 | E | Ed Sheeran | http://www.facebook.com/9189674485 | http://www.twitter.com/edsheeran | http://www.edsheeran.com/ | Singer/Songwriter | http://www.mtv.com/artists/ed-sheeran/biography |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
486 | 1508 | 157 | 1 | 2020-01-17--2020-01-24 | Hands | 5,934,464 | Mac Miller | 6189454.0 | 6CrnvqCxBKVWahSiQwOesM | ['hip hop', 'pittsburgh rap', 'rap'] | ... | 74.962 | 199981.0 | 0.542 | C | Mac Miller | http://www.facebook.com/125173346802 | http://www.twitter.com/macmiller | http://macmillerofficial.com | Hip-Hop/Rap | http://www.mtv.com/artists/mac-miller/biography |
487 | 1437 | 70 | 2 | 2020-01-31--2020-02-07 | I Do It (ft. Big Sean, Lil Baby) | 5,561,205 | Lil Wayne | 10710088.0 | 1bRO28yzxgO3y3UmNR29TZ | ['hip hop', 'new orleans rap', 'pop rap', 'rap... | ... | 138.005 | 184440.0 | 0.321 | C#/Db | Lil Wayne | http://www.facebook.com/LilWayne | https://twitter.com/LilTunechi | http://www.youngmoney.com/ | Hip-Hop/Rap | http://www.mtv.com/artists/lil-wayne/biography |
488 | 1464 | 190 | 1 | 2020-01-31--2020-02-07 | Hips Don't Lie (feat. Wyclef Jean) | 4,918,636 | Shakira | 22136717.0 | 3ZFTkvIE7kyPt6Nu3PEa7V | ['colombian pop', 'dance pop', 'latin', 'latin... | ... | 100.024 | 218093.0 | 0.758 | A#/Bb | Shakira | http://www.facebook.com/5027904559 | http://www.twitter.com/shakira | http://www.shakira.net/ | World/International | http://www.mtv.com/artists/shakira/biography |
489 | 1466 | 191 | 4 | 2020-01-03--2020-01-10 | Going Bad (feat. Drake) | 4,889,500 | Meek Mill | 5241145.0 | 2IRZnDFmlqMuOrYOLnZZyc | ['hip hop', 'philly rap', 'pop rap', 'rap', 's... | ... | 86.003 | 180522.0 | 0.544 | E | Meek Mill | http://www.facebook.com/199361046769455 | http://www.twitter.com/meekmill | http://www.meekmilldreamteam.com | Hip-Hop | http://www.mtv.com/artists/meek-mill/biography |
490 | 1537 | 125 | 2 | 2019-12-27--2020-01-03 | Writing on the Wall (feat. Post Malone, Cardi ... | 5,229,616 | French Montana | 4039037.0 | 7x9nXsowok1JszkVztI5NI | ['gangster rap', 'hip hop', 'pop rap', 'rap', ... | ... | 112.010 | 201271.0 | 0.497 | A | French Montana | http://www.facebook.com/450131130590 | http://www.twitter.com/frenchmontana | http://www.frenchmontanamusic.com | Hip-Hop | http://www.mtv.com/artists/french-montana/biog... |
491 rows × 29 columns
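Looking back at the empty merge from earlier, one way to investigate a merge that unexpectedly matches nothing is an outer merge with indicator=True. The sketch below uses made-up stand-ins for df and df2 (the padded names are chosen to imitate the problem we found); it is a possible diagnostic, not what we did in lecture.

```python
import pandas as pd

# Made-up stand-ins for df and df2; the names on the right are padded with spaces
left = pd.DataFrame({"Artist": ["Adele", "Clairo"]})
right = pd.DataFrame({"name": [" Adele ", " Garou "], "genre": ["Pop", "World/Reggae"]})

# how="outer" keeps every row from both sides, and indicator=True adds a
# "_merge" column saying whether each row came from the left, the right, or both
check = left.merge(right, left_on="Artist", right_on="name", how="outer", indicator=True)
print(check["_merge"].value_counts())

# repr makes stray whitespace visible
print(repr(right.loc[0, "name"]))
```

If no row is labeled "both" even though the names look identical when displayed, that is a hint to inspect individual entries for things like extra whitespace.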
Here we add a new encoding channel, the href channel. If you click on one of the points, Altair will open the link that is in the “twitter” column for that row.
# On Deepnote, alt+click or command+click to open links
alt.Chart(df5).mark_circle().encode(
    x="Energy",
    y="Danceability",
    color=alt.Color("Valence", scale=alt.Scale(scheme="spectral", reverse=True)),
    tooltip=["Artist", "Song Name"],
    href="twitter",
)