Week 10 Wednesday


  • Please fill out a course evaluation if you haven’t already! (I think that link will work; let me know if it doesn’t.)

  • I have office hours at 1pm today, in my office, RH 440J.

  • If you’re behind on the video quizzes, try to catch up today. I plan to convert them to “practice quizzes” and post the total video quiz score Thursday morning.

  • General plan for lectures this week: About 20 minutes lecture, then time to work on the course project.

  • I plan to have one Zoom office hour during our scheduled final exam time, 10:30am-11:30am on Monday, December 5th.

Some random project advice

  • Where to get ideas?

Browsing Kaggle is the most fun, but it might also be overwhelming. Browsing our class worksheets would be an equally good option. I’ll be very happy if your project shows that you understood material from our worksheets.

  • What is realistic?

A rule of thumb is that, if a human expert could not do something, then you shouldn’t expect a machine learning algorithm to be able to do it. For example, a human expert can predict the price of a house quite accurately. A human expert probably cannot predict the zip code of a house.

That rule of thumb is for advanced machine learning models. You shouldn’t expect something written in a short period of time to match a human expert. Maybe more realistic is to try to do better than random guessing or some other simple baseline algorithm (like always predicting the median value).
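
To make the baseline idea concrete, here is a minimal sketch. The numbers are made up for illustration (they are not from any dataset we used in class), and `y_model` stands in for the predictions of whatever model you trained. It compares the mean absolute error of always predicting the median to the mean absolute error of the model.

```python
import pandas as pd

# Hypothetical true values and hypothetical model predictions
y_true = pd.Series([200, 250, 300, 400, 1000])
y_model = pd.Series([220, 240, 330, 380, 900])

# Baseline: always predict the median of the target
baseline = y_true.median()

# Mean absolute error of the baseline and of the model
mae_baseline = (y_true - baseline).abs().mean()
mae_model = (y_true - y_model).abs().mean()

print(mae_baseline, mae_model)
```

In a real project you would probably compute the median on the training set and compare errors on the test set. If your model does not beat this kind of baseline, that is worth reporting too.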

  • What if my project is too short?

One option is to just load a different dataset and do something else. (Don’t do the same thing twice… that’s not a good use of time.)

  • What should I do to get a good grade?

Explain what you’re doing clearly (in markdown cells, not Python comments) and show me what you learned in Math 10. My favorite projects are the ones that clearly use the Math 10 material.

  • Can you say more about references?

You don’t have to reference my lecture notes, but basically everything else should be referenced (even if you make changes to it). Provide a precise link when possible using this markdown syntax: [text to display](http://www.uci.edu) which will result in this: text to display. Ask on Ed Discussion if you’re unsure about anything.

More practice with the Spotify dataset

We also used this dataset on Monday.

import pandas as pd
import altair as alt
df = pd.read_csv("spotify_dataset.csv", na_values=" ").dropna(axis=0).copy()
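# The opening of this chart was lost; the mark type and the x and y columns are assumptions.
alt.Chart(df).mark_circle().encode(
    x="Acousticness",  # assumed
    y="Energy",  # assumed
    color=alt.Color("Valence", scale=alt.Scale(scheme="spectral", reverse=True)),
    tooltip=["Artist", "Song Name"],
)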
  • Load the data from this page on GitHub and name the result df2.

There are lots of ways to get this data (for example, you could probably copy and paste it into Excel, and then save the Excel file as a csv file). We’ll see a surprisingly easy way.

If you follow the above link, you will notice a button that says raw near the table. If you click that button, you will get the contents of the csv file, without any formatting. Also notice that the resulting url ends in csv. We save that url as a string here.

# URL for the raw data
url = "https://gist.githubusercontent.com/mbejda/9912f7a366c62c1f296c/raw/dd94a25492b3062f4ca0dc2bb2cdf23fec0896ea/10000-MTV-Music-Artists-page-1.csv"

We can now load the data directly from that website. Notice how we do not need to download the csv file to our computer first.

df2 = pd.read_csv(url)
name facebook twitter website genre mtv
0 Adele http://www.facebook.com/9770929278 http://www.twitter.com/officialadele NaN Pop http://www.mtv.com/artists/adele/biography
1 Joey + Rory http://www.facebook.com/15044507815 http://www.twitter.com/joeyandrory NaN Country http://www.cmt.com/artists/joey-rory/biography
2 Draaco Aventura http://www.facebook.com/856796091053581 http://www.twitter.com/DraacoAventura http://www.bandpage.com/draacoaventura Pop Latino http://www.mtv.com/artists/draaco-aventura/bio...
3 Justin Bieber http://www.facebook.com/309570926875 http://www.twitter.com/justinbieber http://www.justinbiebermusic.com Pop http://www.mtv.com/artists/justin-bieber/biogr...
4 Peer van Mladen http://www.facebook.com/264487966 http://www.twitter.com/Predrag_Jugovic http://pejaintergroup.eu/Peer_van_Mladen.html House http://www.mtv.com/artists/peer-van-mladen/bio...
  • Save that dataset as a csv file, in case it later disappears from GitHub.

When using this to_csv method, I almost always use index=False, because most of the datasets I work with do not contain any interesting information in the index.

df2.to_csv("from_github.csv", index=False)
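
Here is a small sketch of what index=False changes. The DataFrame `df_small` is a made-up example, and I write to in-memory buffers rather than to files on disk so that the code is self-contained.

```python
import io
import pandas as pd

# A tiny made-up DataFrame for illustration
df_small = pd.DataFrame({"name": ["Adele", "Garou"], "genre": ["Pop", "World/Reggae"]})

# Write to in-memory text buffers instead of files on disk
with_index = io.StringIO()
without_index = io.StringIO()
df_small.to_csv(with_index)                  # the index is written as an unnamed first column
df_small.to_csv(without_index, index=False)  # the index is not written

# Read both versions back and compare the column names
with_index.seek(0)
without_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```

Without index=False, reading the file back gives an extra column (named "Unnamed: 0" by pandas), which is usually not what you want.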
  • Merge the result into the Spotify dataset.

Remember this merge method in case you find yourself wanting to combine multiple datasets.

Index Highest Charting Position Number of Times Charted Week of Highest Charting Song Name Streams Artist Artist Followers Song ID Genre ... Danceability Energy Loudness Speechiness Acousticness Liveness Tempo Duration (ms) Valence Chord
528 529 49 22 2020-11-13--2020-11-20 Sofia 4,982,892 Clairo 2722638.0 7B3z0ySL9Rr0XvZEAjWZzM ['bedroom pop', 'indie pop', 'pop'] ... 0.744 0.619 -9.805 0.039 0.59800 0.2310 112.997 188387.0 0.641 C
1536 1537 125 2 2019-12-27--2020-01-03 Writing on the Wall (feat. Post Malone, Cardi ... 5,229,616 French Montana 4039037.0 7x9nXsowok1JszkVztI5NI ['gangster rap', 'hip hop', 'pop rap', 'rap', ... ... 0.773 0.836 -2.326 0.153 0.28300 0.0828 112.010 201271.0 0.497 A
815 816 14 6 2020-09-25--2020-10-02 FRANCHISE (feat. Young Thug & M.I.A.) 4,821,213 Travis Scott 17732077.0 4jVBIpuOiMj1crqd8LoCrJ ['rap', 'slap house'] ... 0.835 0.699 -5.405 0.277 0.00671 0.1950 154.981 202795.0 0.547 G#/Ab

3 rows × 23 columns

name facebook twitter website genre mtv
1111 Andy Graham NaN NaN NaN NaN http://www.mtv.com/artists/andy-graham/biography
2158 Garou http://www.facebook.com/11515854604 http://www.twitter.com/garou_officiel NaN World/Reggae http://www.mtv.com/artists/garou/biography
2518 The Ramones http://www.facebook.com/12789020378 NaN http://officialramones.com/ Rock http://www.mtv.com/artists/the-ramones/biography

Notice that both DataFrames contain the artist name, in the columns named “Artist” and “name”, respectively. We try to merge on these two columns. No error is raised, but the resulting DataFrame is empty. (The how="inner" argument tells pandas to keep only the rows whose key value appears in both DataFrames.)

df.merge(df2, left_on="Artist", right_on="name", how="inner")
Index Highest Charting Position Number of Times Charted Week of Highest Charting Song Name Streams Artist Artist Followers Song ID Genre ... Tempo Duration (ms) Valence Chord name facebook twitter website genre mtv

0 rows × 29 columns
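
To see what how="inner" does on a small scale, here is a sketch with two made-up DataFrames (`songs` and `links` are hypothetical names, not our actual data). It compares an inner merge to a left merge.

```python
import pandas as pd

# Two tiny made-up DataFrames whose key columns have different names
songs = pd.DataFrame({"Artist": ["A", "B", "C"], "Song Name": ["s1", "s2", "s3"]})
links = pd.DataFrame({"name": ["B", "C", "D"], "website": ["b.example", "c.example", "d.example"]})

# Inner merge: keep only the rows whose key appears in both DataFrames
inner = songs.merge(links, left_on="Artist", right_on="name", how="inner")

# Left merge: keep every row of songs, with missing values where there is no match
left = songs.merge(links, left_on="Artist", right_on="name", how="left")

print(len(inner))
print(len(left))
```

An empty inner merge, like the one above, means that no key in the left DataFrame is exactly equal to a key in the right DataFrame.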

Let’s look more closely at df2. The first “name” that appears is Adele. Does Adele appear in the other DataFrame, df?

name facebook twitter website genre mtv
0 Adele http://www.facebook.com/9770929278 http://www.twitter.com/officialadele NaN Pop http://www.mtv.com/artists/adele/biography
1 Joey + Rory http://www.facebook.com/15044507815 http://www.twitter.com/joeyandrory NaN Country http://www.cmt.com/artists/joey-rory/biography
2 Draaco Aventura http://www.facebook.com/856796091053581 http://www.twitter.com/DraacoAventura http://www.bandpage.com/draacoaventura Pop Latino http://www.mtv.com/artists/draaco-aventura/bio...
3 Justin Bieber http://www.facebook.com/309570926875 http://www.twitter.com/justinbieber http://www.justinbiebermusic.com Pop http://www.mtv.com/artists/justin-bieber/biogr...
4 Peer van Mladen http://www.facebook.com/264487966 http://www.twitter.com/Predrag_Jugovic http://pejaintergroup.eu/Peer_van_Mladen.html House http://www.mtv.com/artists/peer-van-mladen/bio...

The following is counter-intuitive to me. If you ask if something is in a pandas Series, pandas will check if it occurs in the index of that Series.

"Adele" in df["Artist"]

Here is a more explicit way to check the same thing.

"Adele" in df["Artist"].index

What we really want is to check if Adele occurs in the values of the pandas Series. (Notice how we do not put parentheses after values. This is different from what you would do with a Python dictionary.)

"Adele" in df["Artist"].values

We’re now back where we started. Adele seems to occur in both DataFrames. Why didn’t our merge work?

Added after lecture. I see I made a mistake here. I should have done df2["name"].values, like we were just discussing above! I didn’t notice the mistake because I was expecting to get False.

"Adele" in df2["name"]

Let’s look more closely at the top-left entry. Notice how there are spaces on either side.

df2.loc[0, "name"]
' Adele '

There is a Python string method, strip, that, if you don’t pass any arguments, will remove whitespace from either end of a string.

" chris    ".strip()

We want to apply that method to each entry. Using map is a good idea, but the following does not work, because strip is a string method, not a standalone Python function.

df2["name"].map(strip)
NameError                                 Traceback (most recent call last)
/tmp/ipykernel_84/2096989577.py in <module>
----> 1 df2["name"].map(strip)

NameError: name 'strip' is not defined

I expected the following to work, but it didn’t because of missing values.

df2["name"].map(lambda x: x.strip())
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_84/3552155007.py in <module>
----> 1 df2["name"].map(lambda x: x.strip())

/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/series.py in map(self, arg, na_action)
   3907         dtype: object
   3908         """
-> 3909         new_values = super()._map_values(arg, na_action=na_action)
   3910         return self._constructor(new_values, index=self.index).__finalize__(
   3911             self, method="map"

/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/base.py in _map_values(self, mapper, na_action)
    936         # mapper is a function
--> 937         new_values = map_f(values, mapper)
    939         return new_values

pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

/tmp/ipykernel_84/3552155007.py in <lambda>(x)
----> 1 df2["name"].map(lambda x: x.strip())

AttributeError: 'float' object has no attribute 'strip'
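
As an aside, here are two alternatives that we did not use in lecture. This is a sketch on a made-up Series (`names` is hypothetical): the pandas string accessor str.strip, and map with the na_action="ignore" argument. Both leave missing values alone instead of raising an error.

```python
import numpy as np
import pandas as pd

# A made-up Series with surrounding spaces and one missing value
names = pd.Series([" Adele ", np.nan, " Garou "])

# Option 1: the string accessor skips missing values
stripped_str = names.str.strip()

# Option 2: map, with na_action="ignore" so the function is not called on missing values
stripped_map = names.map(lambda x: x.strip(), na_action="ignore")

print(stripped_str)
print(stripped_map)
```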

Let’s remove the rows where the “name” value is missing. (I didn’t want to use dropna, because I only care about the “name” column. If something is missing in a different column, I don’t want to remove that row.)

df3 = df2[~df2["name"].isna()]
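
For comparison, dropna accepts a subset argument that restricts the check to particular columns. The following sketch uses a made-up DataFrame (`toy` is hypothetical) and suggests that the two approaches keep the same rows.

```python
import numpy as np
import pandas as pd

# A made-up DataFrame with missing values in both columns
toy = pd.DataFrame({
    "name": [" Adele ", np.nan, " Garou "],
    "website": [np.nan, "x.example", np.nan],
})

by_mask = toy[~toy["name"].isna()]       # Boolean indexing, as above
by_subset = toy.dropna(subset=["name"])  # only looks at the "name" column

print(by_mask.equals(by_subset))
```

In both cases the rows with a missing “website” are kept, because only the “name” column is checked.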

Now we can use the map method.

df3["name"].map(lambda x: x.strip())
0                              Adele
1                        Joey + Rory
2                    Draaco Aventura
3                      Justin Bieber
4                    Peer van Mladen
                    ...
2994    Crosby, Stills, Nash & Young
2995                             CRU
2996                  Crystal Waters
2997                      Crazy Town
2998                   Cynthia Fetty
Name: name, Length: 2992, dtype: object

Let’s make a new DataFrame.

df4 = df3.copy()

Let’s replace the “name” column with the stripped version.

df4["name"] = df3["name"].map(lambda x: x.strip())

Now we can finally perform our merge. If you scroll all the way to the right, you will see that the web links have been added to the far right side.

It was some work to get to this stage, and that’s what a lot of data science is like. There are sayings to the effect of, “a data scientist spends 80% of their time cleaning the data”. In this case, the cleaning we did was removing the rows where the “name” value was missing, and also removing the whitespace from either end of the name strings.

df5 = df.merge(df4, left_on="Artist", right_on="name", how="inner")
Index Highest Charting Position Number of Times Charted Week of Highest Charting Song Name Streams Artist Artist Followers Song ID Genre ... Tempo Duration (ms) Valence Chord name facebook twitter website genre mtv
0 4 3 5 2021-07-02--2021-07-09 Bad Habits 37,799,456 Ed Sheeran 83293380.0 6PQ88X9TkUIAUIZJHW2upE ['pop', 'uk pop'] ... 126.026 231041.0 0.591 B Ed Sheeran http://www.facebook.com/9189674485 http://www.twitter.com/edsheeran http://www.edsheeran.com/ Singer/Songwriter http://www.mtv.com/artists/ed-sheeran/biography
1 117 80 80 2019-12-27--2020-01-03 Shape of You 6,452,492 Ed Sheeran 83293380.0 7qiZfU4dY1lWllzX7mPBI3 ['pop', 'uk pop'] ... 95.977 233713.0 0.931 C#/Db Ed Sheeran http://www.facebook.com/9189674485 http://www.twitter.com/edsheeran http://www.edsheeran.com/ Singer/Songwriter http://www.mtv.com/artists/ed-sheeran/biography
2 120 99 83 2020-07-24--2020-07-31 Perfect 6,278,765 Ed Sheeran 83293380.0 0tgVpDi06FyKpA1z0VMD4v ['pop', 'uk pop'] ... 95.050 263400.0 0.168 G#/Ab Ed Sheeran http://www.facebook.com/9189674485 http://www.twitter.com/edsheeran http://www.edsheeran.com/ Singer/Songwriter http://www.mtv.com/artists/ed-sheeran/biography
3 427 18 18 2021-01-01--2021-01-08 Afterglow 4,965,330 Ed Sheeran 1250353.0 5dA45onYgYACRp8C5xEOS9 [] ... 110.184 185487.0 0.273 B Ed Sheeran http://www.facebook.com/9189674485 http://www.twitter.com/edsheeran http://www.edsheeran.com/ Singer/Songwriter http://www.mtv.com/artists/ed-sheeran/biography
4 530 176 9 2021-02-12--2021-02-19 Photograph 4,974,880 Ed Sheeran 83337783.0 1HNkqx9Ahdgi1Ixy2xkKkL ['pop', 'uk pop'] ... 107.989 258987.0 0.201 E Ed Sheeran http://www.facebook.com/9189674485 http://www.twitter.com/edsheeran http://www.edsheeran.com/ Singer/Songwriter http://www.mtv.com/artists/ed-sheeran/biography
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
486 1508 157 1 2020-01-17--2020-01-24 Hands 5,934,464 Mac Miller 6189454.0 6CrnvqCxBKVWahSiQwOesM ['hip hop', 'pittsburgh rap', 'rap'] ... 74.962 199981.0 0.542 C Mac Miller http://www.facebook.com/125173346802 http://www.twitter.com/macmiller http://macmillerofficial.com Hip-Hop/Rap http://www.mtv.com/artists/mac-miller/biography
487 1437 70 2 2020-01-31--2020-02-07 I Do It (ft. Big Sean, Lil Baby) 5,561,205 Lil Wayne 10710088.0 1bRO28yzxgO3y3UmNR29TZ ['hip hop', 'new orleans rap', 'pop rap', 'rap... ... 138.005 184440.0 0.321 C#/Db Lil Wayne http://www.facebook.com/LilWayne https://twitter.com/LilTunechi http://www.youngmoney.com/ Hip-Hop/Rap http://www.mtv.com/artists/lil-wayne/biography
488 1464 190 1 2020-01-31--2020-02-07 Hips Don't Lie (feat. Wyclef Jean) 4,918,636 Shakira 22136717.0 3ZFTkvIE7kyPt6Nu3PEa7V ['colombian pop', 'dance pop', 'latin', 'latin... ... 100.024 218093.0 0.758 A#/Bb Shakira http://www.facebook.com/5027904559 http://www.twitter.com/shakira http://www.shakira.net/ World/International http://www.mtv.com/artists/shakira/biography
489 1466 191 4 2020-01-03--2020-01-10 Going Bad (feat. Drake) 4,889,500 Meek Mill 5241145.0 2IRZnDFmlqMuOrYOLnZZyc ['hip hop', 'philly rap', 'pop rap', 'rap', 's... ... 86.003 180522.0 0.544 E Meek Mill http://www.facebook.com/199361046769455 http://www.twitter.com/meekmill http://www.meekmilldreamteam.com Hip-Hop http://www.mtv.com/artists/meek-mill/biography
490 1537 125 2 2019-12-27--2020-01-03 Writing on the Wall (feat. Post Malone, Cardi ... 5,229,616 French Montana 4039037.0 7x9nXsowok1JszkVztI5NI ['gangster rap', 'hip hop', 'pop rap', 'rap', ... ... 112.010 201271.0 0.497 A French Montana http://www.facebook.com/450131130590 http://www.twitter.com/frenchmontana http://www.frenchmontanamusic.com Hip-Hop http://www.mtv.com/artists/french-montana/biog...

491 rows × 29 columns
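
Here is the whole cleaning process again on made-up data, as a minimal sketch (`songs`, `links`, and their contents are hypothetical). It shows the merge failing before cleaning and working after.

```python
import pandas as pd

# Made-up data: the names in links have surrounding spaces and one is missing
songs = pd.DataFrame({"Artist": ["Adele", "Shakira"], "Song Name": ["song 1", "song 2"]})
links = pd.DataFrame({"name": [" Adele ", None, " Shakira "], "genre": ["Pop", "Rock", "World"]})

# Before cleaning: no key matches because of the surrounding spaces
before = songs.merge(links, left_on="Artist", right_on="name", how="inner")

# Cleaning: remove rows with a missing name, then strip the whitespace
clean = links[~links["name"].isna()].copy()
clean["name"] = clean["name"].map(lambda x: x.strip())

# After cleaning: both artists match
after = songs.merge(clean, left_on="Artist", right_on="name", how="inner")

print(len(before), len(after))
```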

Here we add a new encoding channel, the href channel. If you click on one of the points, Altair will open the link that is in the “twitter” column for that row.

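# On Deepnote, alt+click or command+click to open links
alt.Chart(df5).mark_circle().encode(  # mark type, x, and y are assumed (chart opening was lost)
    x="Acousticness", y="Energy", href="twitter",
    color=alt.Color("Valence", scale=alt.Scale(scheme="spectral", reverse=True)),
    tooltip=["Artist", "Song Name"],
)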