Week 10 Wednesday

Announcements

Please fill out a course evaluation if you haven’t already! (I think that link will work; let me know if it doesn’t.)
I have office hours at 1pm today, in my office, RH 440J.
If you’re behind on the video quizzes, try to catch up today. I plan to convert them to “practice quizzes” and post the total video quiz score Thursday morning.
General plan for lectures this week: About 20 minutes lecture, then time to work on the course project.
I plan to have one Zoom office hour during our scheduled final exam time, 10:30am-11:30am on Monday, December 5th.

Some random project advice

Where to get ideas?

Browsing Kaggle is the most fun, but it might also be overwhelming. Browsing our class worksheets would be an equally good option. I’ll be very happy if your project shows that you understood material from our worksheets.

What is realistic?

A rule of thumb is that, if a human expert could not do something, then you shouldn’t expect a machine learning algorithm to be able to do it. For example, a human expert can predict the price of a house quite accurately. A human expert probably cannot predict the zip code of a house.

That rule of thumb is for advanced machine learning models. You shouldn’t expect something written in a short period of time to match a human expert. Maybe more realistic is to try to do better than random guessing or some other simple baseline algorithm (like always predicting the median value).
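Here is a minimal sketch of what such a baseline could look like. The prices below are made up for illustration (they are not from any dataset we used), and mean absolute error is just one possible way to measure performance.

```python
from statistics import median

# Hypothetical house prices (in thousands of dollars), made up for illustration
train_prices = [310, 450, 295, 520, 380]
test_prices = [400, 330, 610]

# Baseline "model": ignore every input feature and always predict the training median
baseline = median(train_prices)

# Mean absolute error of the baseline on the test prices
baseline_mae = sum(abs(p - baseline) for p in test_prices) / len(test_prices)

print(baseline)
print(baseline_mae)
```

If your own model's error on the same test data is not smaller than the baseline's error, that is a sign the model has not learned much.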

What if my project is too short?

One option is to just load a different dataset and do something else. (Don’t do the same thing twice… that’s not a good use of time.)

What should I do to get a good grade?

Explain what you’re doing clearly (in markdown cells, not Python comments) and show me what you learned in Math 10. My favorite projects are the ones that clearly use the Math 10 material.

Can you say more about references?

You don’t have to reference my lecture notes, but basically everything else should be referenced (even if you make changes to it). Provide a precise link when possible using this markdown syntax: [text to display](http://www.uci.edu) which will result in this: text to display. Ask on Ed Discussion if you’re unsure about anything.

More practice with the Spotify dataset

We also used this dataset on Monday.

import pandas as pd
import altair as alt

df = pd.read_csv("spotify_dataset.csv", na_values=" ").dropna(axis=0).copy()

alt.Chart(df).mark_circle().encode(
    x="Energy",
    y="Danceability",
    color=alt.Color("Valence", scale=alt.Scale(scheme="spectral", reverse=True)),
    tooltip=["Artist", "Song Name"],
)

Load the data from this page on GitHub and name the result df2.

There are lots of ways to get this data (for example, you could probably copy and paste it into Excel, and then save the Excel file as a csv file). We’ll see a surprisingly easy way.

If you follow the above link, you will notice a button that says raw near the table. If you click that button, you will get the contents of the csv file, without any formatting. Also notice that the resulting url ends in csv. We save that url as a string here.

# URL for the raw data
url = "https://gist.githubusercontent.com/mbejda/9912f7a366c62c1f296c/raw/dd94a25492b3062f4ca0dc2bb2cdf23fec0896ea/10000-MTV-Music-Artists-page-1.csv"

We can now load the data directly from that website. Notice how we do not need to download the csv file to our computer first.

df2 = pd.read_csv(url)

df2.head()

	name	facebook	twitter	website	genre	mtv
0	Adele	http://www.facebook.com/9770929278	http://www.twitter.com/officialadele	NaN	Pop	http://www.mtv.com/artists/adele/biography
1	Joey + Rory	http://www.facebook.com/15044507815	http://www.twitter.com/joeyandrory	NaN	Country	http://www.cmt.com/artists/joey-rory/biography
2	Draaco Aventura	http://www.facebook.com/856796091053581	http://www.twitter.com/DraacoAventura	http://www.bandpage.com/draacoaventura	Pop Latino	http://www.mtv.com/artists/draaco-aventura/bio...
3	Justin Bieber	http://www.facebook.com/309570926875	http://www.twitter.com/justinbieber	http://www.justinbiebermusic.com	Pop	http://www.mtv.com/artists/justin-bieber/biogr...
4	Peer van Mladen	http://www.facebook.com/264487966	http://www.twitter.com/Predrag_Jugovic	http://pejaintergroup.eu/Peer_van_Mladen.html	House	http://www.mtv.com/artists/peer-van-mladen/bio...

Save that dataset as a csv file, in case it later disappears from GitHub.

When using this to_csv method, I almost always use index=False, because most of the datasets I work with do not contain any interesting information in the index.

df2.to_csv("from_github.csv", index=False)
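To see what index=False changes, here is a small sketch on a made-up DataFrame. The name toy and the in-memory buffers are only for illustration, so that nothing gets written to disk.

```python
import io
import pandas as pd

# Small made-up DataFrame standing in for df2
toy = pd.DataFrame({"name": ["Adele", "Garou"], "genre": ["Pop", "World/Reggae"]})

# Write to in-memory text buffers instead of real files
with_index = io.StringIO()
without_index = io.StringIO()
toy.to_csv(with_index)                  # default: the index is written as an extra column
toy.to_csv(without_index, index=False)  # the index is left out

# Read both versions back and compare the column names
with_index.seek(0)
without_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```

With the default setting, the saved index comes back as an extra unnamed column when the file is read again, which is usually not what you want.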

Merge the result into the Spotify dataset.

Remember this merge method in case you find yourself wanting to combine multiple datasets.

df.sample(3)

	Index	Highest Charting Position	Number of Times Charted	Week of Highest Charting	Song Name	Streams	Artist	Artist Followers	Song ID	Genre	...	Danceability	Energy	Loudness	Speechiness	Acousticness	Liveness	Tempo	Duration (ms)	Valence	Chord
528	529	49	22	2020-11-13--2020-11-20	Sofia	4,982,892	Clairo	2722638.0	7B3z0ySL9Rr0XvZEAjWZzM	['bedroom pop', 'indie pop', 'pop']	...	0.744	0.619	-9.805	0.039	0.59800	0.2310	112.997	188387.0	0.641	C
1536	1537	125	2	2019-12-27--2020-01-03	Writing on the Wall (feat. Post Malone, Cardi ...	5,229,616	French Montana	4039037.0	7x9nXsowok1JszkVztI5NI	['gangster rap', 'hip hop', 'pop rap', 'rap', ...	...	0.773	0.836	-2.326	0.153	0.28300	0.0828	112.010	201271.0	0.497	A
815	816	14	6	2020-09-25--2020-10-02	FRANCHISE (feat. Young Thug & M.I.A.)	4,821,213	Travis Scott	17732077.0	4jVBIpuOiMj1crqd8LoCrJ	['rap', 'slap house']	...	0.835	0.699	-5.405	0.277	0.00671	0.1950	154.981	202795.0	0.547	G#/Ab

3 rows × 23 columns

df2.sample(3)

	name	facebook	twitter	website	genre	mtv
1111	Andy Graham	NaN	NaN	NaN	NaN	http://www.mtv.com/artists/andy-graham/biography
2158	Garou	http://www.facebook.com/11515854604	http://www.twitter.com/garou_officiel	NaN	World/Reggae	http://www.mtv.com/artists/garou/biography
2518	The Ramones	http://www.facebook.com/12789020378	NaN	http://officialramones.com/	Rock	http://www.mtv.com/artists/the-ramones/biography

Notice how both DataFrames contain the artist name, in the columns named “Artist” and “name”, respectively. We try to merge on these two columns. No error is raised, but the resulting DataFrame is empty. (The how="inner" argument tells pandas to keep only the rows whose key value appears in both DataFrames.)

df.merge(df2, left_on="Artist", right_on="name", how="inner")

	Index	Highest Charting Position	Number of Times Charted	Week of Highest Charting	Song Name	Streams	Artist	Artist Followers	Song ID	Genre	...	Tempo	Duration (ms)	Valence	Chord	name	facebook	twitter	website	genre	mtv

0 rows × 29 columns
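As a sketch of what how="inner" does when the keys do match, here is a merge of two tiny made-up DataFrames (songs and artists are hypothetical names, not part of our datasets), compared with how="left".

```python
import pandas as pd

# Made-up data: artist "B" appears in both DataFrames, "A" and "C" in only one
songs = pd.DataFrame({"Artist": ["A", "B", "B"], "Song Name": ["s1", "s2", "s3"]})
artists = pd.DataFrame({"name": ["B", "C"], "genre": ["Pop", "Rock"]})

# Inner merge: keep only the rows whose key appears on both sides
inner = songs.merge(artists, left_on="Artist", right_on="name", how="inner")

# Left merge: keep every row of songs, filling in missing values where there is no match
left = songs.merge(artists, left_on="Artist", right_on="name", how="left")

print(inner)
print(left)
```

The inner merge keeps the two songs by "B", while the left merge also keeps the song by "A" with missing values in the columns coming from artists.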

Let’s look more closely at df2. The first “name” that appears is Adele. Does Adele appear in the other DataFrame, df?

df2.head()

	name	facebook	twitter	website	genre	mtv
0	Adele	http://www.facebook.com/9770929278	http://www.twitter.com/officialadele	NaN	Pop	http://www.mtv.com/artists/adele/biography
1	Joey + Rory	http://www.facebook.com/15044507815	http://www.twitter.com/joeyandrory	NaN	Country	http://www.cmt.com/artists/joey-rory/biography
2	Draaco Aventura	http://www.facebook.com/856796091053581	http://www.twitter.com/DraacoAventura	http://www.bandpage.com/draacoaventura	Pop Latino	http://www.mtv.com/artists/draaco-aventura/bio...
3	Justin Bieber	http://www.facebook.com/309570926875	http://www.twitter.com/justinbieber	http://www.justinbiebermusic.com	Pop	http://www.mtv.com/artists/justin-bieber/biogr...
4	Peer van Mladen	http://www.facebook.com/264487966	http://www.twitter.com/Predrag_Jugovic	http://pejaintergroup.eu/Peer_van_Mladen.html	House	http://www.mtv.com/artists/peer-van-mladen/bio...

The following is counter-intuitive to me. If you use the Python in operator to ask whether something is in a pandas Series, pandas checks whether it occurs in the index of that Series, not in its values.

"Adele" in df["Artist"]

False

Here is a more explicit way to check the same thing.

"Adele" in df["Artist"].index

False

What we really want is to check if Adele occurs in the values of the pandas Series. (Notice how we do not put parentheses after values, because for a pandas Series values is an attribute, not a method. This is different from what you would do with a Python dictionary, where you would call values().)

"Adele" in df["Artist"].values

True
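Here is the same behavior on a tiny made-up Series, where the index labels are chosen to be obviously different from the values. The isin line at the end is an alternative I did not use in lecture.

```python
import pandas as pd

# Made-up Series whose index labels (10, 20) are clearly different from its values
s = pd.Series(["Adele", "Garou"], index=[10, 20])

in_series = "Adele" in s            # False: the in operator looks at the index labels
label_in_series = 10 in s           # True: 10 is one of the index labels
in_values = "Adele" in s.values     # True: this looks at the values
any_match = s.isin(["Adele"]).any() # True: elementwise check on the values, then any

print(in_series, label_in_series, in_values, any_match)
```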

We’re now back where we started. Adele seems to occur in both DataFrames. Why didn’t our merge work?

Added after lecture. I see I made a mistake here. I should have done df2["name"].values, like we were just discussing above! I didn’t notice the mistake because I was expecting to get False.

"Adele" in df2["name"]

False

Let’s look more closely at the top-left entry. Notice how there are spaces on either side.

df2.loc[0, "name"]

' Adele '

There is a Python string method, strip, that, if you don’t pass any arguments, will remove whitespace from either end of a string.

" chris    ".strip()

'chris'

We want to apply that method to each entry. Using map is a good idea, but the following does not work, because strip is a string method, not a standalone Python function, so the bare name strip is not defined.

df2["name"].map(strip)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/tmp/ipykernel_84/2096989577.py in <module>
----> 1 df2["name"].map(strip)

NameError: name 'strip' is not defined

I expected the following to work, but it didn’t because of missing values. pandas represents a missing value as a float (NaN), and floats do not have a strip method.

df2["name"].map(lambda x: x.strip())

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_84/3552155007.py in <module>
----> 1 df2["name"].map(lambda x: x.strip())

/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/series.py in map(self, arg, na_action)
   3907         dtype: object
   3908         """
-> 3909         new_values = super()._map_values(arg, na_action=na_action)
   3910         return self._constructor(new_values, index=self.index).__finalize__(
   3911             self, method="map"

/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/base.py in _map_values(self, mapper, na_action)
    935 
    936         # mapper is a function
--> 937         new_values = map_f(values, mapper)
    938 
    939         return new_values

pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

/tmp/ipykernel_84/3552155007.py in <lambda>(x)
----> 1 df2["name"].map(lambda x: x.strip())

AttributeError: 'float' object has no attribute 'strip'

Let’s remove the rows where the “name” value is missing. (I didn’t want to use dropna with its default arguments, because that removes a row if a value is missing in any column, and I only care about the “name” column. If something is missing in a different column, I don’t want to remove that row.)

df3 = df2[~df2["name"].isna()]
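As an aside, dropna accepts a subset argument that restricts the check to the listed columns. The following sketch, on a made-up DataFrame named toy, suggests that this gives the same result as the Boolean indexing used above.

```python
import numpy as np
import pandas as pd

# Made-up data: a missing name in one row, missing websites in the other rows
toy = pd.DataFrame({
    "name": [" Adele ", np.nan, "Garou"],
    "website": [np.nan, "http://example.com", np.nan],
})

# Boolean indexing, as in the notes
mask_version = toy[~toy["name"].isna()]

# dropna restricted to the "name" column
subset_version = toy.dropna(subset=["name"])

# dropna with default arguments removes a row if anything in it is missing
default_version = toy.dropna()

print(mask_version.equals(subset_version))
print(len(default_version))
```

Here every row has a missing value somewhere, so the default dropna would leave nothing, while the other two versions only remove the row with the missing name.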

Now we can use the map method.

df3["name"].map(lambda x: x.strip())

                            Adele
                      Joey + Rory
                  Draaco Aventura
                    Justin Bieber
                  Peer van Mladen
                    ...             
  Crosby, Stills, Nash & Young
                           CRU
                Crystal Waters
                    Crazy Town
                 Cynthia Fetty
Name: name, Length: 2992, dtype: object

Let’s make a new DataFrame.

df4 = df3.copy()

Let’s replace the “name” column with the stripped version.

df4["name"] = df3["name"].map(lambda x: x.strip())
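For comparison, here is a sketch of two other ways to strip the names that skip over missing values instead of raising an error. These are alternatives to what we did above, shown on a made-up Series: the str accessor, and the na_action argument to map.

```python
import numpy as np
import pandas as pd

# Made-up names with extra whitespace and one missing value
names = pd.Series([" Adele ", np.nan, "  Garou"])

# The .str accessor applies the string method to each entry and leaves NaN alone
stripped = names.str.strip()

# map with na_action="ignore" does not call the function on missing values
mapped = names.map(lambda x: x.strip(), na_action="ignore")

print(stripped.tolist())
print(mapped.tolist())
```

With either approach, the rows with missing names stay in the Series, so whether to remove them becomes a separate decision.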

Now we can finally perform our merge. If you scroll all the way to the right, you will see that the web links have been added to the far right side.

It took some work to get to this stage, and that’s what a lot of data science is like. There are sayings to the effect of, “a data scientist spends 80% of their time cleaning the data”. In this case, the cleaning consisted of removing the rows where the “name” value was missing and stripping the whitespace from the name strings.

df5 = df.merge(df4, left_on="Artist", right_on="name", how="inner")
df5

	Index	Highest Charting Position	Number of Times Charted	Week of Highest Charting	Song Name	Streams	Artist	Artist Followers	Song ID	Genre	...	Tempo	Duration (ms)	Valence	Chord	name	facebook	twitter	website	genre	mtv
0	4	3	5	2021-07-02--2021-07-09	Bad Habits	37,799,456	Ed Sheeran	83293380.0	6PQ88X9TkUIAUIZJHW2upE	['pop', 'uk pop']	...	126.026	231041.0	0.591	B	Ed Sheeran	http://www.facebook.com/9189674485	http://www.twitter.com/edsheeran	http://www.edsheeran.com/	Singer/Songwriter	http://www.mtv.com/artists/ed-sheeran/biography
1	117	80	80	2019-12-27--2020-01-03	Shape of You	6,452,492	Ed Sheeran	83293380.0	7qiZfU4dY1lWllzX7mPBI3	['pop', 'uk pop']	...	95.977	233713.0	0.931	C#/Db	Ed Sheeran	http://www.facebook.com/9189674485	http://www.twitter.com/edsheeran	http://www.edsheeran.com/	Singer/Songwriter	http://www.mtv.com/artists/ed-sheeran/biography
2	120	99	83	2020-07-24--2020-07-31	Perfect	6,278,765	Ed Sheeran	83293380.0	0tgVpDi06FyKpA1z0VMD4v	['pop', 'uk pop']	...	95.050	263400.0	0.168	G#/Ab	Ed Sheeran	http://www.facebook.com/9189674485	http://www.twitter.com/edsheeran	http://www.edsheeran.com/	Singer/Songwriter	http://www.mtv.com/artists/ed-sheeran/biography
3	427	18	18	2021-01-01--2021-01-08	Afterglow	4,965,330	Ed Sheeran	1250353.0	5dA45onYgYACRp8C5xEOS9	[]	...	110.184	185487.0	0.273	B	Ed Sheeran	http://www.facebook.com/9189674485	http://www.twitter.com/edsheeran	http://www.edsheeran.com/	Singer/Songwriter	http://www.mtv.com/artists/ed-sheeran/biography
4	530	176	9	2021-02-12--2021-02-19	Photograph	4,974,880	Ed Sheeran	83337783.0	1HNkqx9Ahdgi1Ixy2xkKkL	['pop', 'uk pop']	...	107.989	258987.0	0.201	E	Ed Sheeran	http://www.facebook.com/9189674485	http://www.twitter.com/edsheeran	http://www.edsheeran.com/	Singer/Songwriter	http://www.mtv.com/artists/ed-sheeran/biography
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
486	1508	157	1	2020-01-17--2020-01-24	Hands	5,934,464	Mac Miller	6189454.0	6CrnvqCxBKVWahSiQwOesM	['hip hop', 'pittsburgh rap', 'rap']	...	74.962	199981.0	0.542	C	Mac Miller	http://www.facebook.com/125173346802	http://www.twitter.com/macmiller	http://macmillerofficial.com	Hip-Hop/Rap	http://www.mtv.com/artists/mac-miller/biography
487	1437	70	2	2020-01-31--2020-02-07	I Do It (ft. Big Sean, Lil Baby)	5,561,205	Lil Wayne	10710088.0	1bRO28yzxgO3y3UmNR29TZ	['hip hop', 'new orleans rap', 'pop rap', 'rap...	...	138.005	184440.0	0.321	C#/Db	Lil Wayne	http://www.facebook.com/LilWayne	https://twitter.com/LilTunechi	http://www.youngmoney.com/	Hip-Hop/Rap	http://www.mtv.com/artists/lil-wayne/biography
488	1464	190	1	2020-01-31--2020-02-07	Hips Don't Lie (feat. Wyclef Jean)	4,918,636	Shakira	22136717.0	3ZFTkvIE7kyPt6Nu3PEa7V	['colombian pop', 'dance pop', 'latin', 'latin...	...	100.024	218093.0	0.758	A#/Bb	Shakira	http://www.facebook.com/5027904559	http://www.twitter.com/shakira	http://www.shakira.net/	World/International	http://www.mtv.com/artists/shakira/biography
489	1466	191	4	2020-01-03--2020-01-10	Going Bad (feat. Drake)	4,889,500	Meek Mill	5241145.0	2IRZnDFmlqMuOrYOLnZZyc	['hip hop', 'philly rap', 'pop rap', 'rap', 's...	...	86.003	180522.0	0.544	E	Meek Mill	http://www.facebook.com/199361046769455	http://www.twitter.com/meekmill	http://www.meekmilldreamteam.com	Hip-Hop	http://www.mtv.com/artists/meek-mill/biography
490	1537	125	2	2019-12-27--2020-01-03	Writing on the Wall (feat. Post Malone, Cardi ...	5,229,616	French Montana	4039037.0	7x9nXsowok1JszkVztI5NI	['gangster rap', 'hip hop', 'pop rap', 'rap', ...	...	112.010	201271.0	0.497	A	French Montana	http://www.facebook.com/450131130590	http://www.twitter.com/frenchmontana	http://www.frenchmontanamusic.com	Hip-Hop	http://www.mtv.com/artists/french-montana/biog...

491 rows × 29 columns
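If a merge unexpectedly comes back empty, as our first attempt did, one way to investigate is an outer merge with indicator=True. This is only a sketch on made-up miniature versions of the two DataFrames (left and right are hypothetical names), not something we did in lecture.

```python
import pandas as pd

# Made-up miniature versions of df and df2 (values are for illustration only)
left = pd.DataFrame({"Artist": ["Adele", "Shakira"]})
right = pd.DataFrame({"name": [" Adele ", "Shakira"], "genre": ["Pop", "World/International"]})

# how="outer" keeps every row from both sides; indicator=True adds a "_merge"
# column recording whether each row was found in the left, the right, or both
check = left.merge(right, left_on="Artist", right_on="name", how="outer", indicator=True)

# Rows that failed to match; repr makes stray whitespace in the keys visible
unmatched = check[check["_merge"] != "both"]
unmatched_names = [repr(x) for x in unmatched["name"].dropna()]

print(check[["Artist", "name", "_merge"]])
print(unmatched_names)
```

Looking at the repr of the unmatched keys makes problems like leading or trailing spaces easy to spot.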

Here we add a new encoding channel, the href channel. If you click on one of the points, Altair will open the link that is in the “twitter” column for that row.

# On Deepnote, alt+click or command+click to open links
alt.Chart(df5).mark_circle().encode(
    x="Energy",
    y="Danceability",
    color=alt.Color("Valence", scale=alt.Scale(scheme="spectral", reverse=True)),
    tooltip=["Artist", "Song Name"],
    href="twitter"
)

UC Irvine Math 10, Fall 2022
