Functions in Python
Announcement¶
My usual office hours are cancelled this week (and Week 7). They are replaced by Tuesday 3-4pm on Zoom.
Warm-up¶
(This part doesn’t have anything to do with functions, but we will use the Spotify dataset later.)
Import the attached Spotify dataset, specifying that missing values are denoted in this csv file by a blank space.
Drop all rows containing nan values using isna and any.
Check your answer: The DataFrame should have 1545 rows, and 9 of the 23 columns should have “object” as their dtype.
import pandas as pd
df = pd.read_csv("../data/spotify_dataset.csv")
It looks like there are no missing values in this dataset:
df["Energy"]
0 0.8
1 0.764
2 0.664
3 0.897
4 0.704
...
1551 0.7
1552 0.87
1553 0.523
1554 0.55
1555 0.603
Name: Energy, Length: 1556, dtype: object
df.isna().any(axis=0)
Index False
Highest Charting Position False
Number of Times Charted False
Week of Highest Charting False
Song Name False
Streams False
Artist False
Artist Followers False
Song ID False
Genre False
Release Date False
Weeks Charted False
Popularity False
Danceability False
Energy False
Loudness False
Speechiness False
Acousticness False
Liveness False
Tempo False
Duration (ms) False
Valence False
Chord False
dtype: bool
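Here is an attempt to find the missing values. The dtype of the “Energy” column is object, which indicates that the entries are strings, so we try converting each entry to a float.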
df["Energy"]
0 0.8
1 0.764
2 0.664
3 0.897
4 0.704
...
1551 0.7
1552 0.87
1553 0.523
1554 0.55
1555 0.603
Name: Energy, Length: 1556, dtype: object
[float(x) for x in df["Energy"]]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_49629/3029641965.py in <module>
----> 1 [float(x) for x in df["Energy"]]
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_49629/3029641965.py in <listcomp>(.0)
----> 1 [float(x) for x in df["Energy"]]
ValueError: could not convert string to float:
It is difficult to tell from that error message, but after the colon symbol : is a blank space. The blank spaces in this dataset represent the missing values.
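As a small sketch of what is going wrong, the snippet below tries float on a few hand-made strings and records which ones fail. The sample values are made up for illustration (they are not read from the csv file), and the helper name fails_float is hypothetical.

```python
def fails_float(entry):
    """Return True if float(entry) raises a ValueError."""
    try:
        float(entry)
    except ValueError:
        return True
    return False

# Made-up sample entries; the blank space " " plays the role of a missing value
samples = ["0.8", "0.764", " ", "0.897"]

# Positions of the entries that cannot be converted
bad_positions = [i for i, entry in enumerate(samples) if fails_float(entry)]
print(bad_positions)
```

Only the blank-space entry fails, which matches the empty-looking text after the colon in the error message.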
df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
Now there are many columns with missing values.
df.isna().any(axis=0)
Index False
Highest Charting Position False
Number of Times Charted False
Week of Highest Charting False
Song Name False
Streams False
Artist False
Artist Followers True
Song ID True
Genre True
Release Date True
Weeks Charted False
Popularity True
Danceability True
Energy True
Loudness True
Speechiness True
Acousticness True
Liveness True
Tempo True
Duration (ms) True
Valence True
Chord True
dtype: bool
Here are the rows with missing values.
# Boolean Series
df.isna().any(axis=1)
0 False
1 False
2 False
3 False
4 False
...
1551 False
1552 False
1553 False
1554 False
1555 False
Length: 1556, dtype: bool
The rows we want to keep will be the negation of the above Series.
~df.isna().any(axis=1)
0 True
1 True
2 True
3 True
4 True
...
1551 True
1552 True
1553 True
1554 True
1555 True
Length: 1556, dtype: bool
# Boolean indexing
df[~df.isna().any(axis=1)]
| | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 8 | 2021-07-23--2021-07-30 | Beggin' | 48,633,449 | Måneskin | 3377762.0 | 3Wrjm47oTz2sjIgck11l5e | ['indie rock italiano', 'italian pop'] | ... | 0.714 | 0.800 | -4.808 | 0.0504 | 0.12700 | 0.3590 | 134.002 | 211560.0 | 0.589 | B |
1 | 2 | 2 | 3 | 2021-07-23--2021-07-30 | STAY (with Justin Bieber) | 47,248,719 | The Kid LAROI | 2230022.0 | 5HCyWlXZPP0y6Gqq8TgA20 | ['australian hip hop'] | ... | 0.591 | 0.764 | -5.484 | 0.0483 | 0.03830 | 0.1030 | 169.928 | 141806.0 | 0.478 | C#/Db |
2 | 3 | 1 | 11 | 2021-06-25--2021-07-02 | good 4 u | 40,162,559 | Olivia Rodrigo | 6266514.0 | 4ZtFanR9U6ndgddUvNcjcG | ['pop'] | ... | 0.563 | 0.664 | -5.044 | 0.1540 | 0.33500 | 0.0849 | 166.928 | 178147.0 | 0.688 | A |
3 | 4 | 3 | 5 | 2021-07-02--2021-07-09 | Bad Habits | 37,799,456 | Ed Sheeran | 83293380.0 | 6PQ88X9TkUIAUIZJHW2upE | ['pop', 'uk pop'] | ... | 0.808 | 0.897 | -3.712 | 0.0348 | 0.04690 | 0.3640 | 126.026 | 231041.0 | 0.591 | B |
4 | 5 | 5 | 1 | 2021-07-23--2021-07-30 | INDUSTRY BABY (feat. Jack Harlow) | 33,948,454 | Lil Nas X | 5473565.0 | 27NovPIUIRrOZoCHxABJwK | ['lgbtq+ hip hop', 'pop rap'] | ... | 0.736 | 0.704 | -7.409 | 0.0615 | 0.02030 | 0.0501 | 149.995 | 212000.0 | 0.894 | D#/Eb |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1551 | 1552 | 195 | 1 | 2019-12-27--2020-01-03 | New Rules | 4,630,675 | Dua Lipa | 27167675.0 | 2ekn2ttSfGqwhhate0LSR0 | ['dance pop', 'pop', 'uk pop'] | ... | 0.762 | 0.700 | -6.021 | 0.0694 | 0.00261 | 0.1530 | 116.073 | 209320.0 | 0.608 | A |
1552 | 1553 | 196 | 1 | 2019-12-27--2020-01-03 | Cheirosa - Ao Vivo | 4,623,030 | Jorge & Mateus | 15019109.0 | 2PWjKmjyTZeDpmOUa3a5da | ['sertanejo', 'sertanejo universitario'] | ... | 0.528 | 0.870 | -3.123 | 0.0851 | 0.24000 | 0.3330 | 152.370 | 181930.0 | 0.714 | B |
1553 | 1554 | 197 | 1 | 2019-12-27--2020-01-03 | Havana (feat. Young Thug) | 4,620,876 | Camila Cabello | 22698747.0 | 1rfofaqEpACxVEHIZBJe6W | ['dance pop', 'electropop', 'pop', 'post-teen ... | ... | 0.765 | 0.523 | -4.333 | 0.0300 | 0.18400 | 0.1320 | 104.988 | 217307.0 | 0.394 | D |
1554 | 1555 | 198 | 1 | 2019-12-27--2020-01-03 | Surtada - Remix Brega Funk | 4,607,385 | Dadá Boladão, Tati Zaqui, OIK | 208630.0 | 5F8ffc8KWKNawllr5WsW0r | ['brega funk', 'funk carioca'] | ... | 0.832 | 0.550 | -7.026 | 0.0587 | 0.24900 | 0.1820 | 154.064 | 152784.0 | 0.881 | F |
1555 | 1556 | 199 | 1 | 2019-12-27--2020-01-03 | Lover (Remix) [feat. Shawn Mendes] | 4,595,450 | Taylor Swift | 42227614.0 | 3i9UVldZOE0aD0JnyfAZZ0 | ['pop', 'post-teen pop'] | ... | 0.448 | 0.603 | -7.176 | 0.0640 | 0.43300 | 0.0862 | 205.272 | 221307.0 | 0.422 | G |
1545 rows × 23 columns
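For intuition only, here is a plain-Python analogue of the row filter used above, with a made-up list of rows in which None stands in for a missing value. This is not how pandas implements it; the names rows, has_missing and kept are assumptions for illustration.

```python
# Made-up rows; None stands in for a missing value
rows = [
    ["song A", 0.80, "B"],
    ["song B", None, "A"],
    ["song C", 0.66, None],
    ["song D", 0.90, "B"],
]

# Analogue of df.isna().any(axis=1): one Boolean per row
has_missing = [any(value is None for value in row) for row in rows]

# Analogue of df[~...]: keep the rows where the flag is False
kept = [row for row, flag in zip(rows, has_missing) if not flag]

print(has_missing)
print(len(kept))
```

The negation step corresponds to the `not flag` condition: we keep exactly the rows that have no missing entries.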
The .copy() is not so important for us, but it can help to prevent warning messages later (for example, when we add a new column to df).
# Redefine df
df = df[~df.isna().any(axis=1)].copy()
df.dtypes
Index int64
Highest Charting Position int64
Number of Times Charted int64
Week of Highest Charting object
Song Name object
Streams object
Artist object
Artist Followers float64
Song ID object
Genre object
Release Date object
Weeks Charted object
Popularity float64
Danceability float64
Energy float64
Loudness float64
Speechiness float64
Acousticness float64
Liveness float64
Tempo float64
Duration (ms) float64
Valence float64
Chord object
dtype: object
df.dtypes.value_counts()
float64 11
object 9
int64 3
dtype: int64
Defining a function¶
We will introduce how to define functions in Python using various examples.
Write a function somepoly which takes as input two numbers x and y and as output returns $x^2 - 3xy + y^3$.
Our first attempt is wrong because of the 3xy. We need to make the multiplication explicit.
def somepoly(x,y):
    z = x**2-3xy+y**3
    return z
File "/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_49629/799430308.py", line 2
z = x**2-3xy+y**3
^
SyntaxError: invalid syntax
The following is still not correct.
def somepoly(x,y):
    z = x**2-3*xy+y**3
    return z
The error message we get is helpful. It says that xy is not defined. Python can’t predict that this is supposed to mean x*y.
somepoly(2,1)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_49629/2207245513.py in <module>
----> 1 somepoly(2,1)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_49629/3686898151.py in somepoly(x, y)
1 def somepoly(x,y):
----> 2 z = x**2-3*xy+y**3
3 return z
NameError: name 'xy' is not defined
Here is a correct version. Notice how all the lines inside the function’s code get indented.
def somepoly(x,y):
    z = x**2-3*x*y+y**3
    return z
somepoly(2,1)
-1
The code can be consolidated into a single line.
def somepoly(x,y):
    return x**2-3*x*y+y**3
somepoly(2,1)
-1
Notice that the x and y that are defined to be 2 and 1 are only getting defined locally inside the function; we can’t access those values out here.
x
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_49629/32546335.py in <module>
----> 1 x
NameError: name 'x' is not defined
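The same thing happens with any variable created inside the function body. The sketch below uses the intermediate variable z from the earlier version of somepoly and catches the NameError instead of letting it stop the program; the variable names result and message are assumptions for illustration.

```python
def somepoly(x, y):
    # z exists only while the function is running
    z = x**2 - 3*x*y + y**3
    return z

result = somepoly(2, 1)

try:
    z  # z was never defined outside the function
except NameError as err:
    message = str(err)

print(result)
print(message)
```

The returned value is available outside the function because we assigned it to result; the local name z is not.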
Edit the code for somepoly so that if the y value is not given, the default value of y=4 is used.
def somepoly(x,y=4):
    return x**2-3*x*y+y**3
somepoly(2)
44
somepoly(2,4)
44
somepoly(2,1)
-1
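Arguments can also be passed by name, which is how default values are usually overridden in practice. Here is a short sketch using the same somepoly; the variable names a, b, c, d are just for illustration.

```python
def somepoly(x, y=4):
    return x**2 - 3*x*y + y**3

# Positional and keyword forms give the same result
a = somepoly(2, 1)
b = somepoly(2, y=1)
c = somepoly(x=2, y=1)

# Leaving out y falls back to the default y=4
d = somepoly(2)

print(a, b, c, d)
```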
These sorts of default values come up all the time. Here is one for the built-in sorted function.
help(sorted)
Help on built-in function sorted in module builtins:
sorted(iterable, /, *, key=None, reverse=False)
Return a new list containing all items from the iterable in ascending order.
A custom key function can be supplied to customize the sort order, and the
reverse flag can be set to request the result in descending order.
sorted([3,1,4])
[1, 3, 4]
The following is the same as the default.
sorted([3,1,4], reverse=False)
[1, 3, 4]
Here we reverse the order.
sorted([3,1,4], reverse=True)
[4, 3, 1]
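The help text also mentions a key argument, whose default value is None. As a brief sketch, here we sort a made-up list of words by length by passing the built-in len as the key.

```python
# Made-up list of words for illustration
words = ["pandas", "map", "lambda", "def"]

# Sort by length instead of alphabetically
by_length = sorted(words, key=len)

# Combine key with reverse to get longest first
by_length_desc = sorted(words, key=len, reverse=True)

print(by_length)
print(by_length_desc)
print(words)  # sorted returns a new list; the original is unchanged
```

Words of equal length keep their original relative order, since Python's sort is stable.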
Write a function backwards which takes as input a list and as output returns the same list, but in the reverse order. For example, if the input is [3,1,4], then the output should be [4,1,3]. (Warning: this has nothing to do with the reverse above… it’s just a coincidence that both involve the word reverse.)
This is a Pythonic solution.
# slicing
def backwards(x):
    return x[::-1]
backwards([3,1,4])
[4, 1, 3]
Here is a solution using list comprehension. The slicing solution above is much easier to read, so it is the better choice.
def backwards2(x):
    return [x[-i] for i in range(1,len(x)+1)]
backwards2([3,1,4])
[4, 1, 3]
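For comparison, here is a third possible version using the built-in reversed function, together with the list method reverse, which behaves differently. The name backwards3 is hypothetical and not part of the original exercise.

```python
def backwards3(x):
    # reversed returns an iterator, so we wrap it in list
    return list(reversed(x))

original = [3, 1, 4]
result = backwards3(original)

# The list method reverse changes the list in place and returns None
in_place = [3, 1, 4]
returned = in_place.reverse()

print(result, original)
print(in_place, returned)
```

Both the slicing version and backwards3 leave the input list unchanged, whereas the reverse method modifies it.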
Here are more examples with slicing.
y = [2,1,2,4,5,6,7,1,10]
y[2:6:1]
[2, 4, 5, 6]
y[6:2:-1]
[7, 6, 5, 4]
y[:2:-1]
[10, 1, 7, 6, 5, 4]
y[::-2]
[10, 7, 5, 2, 2]
y[:0:-2]
[10, 7, 5, 2]
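One way to see what start, stop and step Python fills in when parts of a slice are omitted is the indices method of a slice object. The helper explain_slice below is a hypothetical name, written only to make the rule visible; it rebuilds each slice from an explicit range.

```python
y = [2, 1, 2, 4, 5, 6, 7, 1, 10]

def explain_slice(seq, s):
    """Return the (start, stop, step) Python uses and the resulting elements."""
    start, stop, step = s.indices(len(seq))
    elements = [seq[i] for i in range(start, stop, step)]
    return (start, stop, step), elements

# y[2:6:1], y[:2:-1] and y[::-2] written as slice objects
for s in [slice(2, 6, 1), slice(None, 2, -1), slice(None, None, -2)]:
    print(explain_slice(y, s))
```

With a negative step, an omitted start means "begin at the last element", which is why y[:2:-1] starts from 10.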
Define a function top5 which takes as input a number and as output returns True if the number is greater than 0 and less than or equal to 5, and otherwise returns False.
The main new component here is using and. In basic Python, one usually uses and, or, not. In pandas and NumPy, one usually uses &, |, ~. (In fact, we used ~ above when we negated our isna Boolean Series.)
def top5(x):
    return (x > 0) and (x <= 5)
top5(3.14)
True
top5(7)
False
True and True
True
True and False
False
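Two related points, sketched in plain Python. First, Python allows chained comparisons, so the same test can be written without and (the name top5_chained is hypothetical). Second, and does not work elementwise on lists: it simply returns one of its two operands, which gives some intuition for why a different operator is used for elementwise logic in pandas and NumPy.

```python
def top5_chained(x):
    # Chained comparison: equivalent to (0 < x) and (x <= 5)
    return 0 < x <= 5

values = [3.14, 7, 0, 5, -2]
results = [top5_chained(v) for v in values]
print(results)

# "and" on two lists is not elementwise: a non-empty list counts as true,
# so the expression evaluates to the second list
pair = [True, False] and [False, True]
print(pair)

# An elementwise "and" has to be written out explicitly for lists
elementwise = [a and b for a, b in zip([True, False], [False, True])]
print(elementwise)
```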
Write a function remove_comma that takes as input a string and as output returns the same string with all commas removed. (Hint. Every string has a replace method. Use help to learn how to use it.)
Here is an example of using the replace method of a string.
s = "chris"
help(s.replace)
Help on built-in function replace:
replace(old, new, count=-1, /) method of builtins.str instance
Return a copy with all occurrences of substring old replaced by new.
count
Maximum number of occurrences to replace.
-1 (the default value) means replace all occurrences.
If the optional argument count is given, only the first count occurrences are
replaced.
"christopher".replace("h","Math 10")
'cMath 10ristopMath 10er'
s = "1,234"
s.replace(",", "")
'1234'
Does it work with integers and not only strings? Not in this case.
12345.replace(2,7)
File "/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_49629/2769868973.py", line 1
12345.replace(2,7)
^
SyntaxError: invalid syntax
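The SyntaxError arises because Python reads 12345. as the beginning of a float literal. Independently of that, integers do not have a replace method at all. If we really want to replace digits in an integer, one possible approach (a sketch, not the only way) is to convert to a string first and convert back afterwards.

```python
n = 12345

# Integers have no replace method
has_replace = hasattr(n, "replace")
print(has_replace)

# Convert to a string, replace, then convert back to an integer.
# Note that the arguments to replace must be strings as well.
replaced = int(str(n).replace("2", "7"))
print(replaced)
```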
Now that we understand how replace works, we can write a function using it.
def remove_comma(s):
    return s.replace(",", "")
remove_comma("abc,def,ghi")
'abcdefghi'
Using functions with pandas Series¶
To apply a function to every value in a pandas Series, we will use the map method. Try not to confuse this map method with apply and applymap which we will introduce later in Math 10.
Make a Boolean Series from the “Highest Charting Position” column which indicates whether or not the “Highest Charting Position” was in the top 5. Use our usual pandas methods for making Boolean Series.
df["Highest Charting Position"] <= 5
0 True
1 True
2 True
3 True
4 True
...
1551 False
1552 False
1553 False
1554 False
1555 False
Name: Highest Charting Position, Length: 1545, dtype: bool
Make the same Boolean Series using map and the function top5 that was defined above.
top5(196)
False
df["Highest Charting Position"]
0 1
1 2
2 1
3 3
4 5
...
1551 195
1552 196
1553 197
1554 198
1555 199
Name: Highest Charting Position, Length: 1545, dtype: int64
We use the map method to apply the top5 function elementwise. Notice that we do not put parentheses after the function name.
# map applies elementwise
df["Highest Charting Position"].map(top5)
0 True
1 True
2 True
3 True
4 True
...
1551 False
1552 False
1553 False
1554 False
1555 False
Name: Highest Charting Position, Length: 1545, dtype: bool
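Python also has a built-in map function that works on ordinary lists. It is a different object from the pandas Series method, but the idea of passing the function itself (without parentheses) is the same. Here is a sketch using a short made-up list of chart positions.

```python
def top5(x):
    return (x > 0) and (x <= 5)

# Made-up list of positions for illustration
positions = [1, 2, 5, 195, 196]

# Pass the function object top5, not the result of calling it
flags = list(map(top5, positions))

# The same result using a list comprehension
same = [top5(p) for p in positions]

print(flags)
print(flags == same)
```

Writing map(top5(), positions) would instead try to call top5 with no arguments and raise an error.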
Try to convert the “Streams” column into numeric values using pd.to_numeric. (It’s not supposed to work.)
df["Streams"]
0 48,633,449
1 47,248,719
2 40,162,559
3 37,799,456
4 33,948,454
...
1551 4,630,675
1552 4,623,030
1553 4,620,876
1554 4,607,385
1555 4,595,450
Name: Streams, Length: 1545, dtype: object
pd.to_numeric(df["Streams"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "48,633,449"
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_49629/2129868835.py in <module>
----> 1 pd.to_numeric(df["Streams"])
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/tools/numeric.py in to_numeric(arg, errors, downcast)
182 try:
183 values, _ = lib.maybe_convert_numeric(
--> 184 values, set(), coerce_numeric=coerce_numeric
185 )
186 except (ValueError, TypeError):
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "48,633,449" at position 0
int("48,633,449")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_49629/1792345402.py in <module>
----> 1 int("48,633,449")
ValueError: invalid literal for int() with base 10: '48,633,449'
remove_comma("48,633,449")
'48633449'
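For a single string, removing the commas first makes the conversion with int succeed. The sketch below combines the two steps in a helper; the name to_int is hypothetical, and the three strings are values shown in the “Streams” column above.

```python
def remove_comma(s):
    return s.replace(",", "")

def to_int(s):
    # Hypothetical helper: strip the commas, then convert
    return int(remove_comma(s))

streams = ["48,633,449", "47,248,719", "4,595,450"]
numbers = [to_int(s) for s in streams]
print(numbers)
```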
Instead use map and the remove_comma function from above, followed by pd.to_numeric.
df["Streams"].map(remove_comma)
0 48633449
1 47248719
2 40162559
3 37799456
4 33948454
...
1551 4630675
1552 4623030
1553 4620876
1554 4607385
1555 4595450
Name: Streams, Length: 1545, dtype: object
pd.to_numeric(df["Streams"].map(remove_comma))
0 48633449
1 47248719
2 40162559
3 37799456
4 33948454
...
1551 4630675
1552 4623030
1553 4620876
1554 4607385
1555 4595450
Name: Streams, Length: 1545, dtype: int64
Here we put the result into a new column of df.
df["StreamsNumeric"] = pd.to_numeric(df["Streams"].map(remove_comma))
How many of the songs have more than ten million streams? (Hint. The answer should be 103.)
We can’t use the “Streams” column because it contains strings.
df["Streams"] > 10**7
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_49629/33243062.py in <module>
----> 1 df["Streams"] > 10**7
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/ops/common.py in new_method(self, other)
67 other = item_from_zerodim(other)
68
---> 69 return method(self, other)
70
71 return new_method
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/arraylike.py in __gt__(self, other)
46 @unpack_zerodim_and_defer("__gt__")
47 def __gt__(self, other):
---> 48 return self._cmp_method(other, operator.gt)
49
50 @unpack_zerodim_and_defer("__ge__")
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/series.py in _cmp_method(self, other, op)
5500
5501 with np.errstate(all="ignore"):
-> 5502 res_values = ops.comparison_op(lvalues, rvalues, op)
5503
5504 return self._construct_result(res_values, name=res_name)
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in comparison_op(left, right, op)
282
283 elif is_object_dtype(lvalues.dtype) or isinstance(rvalues, str):
--> 284 res_values = comp_method_OBJECT_ARRAY(op, lvalues, rvalues)
285
286 else:
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in comp_method_OBJECT_ARRAY(op, x, y)
71 result = libops.vec_compare(x.ravel(), y.ravel(), op)
72 else:
---> 73 result = libops.scalar_compare(x.ravel(), y, op)
74 return result.reshape(x.shape)
75
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/ops.pyx in pandas._libs.ops.scalar_compare()
TypeError: '>' not supported between instances of 'str' and 'int'
It does work if we use the numeric column.
df["StreamsNumeric"] > 10**7
0 True
1 True
2 True
3 True
4 True
...
1551 False
1552 False
1553 False
1554 False
1555 False
Name: StreamsNumeric, Length: 1545, dtype: bool
We count True using sum.
(df["StreamsNumeric"] > 10**7).sum()
103
We could also count True using value_counts.
(df["StreamsNumeric"] > 10**7).value_counts()
False 1442
True 103
Name: StreamsNumeric, dtype: int64
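Counting with sum works because Python treats True as 1 and False as 0 in arithmetic. Here is a plain-Python sketch with a made-up list of Booleans; Counter from the standard library plays a role loosely similar to value_counts, though it is not what pandas uses internally.

```python
from collections import Counter

# Made-up Boolean values for illustration
flags = [True, False, True, True, False]

# True counts as 1 and False as 0
number_true = sum(flags)
print(number_true)

# Tally both values at once
tally = Counter(flags)
print(tally[True], tally[False])
```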
lambda functions¶
lambda functions provide a concise (and Pythonic) way to define a function. They are not suited to complex functions, but they are frequently used to define simple ones. They are the equivalent of anonymous functions in Matlab.
Again convert the “Streams” column to numeric, this time using a lambda function.
df["Streams"]
0 48,633,449
1 47,248,719
2 40,162,559
3 37,799,456
4 33,948,454
...
1551 4,630,675
1552 4,623,030
1553 4,620,876
1554 4,607,385
1555 4,595,450
Name: Streams, Length: 1545, dtype: object
df["Streams"].map(lambda s: s.replace(",", ""))
0 48633449
1 47248719
2 40162559
3 37799456
4 33948454
...
1551 4630675
1552 4623030
1553 4620876
1554 4607385
1555 4595450
Name: Streams, Length: 1545, dtype: object
def somepoly(x,y):
    return x**2-3*x*y+y**3
Here we make the same somepoly function using a lambda function.
somepoly2 = lambda x,y: x**2-3*x*y+y**3
somepoly2(2,1)
-1
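A common place to see lambda functions is as the key argument of sorted. The sketch below sorts three of the stream-count strings from above by their numeric value, and also shows that a lambda function can have a default argument; the name somepoly3 is hypothetical.

```python
streams = ["4,630,675", "48,633,449", "37,799,456"]

# Sorting the strings directly compares them character by character
as_strings = sorted(streams)

# Sorting with a lambda key compares the underlying numbers
as_numbers = sorted(streams, key=lambda s: int(s.replace(",", "")))

print(as_strings)
print(as_numbers)

# A lambda function can also have default values
somepoly3 = lambda x, y=4: x**2 - 3*x*y + y**3
print(somepoly3(2), somepoly3(2, 1))
```

The lambda function is never given a name here; it is defined right where it is needed.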