Functions in Python

Announcement

  • My usual office hours are cancelled this week (and Week 7). They are replaced by Tuesday 3-4pm on Zoom.

Warm-up

(This part doesn’t have anything to do with functions, but we will use the Spotify dataset later.)

  • Import the attached Spotify dataset, specifying that missing values are denoted in this csv file by a blank space.

  • Drop all rows containing nan values using isna and any.

  • Check your answer: The DataFrame should have 1545 rows, and 9 of the 23 columns should have “object” as their dtype.

import pandas as pd
df = pd.read_csv("../data/spotify_dataset.csv")

At first it looks like there are no missing values in this dataset, even though the numeric-looking “Energy” column has dtype object:

df["Energy"]
0         0.8
1       0.764
2       0.664
3       0.897
4       0.704
        ...  
1551      0.7
1552     0.87
1553    0.523
1554     0.55
1555    0.603
Name: Energy, Length: 1556, dtype: object
df.isna().any(axis=0)
Index                        False
Highest Charting Position    False
Number of Times Charted      False
Week of Highest Charting     False
Song Name                    False
Streams                      False
Artist                       False
Artist Followers             False
Song ID                      False
Genre                        False
Release Date                 False
Weeks Charted                False
Popularity                   False
Danceability                 False
Energy                       False
Loudness                     False
Speechiness                  False
Acousticness                 False
Liveness                     False
Tempo                        False
Duration (ms)                False
Valence                      False
Chord                        False
dtype: bool

Here is an attempt to find the missing values. The entries are strings, so we try converting each one to a float.

df["Energy"]
0         0.8
1       0.764
2       0.664
3       0.897
4       0.704
        ...  
1551      0.7
1552     0.87
1553    0.523
1554     0.55
1555    0.603
Name: Energy, Length: 1556, dtype: object
[float(x) for x in df["Energy"]]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_49629/3029641965.py in <module>
----> 1 [float(x) for x in df["Energy"]]

/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_49629/3029641965.py in <listcomp>(.0)
----> 1 [float(x) for x in df["Energy"]]

ValueError: could not convert string to float: 

It is difficult to tell from that error message, but there is a blank space after the colon symbol : at the end. The blank spaces in this dataset represent the missing values.

df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")

Now there are many columns with missing values.

df.isna().any(axis=0)
Index                        False
Highest Charting Position    False
Number of Times Charted      False
Week of Highest Charting     False
Song Name                    False
Streams                      False
Artist                       False
Artist Followers              True
Song ID                       True
Genre                         True
Release Date                  True
Weeks Charted                False
Popularity                    True
Danceability                  True
Energy                        True
Loudness                      True
Speechiness                   True
Acousticness                  True
Liveness                      True
Tempo                         True
Duration (ms)                 True
Valence                       True
Chord                         True
dtype: bool
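
To see the effect of na_values in isolation, here is a minimal sketch on a tiny in-memory CSV. The column names and values are made up for illustration (they are not taken from the Spotify file), and io.StringIO stands in for the file on disk.

```python
import io
import pandas as pd

# Hypothetical three-row CSV; the second row has a blank space for Energy.
csv_text = "Song,Energy\nA,0.8\nB, \nC,0.55\n"

# Default settings: the blank space is kept as text, so nothing counts as missing.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values=" ", the blank space is treated as a missing value.
clean = pd.read_csv(io.StringIO(csv_text), na_values=" ")

print(raw["Energy"].isna().sum())    # 0
print(clean["Energy"].isna().sum())  # 1
print(clean["Energy"].dtype)         # float64
```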

Here is a Boolean Series indicating which rows contain missing values.

# Boolean Series
df.isna().any(axis=1)
0       False
1       False
2       False
3       False
4       False
        ...  
1551    False
1552    False
1553    False
1554    False
1555    False
Length: 1556, dtype: bool

The rows we want to keep will be the negation of the above Series.

~df.isna().any(axis=1)
0       True
1       True
2       True
3       True
4       True
        ... 
1551    True
1552    True
1553    True
1554    True
1555    True
Length: 1556, dtype: bool
# Boolean indexing
df[~df.isna().any(axis=1)]
Index Highest Charting Position Number of Times Charted Week of Highest Charting Song Name Streams Artist Artist Followers Song ID Genre ... Danceability Energy Loudness Speechiness Acousticness Liveness Tempo Duration (ms) Valence Chord
0 1 1 8 2021-07-23--2021-07-30 Beggin' 48,633,449 Måneskin 3377762.0 3Wrjm47oTz2sjIgck11l5e ['indie rock italiano', 'italian pop'] ... 0.714 0.800 -4.808 0.0504 0.12700 0.3590 134.002 211560.0 0.589 B
1 2 2 3 2021-07-23--2021-07-30 STAY (with Justin Bieber) 47,248,719 The Kid LAROI 2230022.0 5HCyWlXZPP0y6Gqq8TgA20 ['australian hip hop'] ... 0.591 0.764 -5.484 0.0483 0.03830 0.1030 169.928 141806.0 0.478 C#/Db
2 3 1 11 2021-06-25--2021-07-02 good 4 u 40,162,559 Olivia Rodrigo 6266514.0 4ZtFanR9U6ndgddUvNcjcG ['pop'] ... 0.563 0.664 -5.044 0.1540 0.33500 0.0849 166.928 178147.0 0.688 A
3 4 3 5 2021-07-02--2021-07-09 Bad Habits 37,799,456 Ed Sheeran 83293380.0 6PQ88X9TkUIAUIZJHW2upE ['pop', 'uk pop'] ... 0.808 0.897 -3.712 0.0348 0.04690 0.3640 126.026 231041.0 0.591 B
4 5 5 1 2021-07-23--2021-07-30 INDUSTRY BABY (feat. Jack Harlow) 33,948,454 Lil Nas X 5473565.0 27NovPIUIRrOZoCHxABJwK ['lgbtq+ hip hop', 'pop rap'] ... 0.736 0.704 -7.409 0.0615 0.02030 0.0501 149.995 212000.0 0.894 D#/Eb
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1551 1552 195 1 2019-12-27--2020-01-03 New Rules 4,630,675 Dua Lipa 27167675.0 2ekn2ttSfGqwhhate0LSR0 ['dance pop', 'pop', 'uk pop'] ... 0.762 0.700 -6.021 0.0694 0.00261 0.1530 116.073 209320.0 0.608 A
1552 1553 196 1 2019-12-27--2020-01-03 Cheirosa - Ao Vivo 4,623,030 Jorge & Mateus 15019109.0 2PWjKmjyTZeDpmOUa3a5da ['sertanejo', 'sertanejo universitario'] ... 0.528 0.870 -3.123 0.0851 0.24000 0.3330 152.370 181930.0 0.714 B
1553 1554 197 1 2019-12-27--2020-01-03 Havana (feat. Young Thug) 4,620,876 Camila Cabello 22698747.0 1rfofaqEpACxVEHIZBJe6W ['dance pop', 'electropop', 'pop', 'post-teen ... ... 0.765 0.523 -4.333 0.0300 0.18400 0.1320 104.988 217307.0 0.394 D
1554 1555 198 1 2019-12-27--2020-01-03 Surtada - Remix Brega Funk 4,607,385 Dadá Boladão, Tati Zaqui, OIK 208630.0 5F8ffc8KWKNawllr5WsW0r ['brega funk', 'funk carioca'] ... 0.832 0.550 -7.026 0.0587 0.24900 0.1820 154.064 152784.0 0.881 F
1555 1556 199 1 2019-12-27--2020-01-03 Lover (Remix) [feat. Shawn Mendes] 4,595,450 Taylor Swift 42227614.0 3i9UVldZOE0aD0JnyfAZZ0 ['pop', 'post-teen pop'] ... 0.448 0.603 -7.176 0.0640 0.43300 0.0862 205.272 221307.0 0.422 G

1545 rows × 23 columns

The .copy() is not so important for us, but it can help to prevent warning messages (such as pandas’ SettingWithCopyWarning) when we add a new column to df later.

# Redefine df
df = df[~df.isna().any(axis=1)].copy()
df.dtypes
Index                          int64
Highest Charting Position      int64
Number of Times Charted        int64
Week of Highest Charting      object
Song Name                     object
Streams                       object
Artist                        object
Artist Followers             float64
Song ID                       object
Genre                         object
Release Date                  object
Weeks Charted                 object
Popularity                   float64
Danceability                 float64
Energy                       float64
Loudness                     float64
Speechiness                  float64
Acousticness                 float64
Liveness                     float64
Tempo                        float64
Duration (ms)                float64
Valence                      float64
Chord                         object
dtype: object
df.dtypes.value_counts()
float64    11
object      9
int64       3
dtype: int64
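
As a cross-check on the Boolean indexing approach, the sketch below compares it with the pandas method dropna on a small made-up DataFrame (the name toy and its values are assumptions for illustration). By default, dropna removes every row that contains at least one missing value, so the two results should agree.

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame with one missing value in the second row.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", "z"]})

# Boolean indexing, as in the warm-up.
kept = toy[~toy.isna().any(axis=1)].copy()

# Built-in shortcut: drop rows containing any missing value.
kept2 = toy.dropna()

print(kept.equals(kept2))  # True
print(len(kept))           # 2
```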

Defining a function

We will introduce how to define functions in Python using various examples.

  • Write a function somepoly which takes as input two numbers x and y and as output returns $x^2 - 3xy + y^3$.

Our first attempt is wrong because of the 3xy. We need to make the multiplication explicit.

def somepoly(x,y):
    z = x**2-3xy+y**3
    return z
  File "/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_49629/799430308.py", line 2
    z = x**2-3xy+y**3
               ^
SyntaxError: invalid syntax

The following is still not correct.

def somepoly(x,y):
    z = x**2-3*xy+y**3
    return z

The error message we get is helpful. It says that xy is not defined. Python can’t predict that this is supposed to mean x*y.

somepoly(2,1)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_49629/2207245513.py in <module>
----> 1 somepoly(2,1)

/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_49629/3686898151.py in somepoly(x, y)
      1 def somepoly(x,y):
----> 2     z = x**2-3*xy+y**3
      3     return z

NameError: name 'xy' is not defined

Here is a correct version. Notice how all the lines inside the function’s code get indented.

def somepoly(x,y):
    z = x**2-3*x*y+y**3
    return z
somepoly(2,1)
-1

The code can be consolidated into a single line.

def somepoly(x,y):
    return x**2-3*x*y+y**3
somepoly(2,1)
-1

Notice that x and y are assigned the values 2 and 1 only locally, inside the function, while the function is running; we can’t access those values out here.

x
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_49629/32546335.py in <module>
----> 1 x

NameError: name 'x' is not defined
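
A related sketch of the same scoping rule: if a global variable named x does exist, calling the function does not change it. The global value 100 below is an arbitrary choice for illustration.

```python
x = 100  # a global variable that happens to share the argument's name

def somepoly(x, y):
    # Inside the function, x and y refer to the arguments, not the global x.
    return x**2 - 3*x*y + y**3

value = somepoly(2, 1)

print(value)  # -1
print(x)      # 100, unchanged by the function call
```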
  • Edit the code for somepoly so that if the y value is not given, the default value of y=4 is used.

def somepoly(x,y=4):
    return x**2-3*x*y+y**3
somepoly(2)
44
somepoly(2,4)
44
somepoly(2,1)
-1
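
Arguments can also be passed by name, which is how default values are usually overridden in practice. The following is a small sketch using the same somepoly; the variable names a, b, c are only for illustration.

```python
def somepoly(x, y=4):
    return x**2 - 3*x*y + y**3

a = somepoly(2)         # y takes its default value 4
b = somepoly(2, y=1)    # y passed by keyword
c = somepoly(y=1, x=2)  # keyword arguments can be given in any order

print(a, b, c)  # 44 -1 -1
```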

These sorts of default values come up all the time. Here is one for the built-in sorted function.

help(sorted)
Help on built-in function sorted in module builtins:

sorted(iterable, /, *, key=None, reverse=False)
    Return a new list containing all items from the iterable in ascending order.
    
    A custom key function can be supplied to customize the sort order, and the
    reverse flag can be set to request the result in descending order.
sorted([3,1,4])
[1, 3, 4]

The following is the same as the default.

sorted([3,1,4], reverse=False)
[1, 3, 4]

Here we reverse the order.

sorted([3,1,4], reverse=True)
[4, 3, 1]
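
The other keyword argument in the help text, key, also has a default value (None). As a brief sketch, passing the built-in len as the key sorts strings by length; the word list is made up for illustration.

```python
words = ["banana", "fig", "pear"]

# Sort by length instead of alphabetically.
by_length = sorted(words, key=len)

# Both keyword arguments can be combined.
by_length_desc = sorted(words, key=len, reverse=True)

print(by_length)       # ['fig', 'pear', 'banana']
print(by_length_desc)  # ['banana', 'pear', 'fig']
```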
  • Write a function backwards which takes as input a list and as output returns the same list, but in the reverse order. For example, if the input is [3,1,4], then the output should be [4,1,3]. (Warning: this has nothing to do with the reverse above… it’s just a coincidence that both involve the word reverse.)

This is a Pythonic solution.

# slicing
def backwards(x):
    return x[::-1]
backwards([3,1,4])
[4, 1, 3]

Here is a solution using list comprehension. The slicing solution above is much easier to read, so it is the better choice.

def backwards2(x):
    return [x[-i] for i in range(1,len(x)+1)]
backwards2([3,1,4])
[4, 1, 3]

Here are more examples with slicing.

y = [2,1,2,4,5,6,7,1,10]
y[2:6:1]
[2, 4, 5, 6]
y[6:2:-1]
[7, 6, 5, 4]
y[:2:-1]
[10, 1, 7, 6, 5, 4]
y[::-2]
[10, 7, 5, 2, 2]
y[:0:-2]
[10, 7, 5, 2]
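
One way to read y[start:stop:step] is as the elements at the indices produced by range(start, stop, step). The sketch below checks this for two of the slices above; slice_by_range is a hypothetical helper written only for this comparison, and it assumes start and stop are nonnegative and within range.

```python
y = [2, 1, 2, 4, 5, 6, 7, 1, 10]

def slice_by_range(seq, start, stop, step):
    # Hypothetical helper: spell out seq[start:stop:step] index by index.
    return [seq[i] for i in range(start, stop, step)]

print(slice_by_range(y, 2, 6, 1) == y[2:6:1])    # True
print(slice_by_range(y, 6, 2, -1) == y[6:2:-1])  # True

# Slicing builds a new list; the original list is not modified.
z = y[::-1]
print(z)  # [10, 1, 7, 6, 5, 4, 2, 1, 2]
print(y)  # [2, 1, 2, 4, 5, 6, 7, 1, 10]
```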
  • Define a function top5 which takes as input a number and as output returns True if the number is greater than 0 and less than or equal to 5, and otherwise returns False.

The main new component here is using and. In basic Python, one usually uses and, or, not. In pandas and NumPy, one usually uses &, |, ~. (In fact, we used ~ above when we negated our isna Boolean Series.)

def top5(x):
    return (x > 0) and (x <= 5)
top5(3.14)
True
top5(7)
False
True and True
True
True and False
False
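
To illustrate the difference between and and &, here is a small sketch on a made-up pandas Series. The elementwise operator & works, while and raises a ValueError because pandas cannot reduce a whole Series to a single True or False.

```python
import pandas as pd

s = pd.Series([1, 3, 7, 0])

# Elementwise "and" for pandas: parentheses around each comparison are needed.
mask = (s > 0) & (s <= 5)
print(mask.tolist())  # [True, True, False, False]

# The plain Python keyword does not work elementwise.
try:
    (s > 0) and (s <= 5)
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```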
  • Write a function remove_comma that takes as input a string and as output returns the same string with all commas removed. (Hint. Every string has a replace method. Use help to learn how to use it.)

Here is an example of using the replace method of a string.

s = "chris"
help(s.replace)
Help on built-in function replace:

replace(old, new, count=-1, /) method of builtins.str instance
    Return a copy with all occurrences of substring old replaced by new.
    
      count
        Maximum number of occurrences to replace.
        -1 (the default value) means replace all occurrences.
    
    If the optional argument count is given, only the first count occurrences are
    replaced.
"christopher".replace("h","Math 10")
'cMath 10ristopMath 10er'
s = "1,234"
s.replace(",", "")
'1234'

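Does replace work with integers and not only strings? No. Integers do not have a replace method, and here Python does not even get that far: it reads 12345. as the start of a decimal number, which is why the error is a SyntaxError.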

12345.replace(2,7)
  File "/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_49629/2769868973.py", line 1
    12345.replace(2,7)
                ^
SyntaxError: invalid syntax

Now that we understand how replace works, we can write a function using it.

def remove_comma(s):
    return s.replace(",", "")
remove_comma("abc,def,ghi")
'abcdefghi'
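
A few more uses of replace, sketched with made-up inputs: the optional count argument from the help text, converting the cleaned string to an integer, and handling an integer by converting it to a string first.

```python
def remove_comma(s):
    return s.replace(",", "")

# count=1 replaces only the first comma.
first_only = "1,234,567".replace(",", "", 1)

# Once the commas are gone, int can convert the string.
as_int = int(remove_comma("48,633,449"))

# Integers have no replace method, but their string form does.
digits = str(12345).replace("2", "7")

print(first_only)  # 1234,567
print(as_int)      # 48633449
print(digits)      # 17345
```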

Using functions with pandas Series

To apply a function to every value in a pandas Series, we will use the map method. Try not to confuse this map method with apply and applymap which we will introduce later in Math 10.

  • Make a Boolean Series from the “Highest Charting Position” column which indicates whether or not the “Highest Charting Position” was in the top 5. Use our usual pandas methods for making Boolean Series.

df["Highest Charting Position"] <= 5
0        True
1        True
2        True
3        True
4        True
        ...  
1551    False
1552    False
1553    False
1554    False
1555    False
Name: Highest Charting Position, Length: 1545, dtype: bool
  • Make the same Boolean Series using map and the function top5 that was defined above.

top5(196)
False
df["Highest Charting Position"]
0         1
1         2
2         1
3         3
4         5
       ... 
1551    195
1552    196
1553    197
1554    198
1555    199
Name: Highest Charting Position, Length: 1545, dtype: int64

We use the map method to apply the top5 function elementwise. Notice that we do not put parentheses after the function name.

# map applies elementwise
df["Highest Charting Position"].map(top5)
0        True
1        True
2        True
3        True
4        True
        ...  
1551    False
1552    False
1553    False
1554    False
1555    False
Name: Highest Charting Position, Length: 1545, dtype: bool
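
As a sanity check that map and the usual Boolean Series approach agree, here is a sketch on a small made-up Series (not the Spotify column). The vectorized version uses & because it acts on the whole Series at once, while top5 is called on one number at a time.

```python
import pandas as pd

def top5(x):
    return (x > 0) and (x <= 5)

s = pd.Series([1, 2, 5, 6, 199])

via_map = s.map(top5)            # calls top5 once per element
vectorized = (s > 0) & (s <= 5)  # acts on the whole Series

print(via_map.tolist() == vectorized.tolist())  # True
```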
  • Try to convert the “Streams” column into numeric values using pd.to_numeric. (It’s not supposed to work.)

df["Streams"]
0       48,633,449
1       47,248,719
2       40,162,559
3       37,799,456
4       33,948,454
           ...    
1551     4,630,675
1552     4,623,030
1553     4,620,876
1554     4,607,385
1555     4,595,450
Name: Streams, Length: 1545, dtype: object
pd.to_numeric(df["Streams"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "48,633,449"

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_49629/2129868835.py in <module>
----> 1 pd.to_numeric(df["Streams"])

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/tools/numeric.py in to_numeric(arg, errors, downcast)
    182         try:
    183             values, _ = lib.maybe_convert_numeric(
--> 184                 values, set(), coerce_numeric=coerce_numeric
    185             )
    186         except (ValueError, TypeError):

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "48,633,449" at position 0
int("48,633,449")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_49629/1792345402.py in <module>
----> 1 int("48,633,449")

ValueError: invalid literal for int() with base 10: '48,633,449'
remove_comma("48,633,449")
'48633449'
  • Instead use map and the remove_comma function from above, followed by pd.to_numeric.

df["Streams"].map(remove_comma)
0       48633449
1       47248719
2       40162559
3       37799456
4       33948454
          ...   
1551     4630675
1552     4623030
1553     4620876
1554     4607385
1555     4595450
Name: Streams, Length: 1545, dtype: object
pd.to_numeric(df["Streams"].map(remove_comma))
0       48633449
1       47248719
2       40162559
3       37799456
4       33948454
          ...   
1551     4630675
1552     4623030
1553     4620876
1554     4607385
1555     4595450
Name: Streams, Length: 1545, dtype: int64

Here we put the result into a new column of df.

df["StreamsNumeric"] = pd.to_numeric(df["Streams"].map(remove_comma))
  • How many of the songs have more than ten million streams? (Hint. The answer should be 103.)

We can’t use the “Streams” column because it contains strings.

df["Streams"] > 10**7
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_49629/33243062.py in <module>
----> 1 df["Streams"] > 10**7

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/ops/common.py in new_method(self, other)
     67         other = item_from_zerodim(other)
     68 
---> 69         return method(self, other)
     70 
     71     return new_method

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/arraylike.py in __gt__(self, other)
     46     @unpack_zerodim_and_defer("__gt__")
     47     def __gt__(self, other):
---> 48         return self._cmp_method(other, operator.gt)
     49 
     50     @unpack_zerodim_and_defer("__ge__")

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/series.py in _cmp_method(self, other, op)
   5500 
   5501         with np.errstate(all="ignore"):
-> 5502             res_values = ops.comparison_op(lvalues, rvalues, op)
   5503 
   5504         return self._construct_result(res_values, name=res_name)

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in comparison_op(left, right, op)
    282 
    283     elif is_object_dtype(lvalues.dtype) or isinstance(rvalues, str):
--> 284         res_values = comp_method_OBJECT_ARRAY(op, lvalues, rvalues)
    285 
    286     else:

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in comp_method_OBJECT_ARRAY(op, x, y)
     71         result = libops.vec_compare(x.ravel(), y.ravel(), op)
     72     else:
---> 73         result = libops.scalar_compare(x.ravel(), y, op)
     74     return result.reshape(x.shape)
     75 

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/ops.pyx in pandas._libs.ops.scalar_compare()

TypeError: '>' not supported between instances of 'str' and 'int'

It does work if we use the numeric column.

df["StreamsNumeric"] > 10**7
0        True
1        True
2        True
3        True
4        True
        ...  
1551    False
1552    False
1553    False
1554    False
1555    False
Name: StreamsNumeric, Length: 1545, dtype: bool

We count True using sum.

(df["StreamsNumeric"] > 10**7).sum()
103

We could also count True using value_counts.

(df["StreamsNumeric"] > 10**7).value_counts()
False    1442
True      103
Name: StreamsNumeric, dtype: int64
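
Counting with sum works because True is treated as 1 and False as 0. The sketch below shows this on a small made-up Series of stream counts; mean then gives the proportion of True values.

```python
import pandas as pd

# Hypothetical stream counts, not taken from the dataset.
counts = pd.Series([12_000_000, 9_000_000, 25_000_000, 4_600_000])
mask = counts > 10**7

n_true = mask.sum()      # number of True values
frac_true = mask.mean()  # proportion of True values

print(n_true)     # 2
print(frac_true)  # 0.5
print(sum([True, False, True]))  # 2, the same idea with a plain Python list
```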

lambda functions

lambda functions provide a concise (and Pythonic) way to define a function in a single expression, without using def. They are not suited to complex functions, but they are frequently used to define simple ones, especially when the function is only needed once, such as an argument to map. These are the equivalent of anonymous functions in MATLAB.

  • Again convert the “Streams” column to numeric, this time using a lambda function.

df["Streams"]
0       48,633,449
1       47,248,719
2       40,162,559
3       37,799,456
4       33,948,454
           ...    
1551     4,630,675
1552     4,623,030
1553     4,620,876
1554     4,607,385
1555     4,595,450
Name: Streams, Length: 1545, dtype: object
df["Streams"].map(lambda s: s.replace(",", ""))
0       48633449
1       47248719
2       40162559
3       37799456
4       33948454
          ...   
1551     4630675
1552     4623030
1553     4620876
1554     4607385
1555     4595450
Name: Streams, Length: 1545, dtype: object
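
The result above still contains strings (dtype object). To finish the conversion to numeric, one option is to wrap the expression in pd.to_numeric as before; another is to call int inside the lambda function. The following is a sketch on a two-entry made-up Series, not the full column.

```python
import pandas as pd

streams = pd.Series(["48,633,449", "4,595,450"])

# Option 1: remove commas with a lambda function, then convert.
numeric = pd.to_numeric(streams.map(lambda s: s.replace(",", "")))

# Option 2: do the conversion inside the lambda function.
numeric2 = streams.map(lambda s: int(s.replace(",", "")))

print(numeric.tolist())   # [48633449, 4595450]
print(numeric2.tolist())  # [48633449, 4595450]
```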
def somepoly(x,y):
    return x**2-3*x*y+y**3

Here we make the same somepoly function using a lambda function.

somepoly2 = lambda x,y: x**2-3*x*y+y**3
somepoly2(2,1)
-1
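
lambda functions are most useful when passed directly as an argument, without ever being given a name. Here is a sketch with two made-up examples: a key for sorted, and a version of top5 written inline for map.

```python
import pandas as pd

# Sort pairs by their second entry.
pairs = [("a", 3), ("b", 1), ("c", 4)]
by_second = sorted(pairs, key=lambda p: p[1])
print(by_second)  # [('b', 1), ('a', 3), ('c', 4)]

# The top5 check as a lambda function, using a chained comparison.
s = pd.Series([1, 5, 6])
in_top5 = s.map(lambda x: 0 < x <= 5)
print(in_top5.tolist())  # [True, True, False]
```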