Working with dates in pandas

The first part is about how to find missing values (for example, how to know that they are represented by a blank space in the Spotify dataset). The second part is about working with dates in pandas.

count

lists and strings both have a count method.

my_list = [3,4,3135]
my_list.count(3)
1
s = "Hello there, how are you?"
s.count("e")
4
s.count("re")
2
s.count("er")
1

How to identify na_values?

The missing values in the spotify dataset are denoted by a blank space ” “. If you didn’t know to look for ” “, how could you identify it? Here is one approach.

  • Import the attached Spotify dataset as df. Do not specify na_values.

  • Most of the cells in the “Release Date” column contain a string that looks like “2017-12-08”. Using list comprehension and the count method from above, make a list shortlist containing all the entries which have fewer than two hyphen (minus) symbols.

  • Do the same thing using a for loop and an if statement.

  • Check: shortlist should have 28 entries.

  • One at a time, print the elements with fewer than two hyphen symbols. (Use a for loop for this part.)

import pandas as pd
df = pd.read_csv("../data/spotify_dataset.csv")
df.loc[:10,"Release Date"]
0     2017-12-08
1     2021-07-09
2     2021-05-21
3     2021-06-25
4     2021-07-23
5     2021-03-31
6     2021-04-09
7     2021-05-20
8     2021-06-04
9     2021-03-19
10    2020-10-01
Name: Release Date, dtype: object
shortlist = [s for s in df["Release Date"] if s.count("-") < 2]
shortlist
[' ',
 '2014',
 ' ',
 '1995',
 ' ',
 '2015',
 ' ',
 '1995',
 '1963',
 ' ',
 '1962',
 '1963',
 ' ',
 '2002',
 '1990',
 '1981',
 '1957-09',
 '1957-09',
 '1994',
 ' ',
 ' ',
 ' ',
 '2012',
 '1981',
 ' ',
 '2016',
 '2016',
 ' ']

Here is an alternative approach.

altlist = [s for s in df["Release Date"] if len(s) < 8]
altlist
[' ',
 '2014',
 ' ',
 '1995',
 ' ',
 '2015',
 ' ',
 '1995',
 '1963',
 ' ',
 '1962',
 '1963',
 ' ',
 '2002',
 '1990',
 '1981',
 '1957-09',
 '1957-09',
 '1994',
 ' ',
 ' ',
 ' ',
 '2012',
 '1981',
 ' ',
 '2016',
 '2016',
 ' ']
len(shortlist)
28

Compare the following for loop version with the list comprehension version.

[s for s in df["Release Date"] if s.count("-") < 2]
shortlist2 = []
for s in df["Release Date"]:
    if s.count("-") < 2:
        shortlist2.append(s)
shortlist2
[' ',
 '2014',
 ' ',
 '1995',
 ' ',
 '2015',
 ' ',
 '1995',
 '1963',
 ' ',
 '1962',
 '1963',
 ' ',
 '2002',
 '1990',
 '1981',
 '1957-09',
 '1957-09',
 '1994',
 ' ',
 ' ',
 ' ',
 '2012',
 '1981',
 ' ',
 '2016',
 '2016',
 ' ']

Here we look at all of the values in shortlist. You should not put print statements in a list comprehension, so we use a for loop.

for s in shortlist:
    print(s)
 
2014
 
1995
 
2015
 
1995
1963
 
1962
1963
 
2002
1990
1981
1957-09
1957-09
1994
 
 
 
2012
1981
 
2016
2016
 

There is a pandas function pd.to_datetime that can be applied to both a string and also to a pandas Series.

d = pd.to_datetime("2022-04-13")
d
Timestamp('2022-04-13 00:00:00')

This new data type has lots of convenient methods and attributes.

d.month
4
d.day_name()
'Wednesday'

Notice the dtype of the following pandas Series.

pd.to_datetime(df.loc[:10,"Release Date"])
0    2017-12-08
1    2021-07-09
2    2021-05-21
3    2021-06-25
4    2021-07-23
5    2021-03-31
6    2021-04-09
7    2021-05-20
8    2021-06-04
9    2021-03-19
10   2020-10-01
Name: Release Date, dtype: datetime64[ns]

What if we want to convert these into the day names? The first thing we might try does not work.

pd.to_datetime(df.loc[:10,"Release Date"]).day_name()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_38287/3366743804.py in <module>
----> 1 pd.to_datetime(df.loc[:10,"Release Date"]).day_name()

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5485         ):
   5486             return self[name]
-> 5487         return object.__getattribute__(self, name)
   5488 
   5489     def __setattr__(self, name: str, value) -> None:

AttributeError: 'Series' object has no attribute 'day_name'

The trick is to use the dt accessor. Here is the documentation, although I don’t find this particular documentation very helpful.

pd.to_datetime(df.loc[:10,"Release Date"]).dt.day_name()
0        Friday
1        Friday
2        Friday
3        Friday
4        Friday
5     Wednesday
6        Friday
7      Thursday
8        Friday
9        Friday
10     Thursday
Name: Release Date, dtype: object

If we try to apply pd.to_datetime to the entire column, an error will be raised. This is because of the blank spaces in this column.

pd.to_datetime(df["Release Date"])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, allow_mixed)
   2191         try:
-> 2192             values, tz_parsed = conversion.datetime_to_datetime64(data.ravel("K"))
   2193             # If tzaware, these values represent unix timestamps, so we

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.datetime_to_datetime64()

TypeError: Unrecognized value type: <class 'str'>

During handling of the above exception, another exception occurred:

ParserError                               Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_38287/4002128276.py in <module>
----> 1 pd.to_datetime(df["Release Date"])

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/tools/datetimes.py in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
    881                 result = result.tz_localize(tz)  # type: ignore[call-arg]
    882     elif isinstance(arg, ABCSeries):
--> 883         cache_array = _maybe_cache(arg, format, cache, convert_listlike)
    884         if not cache_array.empty:
    885             result = arg.map(cache_array)

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/tools/datetimes.py in _maybe_cache(arg, format, cache, convert_listlike)
    193         unique_dates = unique(arg)
    194         if len(unique_dates) < len(arg):
--> 195             cache_dates = convert_listlike(unique_dates, format)
    196             cache_array = Series(cache_dates, index=unique_dates)
    197             # GH#39882 and GH#35888 in case of None and NaT we get duplicates

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/tools/datetimes.py in _convert_listlike_datetimes(arg, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
    406         errors=errors,
    407         require_iso8601=require_iso8601,
--> 408         allow_object=True,
    409     )
    410 

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, allow_mixed)
   2196             return values.view("i8"), tz_parsed
   2197         except (ValueError, TypeError):
-> 2198             raise err
   2199 
   2200     if tz_parsed is not None:

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, allow_mixed)
   2185             yearfirst=yearfirst,
   2186             require_iso8601=require_iso8601,
-> 2187             allow_mixed=allow_mixed,
   2188         )
   2189         result = result.reshape(data.shape, order=order)

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime()

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime()

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/tslib.pyx in pandas._libs.tslib._array_to_datetime_object()

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/tslib.pyx in pandas._libs.tslib._array_to_datetime_object()

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/tslibs/parsing.pyx in pandas._libs.tslibs.parsing.parse_datetime_string()

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/dateutil/parser/_parser.py in parse(timestr, parserinfo, **kwargs)
   1366         return parser(parserinfo).parse(timestr, **kwargs)
   1367     else:
-> 1368         return DEFAULTPARSER.parse(timestr, **kwargs)
   1369 
   1370 

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/dateutil/parser/_parser.py in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
    644 
    645         if len(res) == 0:
--> 646             raise ParserError("String does not contain a date: %s", timestr)
    647 
    648         try:

ParserError: String does not contain a date:  
shortlist
[' ',
 '2014',
 ' ',
 '1995',
 ' ',
 '2015',
 ' ',
 '1995',
 '1963',
 ' ',
 '1962',
 '1963',
 ' ',
 '2002',
 '1990',
 '1981',
 '1957-09',
 '1957-09',
 '1994',
 ' ',
 ' ',
 ' ',
 '2012',
 '1981',
 ' ',
 '2016',
 '2016',
 ' ']
  • Give an example of an element x in shortlist for which pd.to_datetime(x) raises an error, and an example of an element y in shortlist for which pd.to_datetime(y) does not raise an error.

pd.to_datetime(" ")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, allow_mixed)
   2191         try:
-> 2192             values, tz_parsed = conversion.datetime_to_datetime64(data.ravel("K"))
   2193             # If tzaware, these values represent unix timestamps, so we

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.datetime_to_datetime64()

TypeError: Unrecognized value type: <class 'str'>

During handling of the above exception, another exception occurred:

ParserError                               Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_38287/4212689216.py in <module>
----> 1 pd.to_datetime(" ")

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/tools/datetimes.py in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
    912             result = convert_listlike(arg, format)
    913     else:
--> 914         result = convert_listlike(np.array([arg]), format)[0]
    915 
    916     #  error: Incompatible return value type (got "Union[Timestamp, NaTType,

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/tools/datetimes.py in _convert_listlike_datetimes(arg, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
    406         errors=errors,
    407         require_iso8601=require_iso8601,
--> 408         allow_object=True,
    409     )
    410 

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, allow_mixed)
   2196             return values.view("i8"), tz_parsed
   2197         except (ValueError, TypeError):
-> 2198             raise err
   2199 
   2200     if tz_parsed is not None:

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, allow_mixed)
   2185             yearfirst=yearfirst,
   2186             require_iso8601=require_iso8601,
-> 2187             allow_mixed=allow_mixed,
   2188         )
   2189         result = result.reshape(data.shape, order=order)

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime()

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime()

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/tslib.pyx in pandas._libs.tslib._array_to_datetime_object()

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/tslib.pyx in pandas._libs.tslib._array_to_datetime_object()

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/tslibs/parsing.pyx in pandas._libs.tslibs.parsing.parse_datetime_string()

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/dateutil/parser/_parser.py in parse(timestr, parserinfo, **kwargs)
   1366         return parser(parserinfo).parse(timestr, **kwargs)
   1367     else:
-> 1368         return DEFAULTPARSER.parse(timestr, **kwargs)
   1369 
   1370 

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/dateutil/parser/_parser.py in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
    644 
    645         if len(res) == 0:
--> 646             raise ParserError("String does not contain a date: %s", timestr)
    647 
    648         try:

ParserError: String does not contain a date:  
pd.to_datetime("1957-09")
Timestamp('1957-09-01 00:00:00')

try and except

We would like to find the “bad” cells in df["Release Date"] using something like the following pseudocode:

badlist = []
for x in df["Release Date"]:
    if pd.to_datetime(x) raises an error:
        badlist.append(x)

The correct way to do that does not involve an if statement. Instead it involves try and except. Here we also put the “good” values into their own list.

badlist = []
goodlist = []
for x in df["Release Date"]:
    try:
        pd.to_datetime(x)
        goodlist.append(x)
    except:
        badlist.append(x)
goodlist
['2017-12-08',
 '2021-07-09',
 '2021-05-21',
 '2021-06-25',
 '2021-07-23',
 '2021-03-31',
 '2021-04-09',
 '2021-05-20',
 '2021-06-04',
 '2021-03-19',
 '2020-10-01',
 '2021-05-28',
 '2021-07-09',
 '2021-03-19',
 '2021-06-04',
 '2021-05-21',
 '2021-05-21',
 '2021-04-23',
 '2021-06-24',
 '2021-06-25',
 '2021-07-08',
 '2021-06-25',
 '2021-05-21',
 '2020-08-07',
 '2020-03-20',
 '2021-05-21',
 '2021-06-24',
 '2021-07-09',
 '2021-01-15',
 '2021-01-06',
 '2021-03-05',
 '2021-07-23',
 '2021-05-21',
 '2021-04-16',
 '2019-12-13',
 '2021-04-09',
 '2021-06-25',
 '2021-07-09',
 '2020-11-27',
 '2021-04-29',
 '2021-05-21',
 '2021-06-18',
 '2020-11-20',
 '2020-09-16',
 '2021-05-20',
 '2021-03-26',
 '2021-07-08',
 '2021-06-15',
 '2021-05-14',
 '2021-06-30',
 '2021-04-01',
 '2021-07-30',
 '2021-06-25',
 '2021-05-21',
 '2021-05-15',
 '2021-06-01',
 '2021-02-14',
 '2021-06-11',
 '2021-05-05',
 '2021-06-11',
 '2021-01-15',
 '2021-05-21',
 '2021-04-23',
 '2020-03-27',
 '2021-08-13',
 '2018-02-02',
 '2013-04-19',
 '2021-07-08',
 '2019-05-17',
 '2021-07-23',
 '2021-06-02',
 '2021-03-26',
 '2021-05-13',
 '2017-12-14',
 '2021-03-19',
 '2020-12-04',
 '2021-04-23',
 '2021-02-04',
 '2019-10-17',
 '2020-11-06',
 '2021-03-05',
 '2021-06-10',
 '2021-08-13',
 '2021-01-15',
 '2020-07-03',
 '2021-02-26',
 '2019-06-03',
 '2021-07-22',
 '2019-03-07',
 '2021-02-05',
 '2020-10-30',
 '2019-09-06',
 '2021-07-16',
 '2021-04-20',
 '2021-08-12',
 '2021-06-25',
 '2021-04-27',
 '2021-07-08',
 '2019-11-22',
 '2020-10-01',
 '2021-06-04',
 '2021-06-16',
 '2017-06-23',
 '2020-08-14',
 '2010-10-05',
 '2020-11-20',
 '2021-04-15',
 '2021-06-03',
 '2021-07-09',
 '2018-04-19',
 '2018-12-14',
 '2019-05-24',
 '2019-11-07',
 '2021-06-04',
 '2021-06-17',
 '2017-03-03',
 '2019-06-21',
 '2021-08-13',
 '2017-03-03',
 '2021-06-21',
 '2020-11-27',
 '2020-04-17',
 '2020-03-20',
 '2020-04-17',
 '2014-01-01',
 '2014-09-09',
 '1991-09-26',
 '2021-05-21',
 '2015-10-30',
 '2021-06-01',
 '2021-05-07',
 '2021-06-11',
 '2013-09-17',
 '2021-04-30',
 '2014-09-19',
 '2021-01-08',
 '2021-06-19',
 '2021-02-05',
 '2020-11-05',
 '2021-07-21',
 '1975-11-21',
 '2020-12-10',
 '2020-10-30',
 '2021-01-15',
 '2021-01-08',
 '2021-07-14',
 '2018-10-05',
 '2021-05-28',
 '2021-04-29',
 '2014-01-01',
 '2000-07-10',
 '2021-07-13',
 '2021-06-01',
 '2013-01-01',
 '2021-02-26',
 '2021-08-13',
 '2020-07-03',
 '2014',
 '2021-06-25',
 '2021-06-18',
 '2018-12-10',
 '2021-04-02',
 '2021-08-13',
 '2021-02-12',
 '2020-07-03',
 '2021-05-21',
 '2020-08-07',
 '2004-06-15',
 '1983-06-17',
 '1977-02-04',
 '2016-10-28',
 '2021-07-22',
 '2021-07-16',
 '2021-07-16',
 '2020-02-08',
 '2021-05-21',
 '1980-06-27',
 '2016-05-06',
 '2019-12-13',
 '2021-06-10',
 '2021-06-04',
 '2021-06-25',
 '2013-08-30',
 '2020-05-15',
 '2019-07-26',
 '2021-01-07',
 '2015-08-28',
 '2020-12-25',
 '2021-07-23',
 '2021-07-06',
 '2019-02-08',
 '2020-12-04',
 '2021-02-25',
 '2019-12-19',
 '2021-07-18',
 '2021-06-04',
 '2021-07-22',
 '2000-05-23',
 '2021-07-31',
 '2021-07-31',
 '2021-07-16',
 '2021-07-31',
 '2021-07-31',
 '2021-06-18',
 '2021-07-31',
 '2021-07-31',
 '2021-06-25',
 '2021-06-24',
 '2021-04-22',
 '2021-07-31',
 '2021-01-28',
 '2021-06-17',
 '2019-07-26',
 '2021-08-12',
 '2021-07-16',
 '2021-07-02',
 '2021-04-09',
 '2021-05-13',
 '2021-03-12',
 '2021-02-11',
 '2020-11-20',
 '2021-07-09',
 '2021-03-12',
 '2021-07-09',
 '2021-05-21',
 '2021-05-26',
 '2021-06-11',
 '2021-07-09',
 '2021-07-09',
 '2021-03-18',
 '2021-03-27',
 '2021-03-05',
 '2021-06-25',
 '2016-07-29',
 '1995',
 '2021-07-01',
 '2021-06-25',
 '2021-07-02',
 '2021-06-25',
 '2021-06-10',
 '2021-04-30',
 '2021-04-23',
 '2021-06-10',
 '2021-07-30',
 '2021-07-01',
 '2020-08-20',
 '2021-01-29',
 '2020-11-20',
 '2011-11-11',
 '2020-10-22',
 '2021-05-13',
 '2021-06-09',
 '2013-09-09',
 '2021-06-25',
 '2021-06-25',
 '2021-06-25',
 '2021-06-25',
 '2021-06-25',
 '2021-06-25',
 '2021-06-25',
 '2021-06-25',
 '2021-06-25',
 '2021-06-25',
 '2021-06-25',
 '2021-06-25',
 '2021-06-25',
 '2021-06-11',
 '2021-06-25',
 '2021-06-25',
 '2021-06-11',
 '2020-07-24',
 '2020-06-11',
 '2021-06-18',
 '2020-07-31',
 '2021-06-17',
 '2021-06-04',
 '2021-05-14',
 '2021-05-14',
 '2017-08-25',
 '2021-06-11',
 '2019-12-06',
 '2021-01-21',
 '2012-10-09',
 '2021-08-13',
 '2020-09-25',
 '2010-10-05',
 '2019-10-04',
 '2020-03-27',
 '2020-08-14',
 '2021-05-13',
 '2021-02-03',
 '2000-10-24',
 '2014-12-09',
 '2018-08-03',
 '2017-04-07',
 '2021-04-16',
 '2021-05-14',
 '2021-05-21',
 '2021-06-11',
 '2021-06-11',
 '2021-06-11',
 '2020-09-18',
 '2021-06-11',
 '2020-11-02',
 '2021-06-11',
 '2021-06-11',
 '2021-08-13',
 '2020-11-27',
 '2021-04-09',
 '2021-08-13',
 '2021-05-21',
 '2021-06-11',
 '2020-05-22',
 '2021-06-04',
 '2021-07-16',
 '2021-06-04',
 '2021-05-31',
 '2021-05-14',
 '2021-05-06',
 '2021-04-16',
 '2020-10-08',
 '2019-12-13',
 '2021-03-19',
 '2021-08-13',
 '2021-01-15',
 '2021-05-28',
 '2020-12-18',
 '2018-12-10',
 '2021-05-28',
 '2019-04-12',
 '2021-01-22',
 '2020-01-17',
 '2020-02-07',
 '2011-01-01',
 '2020-09-18',
 '2021-04-02',
 '2021-05-14',
 '2021-06-04',
 '2021-04-01',
 '2020-11-06',
 '2021-05-21',
 '2021-05-21',
 '2021-05-14',
 '2021-05-14',
 '2021-01-21',
 '2021-03-10',
 '2021-04-30',
 '2021-05-14',
 '2021-03-05',
 '2021-04-30',
 '2020-02-14',
 '2021-03-10',
 '2021-04-02',
 '2021-05-21',
 '2021-03-13',
 '2021-05-21',
 '2021-05-14',
 '2021-05-14',
 '2021-05-14',
 '2021-05-14',
 '2021-05-14',
 '2021-05-14',
 '2021-04-30',
 '2021-04-28',
 '2021-03-11',
 '2012-04-30',
 '2019-12-13',
 '2020-12-15',
 '2020-11-20',
 '2021-08-09',
 '2020-10-16',
 '2005-02-22',
 '2019-10-17',
 '2021-04-30',
 '2021-04-22',
 '2021-08-13',
 '2021-05-04',
 '2021-03-10',
 '2020-11-13',
 '2020-07-17',
 '2020-01-10',
 '2021-05-13',
 '2020-07-10',
 '2021-02-08',
 '2021-01-22',
 '2021-03-01',
 '2020-07-10',
 '2019-03-08',
 '2021-04-22',
 '2020-08-25',
 '2021-04-30',
 '2021-04-29',
 '2020-07-17',
 '2021-04-30',
 '2021-05-21',
 '2021-03-12',
 '2021-04-09',
 '2021-08-13',
 '2021-04-09',
 '2021-04-02',
 '2020-11-06',
 '2021-04-02',
 '2020-09-23',
 '2021-03-19',
 '2021-04-23',
 '2021-01-15',
 '2020-04-24',
 '2020-10-15',
 '2016-10-28',
 '2020-11-20',
 '2020-03-20',
 '2018-03-16',
 '2021-03-26',
 '2020-08-21',
 '2012-12-07',
 '2021-08-12',
 '2015-05-15',
 '2021-04-16',
 '2021-04-16',
 '2021-04-09',
 '2020-11-12',
 '2021-02-19',
 '2020-12-11',
 '2013-09-09',
 '2021-08-13',
 '2019-06-28',
 '2021-04-09',
 '2021-04-09',
 '2007-06-12',
 '2021-04-09',
 '2021-04-09',
 '2021-03-26',
 '2021-04-09',
 '2021-04-09',
 '2021-04-09',
 '2021-04-09',
 '1998-05-01',
 '2021-04-09',
 '2021-04-09',
 '2021-04-09',
 '2021-08-09',
 '2020-09-04',
 '2021-04-09',
 '2021-04-09',
 '2021-04-01',
 '2021-03-26',
 '2021-03-12',
 '2021-02-04',
 '2020-07-20',
 '2019-10-04',
 '2020-07-22',
 '2021-03-19',
 '2020-05-08',
 '2008-09-23',
 '2020-11-26',
 '2021-03-04',
 '2012-01-01',
 '2021-03-19',
 '2021-03-19',
 '2020-12-11',
 '1985-06-01',
 '2020-11-25',
 '2021-03-19',
 '2021-03-19',
 '2021-03-19',
 '2019-11-07',
 '2020-09-04',
 '2021-03-19',
 '2021-01-12',
 '2021-03-05',
 '2020-11-26',
 '2020-11-20',
 '2021-03-19',
 '2016-11-25',
 '2020-07-24',
 '2020-11-13',
 '2021-03-19',
 '2010-01-01',
 '2021-03-19',
 '2020-11-20',
 '2021-03-19',
 '2020-08-21',
 '2021-01-18',
 '2020-09-21',
 '2016-02-05',
 '2020-11-20',
 '2021-03-12',
 '2021-03-12',
 '2021-03-04',
 '2018-11-01',
 '2021-03-12',
 '2020-06-12',
 '2020-10-16',
 '2019-11-07',
 '2018-03-16',
 '2015-07-17',
 '2016-11-25',
 '2020-07-03',
 '2019-10-04',
 '2019-06-14',
 '2020-10-02',
 '2021-03-11',
 '2021-08-13',
 '2021-02-26',
 '2021-03-05',
 '2021-03-03',
 '2021-01-22',
 '2021-03-04',
 '2021-03-04',
 '2020-09-02',
 '2020-05-08',
 '2015',
 '2015-10-16',
 '2020-03-20',
 '2021-02-25',
 '2021-02-19',
 '2021-02-26',
 '2020-10-23',
 '2020-08-14',
 '2019-10-10',
 '2018-04-27',
 '2021-02-26',
 '2018-10-04',
 '2020-11-27',
 '2020-02-14',
 '2019-08-02',
 '2014-06-21',
 '2021-02-26',
 '2021-02-18',
 '2021-08-13',
 '2021-01-22',
 '2020-10-22',
 '2020-11-06',
 '2021-01-21',
 '2020-11-13',
 '2020-09-04',
 '2020-10-30',
 '2019-05-10',
 '2021-08-12',
 '2020-11-27',
 '2020-11-05',
 '2020-07-09',
 '2020-12-17',
 '2018-03-16',
 '2014-06-21',
 '2020-04-03',
 '2015-08-28',
 '2019-11-26',
 '1995',
 '2017-08-25',
 '2021-01-15',
 '2021-01-21',
 '2021-02-04',
 '2015-08-28',
 '2021-01-29',
 '2018-03-30',
 '2020-08-14',
 '2021-02-05',
 '2013-04-01',
 '2021-01-28',
 '2020-05-19',
 '2020-03-20',
 '2020-10-30',
 '2018-06-29',
 '2020-11-20',
 '2020-04-03',
 '2021-01-14',
 '2021-01-29',
 '2021-01-08',
 '2019-11-27',
 '2020-07-09',
 '2021-01-08',
 '2020-04-24',
 '2021-01-22',
 '2020-11-07',
 '2020-11-20',
 '2021-03-12',
 '2020-12-09',
 '2020-09-17',
 '2020-10-26',
 '2020-11-13',
 '2020-12-11',
 '2020-05-08',
 '2020-12-08',
 '2021-01-15',
 '2021-08-11',
 '2020-06-19',
 '2020-09-18',
 '2020-10-02',
 '2020-06-11',
 '2020-06-05',
 '2021-01-08',
 '2021-01-08',
 '2021-01-08',
 '2020-11-27',
 '2020-10-31',
 '2020-07-03',
 '2021-01-08',
 '2021-01-08',
 '2020-04-17',
 '2020-10-02',
 '2020-05-01',
 '2020-10-29',
 '2021-01-01',
 '2020-12-11',
 '2019-01-06',
 '2020-10-16',
 '2018-08-17',
 '1978-11-10',
 '2020-03-19',
 '2020-05-29',
 '2020-12-11',
 '2019-10-11',
 '2020-10-26',
 '2020-09-25',
 '1994-11-01',
 '1984-11-29',
 '2014-11-24',
 '1964-10-19',
 '1957-12-02',
 '2012-11-09',
 '1963-11-24',
 '2013-10-29',
 '2002-09-24',
 '1959-01-01',
 '1994-11-08',
 '2011-01-01',
 '1963',
 '2004-01-01',
 '1980-05-16',
 '2012-11-09',
 '2010-10-05',
 '1962',
 '1942-01-01',
 '2020-12-25',
 '1965-01-01',
 '1988-09-05',
 '1991-11-25',
 '1957-10-15',
 '2021-08-06',
 '2021-08-06',
 '2010-11-30',
 '1999-08-20',
 '1963',
 '1973-10-05',
 '2020-12-11',
 '2019-12-13',
 '2020-11-13',
 '1983-11-01',
 '2016-12-16',
 '2014-12-05',
 '2018-11-01',
 '2000-11-14',
 '2016-10-28',
 '2019-11-08',
 '2020-10-15',
 '2017-10-06',
 '2020-12-18',
 '2021-08-06',
 '2012-11-09',
 '2014-10-10',
 '2020-12-11',
 '1970-10-15',
 '2002',
 '1990',
 '2020-12-11',
 '2021-08-06',
 '2020-12-07',
 '1959-01-01',
 '1981',
 '2013-12-02',
 '2021-08-06',
 '2020-10-30',
 '2015-01-09',
 '1959-01-01',
 '1957-09',
 '1957-09',
 '2020-12-04',
 '2006-01-01',
 '2017-12-06',
 '1990-04-03',
 '2014-12-09',
 '2019-12-06',
 '2018-11-15',
 '1998-10-30',
 '2020-12-11',
 '1994',
 '2020-10-30',
 '2020-12-11',
 '2020-12-11',
 '2020-12-11',
 '2020-12-11',
 '2020-12-11',
 '2020-12-11',
 '2020-12-11',
 '2020-12-11',
 '2020-12-11',
 '2020-12-11',
 '2020-12-11',
 '2020-12-11',
 '2020-12-11',
 '2020-12-11',
 '2020-12-11',
 '2020-12-11',
 '2020-11-27',
 '2020-12-11',
 '2020-12-11',
 '2020-07-24',
 '2020-12-02',
 '2020-12-04',
 '2020-12-04',
 '2020-06-24',
 '2020-11-27',
 '2020-12-03',
 '2020-11-27',
 '2020-11-20',
 '2020-07-24',
 '2020-02-29',
 '2020-11-27',
 '2020-07-24',
 '2020-12-04',
 '2020-11-06',
 '2020-11-27',
 '2020-11-27',
 '2020-11-27',
 '2020-11-27',
 '2020-11-27',
 '2020-11-27',
 '2020-11-27',
 '1980-05-16',
 '2020-11-20',
 '2020-11-20',
 '2020-11-20',
 '2020-11-20',
 '2020-09-29',
 '2020-11-27',
 '2020-11-27',
 '2020-05-09',
 '2020-11-20',
 '2020-12-18',
 '2020-08-21',
 '2020-11-13',
 '2020-12-18',
 '2020-01-24',
 '2020-12-18',
 '2020-06-12',
 '2020-11-06',
 '2020-07-10',
 '2020-10-02',
 '2020-10-02',
 '2013-03-15',
 '2020-12-18',
 '2020-10-02',
 '2019-03-22',
 '2019-10-04',
 '2020-03-27',
 '2020-05-01',
 '2020-04-21',
 '2020-11-13',
 '2020-07-10',
 '2020-10-28',
 '2020-09-18',
 '2020-11-13',
 '2020-03-13',
 '2020-10-30',
 '2020-10-23',
 '2020-11-06',
 '2020-09-15',
 '2020-11-13',
 '2006-01-01',
 '2020-07-16',
 '2020-10-30',
 '2019-10-04',
 '2020-05-29',
 '2018-10-08',
 '2020-12-02',
 '2020-10-23',
 '2018-10-26',
 '2020-09-11',
 '2021-07-30',
 '2020-08-14',
 '2020-10-30',
 '2020-10-30',
 '2020-10-30',
 '2020-11-06',
 '2020-08-20',
 '2020-08-21',
 '2020-08-17',
 '2020-10-02',
 '2020-07-03',
 '2020-10-30',
 '2020-10-30',
 '2019-06-28',
 '2020-03-26',
 '2020-03-13',
 '2021-08-09',
 '2020-10-30',
 '2020-10-30',
 '2020-10-30',
 '2017-09-27',
 '2019-01-11',
 '2020-10-29',
 '2020-10-30',
 '1979-07-27',
 '2020-09-18',
 '2020-08-21',
 '2020-09-04',
 '2020-09-25',
 '2020-10-02',
 '2020-09-04',
 '2019-02-08',
 '2020-04-10',
 '2020-10-13',
 '2021-08-13',
 '2020-04-24',
 '2020-07-23',
 '2013-02-17',
 '2015-08-14',
 '2020-05-29',
 '2011-01-24',
 '2019-11-08',
 '2020-10-09',
 '2020-10-02',
 '2020-10-16',
 '2020-08-15',
 '2020-10-02',
 '2020-10-09',
 '2020-05-28',
 '2020-06-10',
 '2020-02-29',
 '2021-03-19',
 '2020-05-22',
 '2020-03-26',
 '2020-05-15',
 '2019-05-17',
 '2021-08-03',
 '2019-11-15',
 '2020-10-02',
 '2020-09-25',
 '2020-09-25',
 '2020-06-30',
 '2020-06-26',
 '2019-06-28',
 '2020-10-02',
 '2020-10-02',
 '2020-10-02',
 '2020-10-02',
 '2020-10-02',
 '2020-10-02',
 '2020-10-02',
 '2020-09-30',
 '2020-09-24',
 '2020-10-02',
 '2020-08-20',
 '2020-10-02',
 '2020-10-02',
 '2020-04-16',
 '2020-09-25',
 '2020-08-06',
 '2020-09-25',
 '2020-09-25',
 '2020-09-25',
 '2020-09-25',
 '2019-06-14',
 '2019-06-21',
 '2020-06-17',
 '2020-09-25',
 '2020-09-25',
 '2020-08-28',
 '2020-07-10',
 '2020-09-25',
 '2020-07-10',
 '2020-06-24',
 '2012',
 '2020-08-29',
 '1978-11-23',
 '2020-04-24',
 '2020-05-15',
 '2021-04-28',
 '2021-04-28',
 '2020-07-24',
 '2020-07-10',
 '2020-07-10',
 '2020-09-18',
 '2020-05-01',
 '2020-05-08',
 '2020-03-27',
 '2020-09-18',
 '2020-09-18',
 '2020-09-18',
 '2020-09-04',
 '2020-09-04',
 '2021-06-11',
 '2020-09-10',
 '2020-09-10',
 '2019-11-08',
 '2020-05-01',
 '2020-09-10',
 '2009-01-01',
 '2020-01-23',
 '2020-08-20',
 '2020-01-10',
 '2020-04-29',
 '2020-09-04',
 '2020-08-28',
 '2020-08-17',
 '2020-08-22',
 '2020-04-23',
 '2020-06-19',
 '2020-08-27',
 '2020-06-26',
 '2019-08-28',
 '2020-07-24',
 '2019-10-12',
 '2020-08-07',
 '2020-07-17',
 '2021-04-23',
 '2020-05-15',
 '2020-08-27',
 '2019-12-27',
 '2020-07-16',
 '2020-05-29',
 '2020-05-28',
 '2020-08-07',
 '2019-10-04',
 '2020-05-01',
 '2020-08-14',
 '2020-09-17',
 '2020-02-21',
 '2020-07-30',
 '2020-07-10',
 '2021-08-09',
 '2020-08-21',
 '2020-07-24',
 '2020-08-19',
 '2019-06-28',
 '2020-07-14',
 '2008-05-12',
 '2018-06-15',
 '2020-07-24',
 '2020-08-13',
 '2020-07-24',
 '2020-04-03',
 '2020-05-15',
 '2020-03-04',
 '2020-07-20',
 '2020-03-27',
 '2019-03-15',
 '2020-07-03',
 '2020-11-13',
 '2020-05-08',
 '2020-07-24',
 '2019-06-28',
 '2020-02-07',
 '2020-03-19',
 '2019-09-06',
 '2020-07-24',
 '2020-07-24',
 '2020-07-24',
 '2020-07-24',
 '2020-08-04',
 '2020-07-24',
 '2020-08-07',
 '2020-08-02',
 '2020-07-31',
 '2020-04-24',
 '2020-03-06',
 '2020-07-24',
 '2020-07-24',
 '2020-07-24',
 '2020-09-25',
 '2020-07-31',
 '2020-09-10',
 '2020-07-31',
 '2020-07-24',
 '2020-07-10',
 '2020-07-30',
 '2020-03-19',
 '2020-05-15',
 '2020-07-09',
 '2020-10-22',
 '2020-01-09',
 '2021-08-13',
 '2020-07-22',
 '2020-07-23',
 '2020-07-24',
 '2020-07-10',
 '2012-05-25',
 '2020-10-02',
 '2020-07-10',
 '2020-07-10',
 '2020-07-10',
 '2020-01-12',
 '2020-06-12',
 '2013-11-25',
 '2020-07-24',
 '2020-07-10',
 '2020-07-10',
 '2020-07-10',
 '2020-07-10',
 ...]
badlist
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

String slicing example

s = "Hello there how are you?"
s[1:5]
'ello'
s[::2]
'Hloteehwaeyu'

Function practice

  • Write a function f(s) which takes as input a string of the form “????-04-??” and returns “April”, a string of the form “????-05-??” and returns “May”, and for all other input strings returns “Neither”. (Hint. First check if the length of the string is incorrect. Try to make your code DRY.)

  • Apply this function f to every entry in the “Release Date” column using map.

  • Put the resulting pandas Series in a new column called “SpringMonth”. This is an example of Feature Engineering.

def f(s):
    sub = s[4:8]
    if len(s) != 10:
        return "Neither"
    elif sub == "-04-":
        return "April"
    elif sub == "-05-":
        return "May"
    else:
        return "Neither"
f("2022-04-13")
'April'
df["Release Date"].map(f)
0       Neither
1       Neither
2           May
3       Neither
4       Neither
         ...   
1551    Neither
1552    Neither
1553    Neither
1554    Neither
1555    Neither
Name: Release Date, Length: 1556, dtype: object

Creating a new column like this is what’s called feature engineering. The hope is that the new column will be easier to work with. The new column appears all the way at the right of the DataFrame.

df["SpringMonth"] = df["Release Date"].map(f)
df
Index Highest Charting Position Number of Times Charted Week of Highest Charting Song Name Streams Artist Artist Followers Song ID Genre ... Energy Loudness Speechiness Acousticness Liveness Tempo Duration (ms) Valence Chord SpringMonth
0 1 1 8 2021-07-23--2021-07-30 Beggin' 48,633,449 Måneskin 3377762 3Wrjm47oTz2sjIgck11l5e ['indie rock italiano', 'italian pop'] ... 0.8 -4.808 0.0504 0.127 0.359 134.002 211560 0.589 B Neither
1 2 2 3 2021-07-23--2021-07-30 STAY (with Justin Bieber) 47,248,719 The Kid LAROI 2230022 5HCyWlXZPP0y6Gqq8TgA20 ['australian hip hop'] ... 0.764 -5.484 0.0483 0.0383 0.103 169.928 141806 0.478 C#/Db Neither
2 3 1 11 2021-06-25--2021-07-02 good 4 u 40,162,559 Olivia Rodrigo 6266514 4ZtFanR9U6ndgddUvNcjcG ['pop'] ... 0.664 -5.044 0.154 0.335 0.0849 166.928 178147 0.688 A May
3 4 3 5 2021-07-02--2021-07-09 Bad Habits 37,799,456 Ed Sheeran 83293380 6PQ88X9TkUIAUIZJHW2upE ['pop', 'uk pop'] ... 0.897 -3.712 0.0348 0.0469 0.364 126.026 231041 0.591 B Neither
4 5 5 1 2021-07-23--2021-07-30 INDUSTRY BABY (feat. Jack Harlow) 33,948,454 Lil Nas X 5473565 27NovPIUIRrOZoCHxABJwK ['lgbtq+ hip hop', 'pop rap'] ... 0.704 -7.409 0.0615 0.0203 0.0501 149.995 212000 0.894 D#/Eb Neither
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1551 1552 195 1 2019-12-27--2020-01-03 New Rules 4,630,675 Dua Lipa 27167675 2ekn2ttSfGqwhhate0LSR0 ['dance pop', 'pop', 'uk pop'] ... 0.7 -6.021 0.0694 0.00261 0.153 116.073 209320 0.608 A Neither
1552 1553 196 1 2019-12-27--2020-01-03 Cheirosa - Ao Vivo 4,623,030 Jorge & Mateus 15019109 2PWjKmjyTZeDpmOUa3a5da ['sertanejo', 'sertanejo universitario'] ... 0.87 -3.123 0.0851 0.24 0.333 152.37 181930 0.714 B Neither
1553 1554 197 1 2019-12-27--2020-01-03 Havana (feat. Young Thug) 4,620,876 Camila Cabello 22698747 1rfofaqEpACxVEHIZBJe6W ['dance pop', 'electropop', 'pop', 'post-teen ... ... 0.523 -4.333 0.03 0.184 0.132 104.988 217307 0.394 D Neither
1554 1555 198 1 2019-12-27--2020-01-03 Surtada - Remix Brega Funk 4,607,385 Dadá Boladão, Tati Zaqui, OIK 208630 5F8ffc8KWKNawllr5WsW0r ['brega funk', 'funk carioca'] ... 0.55 -7.026 0.0587 0.249 0.182 154.064 152784 0.881 F Neither
1555 1556 199 1 2019-12-27--2020-01-03 Lover (Remix) [feat. Shawn Mendes] 4,595,450 Taylor Swift 42227614 3i9UVldZOE0aD0JnyfAZZ0 ['pop', 'post-teen pop'] ... 0.603 -7.176 0.064 0.433 0.0862 205.272 221307 0.422 G Neither

1556 rows × 24 columns

Release months

  • Import the Spotify dataset again, this time specifying the missing values using the na_values keyword argument.

  • Drop the rows which have missing data. A quick way to do this is using df = df.dropna().

  • Check: df should have 1545 rows and 23 columns.

df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
df = df.dropna()
df.shape
(1545, 23)
  • Convert the “Release Date” column to datetime values, using pd.to_datetime. Put the result in the same column (i.e., overwrite the current column).

  • Check: if you evaluate df.dtypes, you should see that the “Release Date” column has dtype as datetime64[ns].

df["Release Date"] = pd.to_datetime(df["Release Date"])
  • If s is a pandas Series with some datetime as its dtype, you can get the corresponding month by using the code s.dt.month_name(). Use this method to insert a new column in df called “Release Month”.

df["Release Date"].dt.month_name()
0        December
1            July
2             May
3            June
4            July
          ...    
1551         June
1552      October
1553      January
1554    September
1555     November
Name: Release Date, Length: 1545, dtype: object