Working with dates in pandas
Contents
Working with dates in pandas¶
The first part is about how to find missing values (for example, how to know that they are represented by a blank space in the Spotify dataset). The second part is about working with dates in pandas.
count¶
lists and strings both have a count
method.
my_list = [3,4,3135]
my_list.count(3)
1
s = "Hello there, how are you?"
s.count("e")
4
s.count("re")
2
s.count("er")
1
How to identify na_values?¶
The missing values in the spotify dataset are denoted by a blank space ” “. If you didn’t know to look for ” “, how could you identify it? Here is one approach.
Import the attached Spotify dataset as
df
. Do not specifyna_values
.Most of the cells in the “Release Date” column contain a string that looks like “2017-12-08”. Using list comprehension and the
count
method from above, make a listshortlist
containing all the entries which have fewer than two hyphen (minus) symbols.Do the same thing using a for loop and an if statement.
Check:
shortlist
should have 28 entries.One at a time, print the elements with fewer than two hyphen symbols. (Use a for loop for this part.)
import pandas as pd
df = pd.read_csv("../data/spotify_dataset.csv")
df.loc[:10,"Release Date"]
0 2017-12-08
1 2021-07-09
2 2021-05-21
3 2021-06-25
4 2021-07-23
5 2021-03-31
6 2021-04-09
7 2021-05-20
8 2021-06-04
9 2021-03-19
10 2020-10-01
Name: Release Date, dtype: object
shortlist = [s for s in df["Release Date"] if s.count("-") < 2]
shortlist
[' ',
'2014',
' ',
'1995',
' ',
'2015',
' ',
'1995',
'1963',
' ',
'1962',
'1963',
' ',
'2002',
'1990',
'1981',
'1957-09',
'1957-09',
'1994',
' ',
' ',
' ',
'2012',
'1981',
' ',
'2016',
'2016',
' ']
Here is an alternative approach.
altlist = [s for s in df["Release Date"] if len(s) < 8]
altlist
[' ',
'2014',
' ',
'1995',
' ',
'2015',
' ',
'1995',
'1963',
' ',
'1962',
'1963',
' ',
'2002',
'1990',
'1981',
'1957-09',
'1957-09',
'1994',
' ',
' ',
' ',
'2012',
'1981',
' ',
'2016',
'2016',
' ']
len(shortlist)
28
Compare the following for loop version with the list comprehension version.
[s for s in df["Release Date"] if s.count("-") < 2]
shortlist2 = []
for s in df["Release Date"]:
if s.count("-") < 2:
shortlist2.append(s)
shortlist2
[' ',
'2014',
' ',
'1995',
' ',
'2015',
' ',
'1995',
'1963',
' ',
'1962',
'1963',
' ',
'2002',
'1990',
'1981',
'1957-09',
'1957-09',
'1994',
' ',
' ',
' ',
'2012',
'1981',
' ',
'2016',
'2016',
' ']
Here we look at all of the values in shortlist
. You should not put print
statements in a list comprehension, so we use a for loop.
for s in shortlist:
print(s)
2014
1995
2015
1995
1963
1962
1963
2002
1990
1981
1957-09
1957-09
1994
2012
1981
2016
2016
There is a pandas function pd.to_datetime
that can be applied to both a string and also to a pandas Series.
d = pd.to_datetime("2022-04-13")
d
Timestamp('2022-04-13 00:00:00')
This new data type has lots of convenient methods and attributes.
d.month
4
d.day_name()
'Wednesday'
Notice the dtype
of the following pandas Series.
pd.to_datetime(df.loc[:10,"Release Date"])
0 2017-12-08
1 2021-07-09
2 2021-05-21
3 2021-06-25
4 2021-07-23
5 2021-03-31
6 2021-04-09
7 2021-05-20
8 2021-06-04
9 2021-03-19
10 2020-10-01
Name: Release Date, dtype: datetime64[ns]
What if we want to convert these into the day names? The first thing we might try does not work.
pd.to_datetime(df.loc[:10,"Release Date"]).day_name()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_38287/3366743804.py in <module>
----> 1 pd.to_datetime(df.loc[:10,"Release Date"]).day_name()
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
5485 ):
5486 return self[name]
-> 5487 return object.__getattribute__(self, name)
5488
5489 def __setattr__(self, name: str, value) -> None:
AttributeError: 'Series' object has no attribute 'day_name'
The trick is to use the dt
accessor. Here is the documentation, although I don’t find this particular documentation very helpful.
pd.to_datetime(df.loc[:10,"Release Date"]).dt.day_name()
0 Friday
1 Friday
2 Friday
3 Friday
4 Friday
5 Wednesday
6 Friday
7 Thursday
8 Friday
9 Friday
10 Thursday
Name: Release Date, dtype: object
If we try to apply pd.to_datetime
to the entire column, an error will be raised. This is because of the blank spaces in this column.
pd.to_datetime(df["Release Date"])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, allow_mixed)
2191 try:
-> 2192 values, tz_parsed = conversion.datetime_to_datetime64(data.ravel("K"))
2193 # If tzaware, these values represent unix timestamps, so we
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.datetime_to_datetime64()
TypeError: Unrecognized value type: <class 'str'>
During handling of the above exception, another exception occurred:
ParserError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_38287/4002128276.py in <module>
----> 1 pd.to_datetime(df["Release Date"])
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/tools/datetimes.py in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
881 result = result.tz_localize(tz) # type: ignore[call-arg]
882 elif isinstance(arg, ABCSeries):
--> 883 cache_array = _maybe_cache(arg, format, cache, convert_listlike)
884 if not cache_array.empty:
885 result = arg.map(cache_array)
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/tools/datetimes.py in _maybe_cache(arg, format, cache, convert_listlike)
193 unique_dates = unique(arg)
194 if len(unique_dates) < len(arg):
--> 195 cache_dates = convert_listlike(unique_dates, format)
196 cache_array = Series(cache_dates, index=unique_dates)
197 # GH#39882 and GH#35888 in case of None and NaT we get duplicates
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/tools/datetimes.py in _convert_listlike_datetimes(arg, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
406 errors=errors,
407 require_iso8601=require_iso8601,
--> 408 allow_object=True,
409 )
410
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, allow_mixed)
2196 return values.view("i8"), tz_parsed
2197 except (ValueError, TypeError):
-> 2198 raise err
2199
2200 if tz_parsed is not None:
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, allow_mixed)
2185 yearfirst=yearfirst,
2186 require_iso8601=require_iso8601,
-> 2187 allow_mixed=allow_mixed,
2188 )
2189 result = result.reshape(data.shape, order=order)
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime()
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime()
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/tslib.pyx in pandas._libs.tslib._array_to_datetime_object()
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/tslib.pyx in pandas._libs.tslib._array_to_datetime_object()
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/tslibs/parsing.pyx in pandas._libs.tslibs.parsing.parse_datetime_string()
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/dateutil/parser/_parser.py in parse(timestr, parserinfo, **kwargs)
1366 return parser(parserinfo).parse(timestr, **kwargs)
1367 else:
-> 1368 return DEFAULTPARSER.parse(timestr, **kwargs)
1369
1370
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/dateutil/parser/_parser.py in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
644
645 if len(res) == 0:
--> 646 raise ParserError("String does not contain a date: %s", timestr)
647
648 try:
ParserError: String does not contain a date:
shortlist
[' ',
'2014',
' ',
'1995',
' ',
'2015',
' ',
'1995',
'1963',
' ',
'1962',
'1963',
' ',
'2002',
'1990',
'1981',
'1957-09',
'1957-09',
'1994',
' ',
' ',
' ',
'2012',
'1981',
' ',
'2016',
'2016',
' ']
Give an example of an element
x
inshortlist
for whichpd.to_datetime(x)
raises an error, and an example of an elementy
inshortlist
for whichpd.to_datetime(y)
does not raise an error.
pd.to_datetime(" ")
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, allow_mixed)
2191 try:
-> 2192 values, tz_parsed = conversion.datetime_to_datetime64(data.ravel("K"))
2193 # If tzaware, these values represent unix timestamps, so we
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.datetime_to_datetime64()
TypeError: Unrecognized value type: <class 'str'>
During handling of the above exception, another exception occurred:
ParserError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_38287/4212689216.py in <module>
----> 1 pd.to_datetime(" ")
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/tools/datetimes.py in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
912 result = convert_listlike(arg, format)
913 else:
--> 914 result = convert_listlike(np.array([arg]), format)[0]
915
916 # error: Incompatible return value type (got "Union[Timestamp, NaTType,
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/tools/datetimes.py in _convert_listlike_datetimes(arg, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
406 errors=errors,
407 require_iso8601=require_iso8601,
--> 408 allow_object=True,
409 )
410
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, allow_mixed)
2196 return values.view("i8"), tz_parsed
2197 except (ValueError, TypeError):
-> 2198 raise err
2199
2200 if tz_parsed is not None:
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, allow_mixed)
2185 yearfirst=yearfirst,
2186 require_iso8601=require_iso8601,
-> 2187 allow_mixed=allow_mixed,
2188 )
2189 result = result.reshape(data.shape, order=order)
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime()
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime()
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/tslib.pyx in pandas._libs.tslib._array_to_datetime_object()
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/tslib.pyx in pandas._libs.tslib._array_to_datetime_object()
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/tslibs/parsing.pyx in pandas._libs.tslibs.parsing.parse_datetime_string()
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/dateutil/parser/_parser.py in parse(timestr, parserinfo, **kwargs)
1366 return parser(parserinfo).parse(timestr, **kwargs)
1367 else:
-> 1368 return DEFAULTPARSER.parse(timestr, **kwargs)
1369
1370
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/dateutil/parser/_parser.py in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
644
645 if len(res) == 0:
--> 646 raise ParserError("String does not contain a date: %s", timestr)
647
648 try:
ParserError: String does not contain a date:
pd.to_datetime("1957-09")
Timestamp('1957-09-01 00:00:00')
try and except¶
We would like to find the “bad” cells in df["Release Date"]
using something like the following pseudocode:
badlist = []
for x in df["Release Date"]:
if pd.to_datetime(x) raises an error:
badlist.append(x)
The correct way to do that does not involve an if statement. Instead it involves try
and except
. Here we also put the “good” values into their own list.
badlist = []
goodlist = []
for x in df["Release Date"]:
try:
pd.to_datetime(x)
goodlist.append(x)
except:
badlist.append(x)
goodlist
['2017-12-08',
'2021-07-09',
'2021-05-21',
'2021-06-25',
'2021-07-23',
'2021-03-31',
'2021-04-09',
'2021-05-20',
'2021-06-04',
'2021-03-19',
'2020-10-01',
'2021-05-28',
'2021-07-09',
'2021-03-19',
'2021-06-04',
'2021-05-21',
'2021-05-21',
'2021-04-23',
'2021-06-24',
'2021-06-25',
'2021-07-08',
'2021-06-25',
'2021-05-21',
'2020-08-07',
'2020-03-20',
'2021-05-21',
'2021-06-24',
'2021-07-09',
'2021-01-15',
'2021-01-06',
'2021-03-05',
'2021-07-23',
'2021-05-21',
'2021-04-16',
'2019-12-13',
'2021-04-09',
'2021-06-25',
'2021-07-09',
'2020-11-27',
'2021-04-29',
'2021-05-21',
'2021-06-18',
'2020-11-20',
'2020-09-16',
'2021-05-20',
'2021-03-26',
'2021-07-08',
'2021-06-15',
'2021-05-14',
'2021-06-30',
'2021-04-01',
'2021-07-30',
'2021-06-25',
'2021-05-21',
'2021-05-15',
'2021-06-01',
'2021-02-14',
'2021-06-11',
'2021-05-05',
'2021-06-11',
'2021-01-15',
'2021-05-21',
'2021-04-23',
'2020-03-27',
'2021-08-13',
'2018-02-02',
'2013-04-19',
'2021-07-08',
'2019-05-17',
'2021-07-23',
'2021-06-02',
'2021-03-26',
'2021-05-13',
'2017-12-14',
'2021-03-19',
'2020-12-04',
'2021-04-23',
'2021-02-04',
'2019-10-17',
'2020-11-06',
'2021-03-05',
'2021-06-10',
'2021-08-13',
'2021-01-15',
'2020-07-03',
'2021-02-26',
'2019-06-03',
'2021-07-22',
'2019-03-07',
'2021-02-05',
'2020-10-30',
'2019-09-06',
'2021-07-16',
'2021-04-20',
'2021-08-12',
'2021-06-25',
'2021-04-27',
'2021-07-08',
'2019-11-22',
'2020-10-01',
'2021-06-04',
'2021-06-16',
'2017-06-23',
'2020-08-14',
'2010-10-05',
'2020-11-20',
'2021-04-15',
'2021-06-03',
'2021-07-09',
'2018-04-19',
'2018-12-14',
'2019-05-24',
'2019-11-07',
'2021-06-04',
'2021-06-17',
'2017-03-03',
'2019-06-21',
'2021-08-13',
'2017-03-03',
'2021-06-21',
'2020-11-27',
'2020-04-17',
'2020-03-20',
'2020-04-17',
'2014-01-01',
'2014-09-09',
'1991-09-26',
'2021-05-21',
'2015-10-30',
'2021-06-01',
'2021-05-07',
'2021-06-11',
'2013-09-17',
'2021-04-30',
'2014-09-19',
'2021-01-08',
'2021-06-19',
'2021-02-05',
'2020-11-05',
'2021-07-21',
'1975-11-21',
'2020-12-10',
'2020-10-30',
'2021-01-15',
'2021-01-08',
'2021-07-14',
'2018-10-05',
'2021-05-28',
'2021-04-29',
'2014-01-01',
'2000-07-10',
'2021-07-13',
'2021-06-01',
'2013-01-01',
'2021-02-26',
'2021-08-13',
'2020-07-03',
'2014',
'2021-06-25',
'2021-06-18',
'2018-12-10',
'2021-04-02',
'2021-08-13',
'2021-02-12',
'2020-07-03',
'2021-05-21',
'2020-08-07',
'2004-06-15',
'1983-06-17',
'1977-02-04',
'2016-10-28',
'2021-07-22',
'2021-07-16',
'2021-07-16',
'2020-02-08',
'2021-05-21',
'1980-06-27',
'2016-05-06',
'2019-12-13',
'2021-06-10',
'2021-06-04',
'2021-06-25',
'2013-08-30',
'2020-05-15',
'2019-07-26',
'2021-01-07',
'2015-08-28',
'2020-12-25',
'2021-07-23',
'2021-07-06',
'2019-02-08',
'2020-12-04',
'2021-02-25',
'2019-12-19',
'2021-07-18',
'2021-06-04',
'2021-07-22',
'2000-05-23',
'2021-07-31',
'2021-07-31',
'2021-07-16',
'2021-07-31',
'2021-07-31',
'2021-06-18',
'2021-07-31',
'2021-07-31',
'2021-06-25',
'2021-06-24',
'2021-04-22',
'2021-07-31',
'2021-01-28',
'2021-06-17',
'2019-07-26',
'2021-08-12',
'2021-07-16',
'2021-07-02',
'2021-04-09',
'2021-05-13',
'2021-03-12',
'2021-02-11',
'2020-11-20',
'2021-07-09',
'2021-03-12',
'2021-07-09',
'2021-05-21',
'2021-05-26',
'2021-06-11',
'2021-07-09',
'2021-07-09',
'2021-03-18',
'2021-03-27',
'2021-03-05',
'2021-06-25',
'2016-07-29',
'1995',
'2021-07-01',
'2021-06-25',
'2021-07-02',
'2021-06-25',
'2021-06-10',
'2021-04-30',
'2021-04-23',
'2021-06-10',
'2021-07-30',
'2021-07-01',
'2020-08-20',
'2021-01-29',
'2020-11-20',
'2011-11-11',
'2020-10-22',
'2021-05-13',
'2021-06-09',
'2013-09-09',
'2021-06-25',
'2021-06-25',
'2021-06-25',
'2021-06-25',
'2021-06-25',
'2021-06-25',
'2021-06-25',
'2021-06-25',
'2021-06-25',
'2021-06-25',
'2021-06-25',
'2021-06-25',
'2021-06-25',
'2021-06-11',
'2021-06-25',
'2021-06-25',
'2021-06-11',
'2020-07-24',
'2020-06-11',
'2021-06-18',
'2020-07-31',
'2021-06-17',
'2021-06-04',
'2021-05-14',
'2021-05-14',
'2017-08-25',
'2021-06-11',
'2019-12-06',
'2021-01-21',
'2012-10-09',
'2021-08-13',
'2020-09-25',
'2010-10-05',
'2019-10-04',
'2020-03-27',
'2020-08-14',
'2021-05-13',
'2021-02-03',
'2000-10-24',
'2014-12-09',
'2018-08-03',
'2017-04-07',
'2021-04-16',
'2021-05-14',
'2021-05-21',
'2021-06-11',
'2021-06-11',
'2021-06-11',
'2020-09-18',
'2021-06-11',
'2020-11-02',
'2021-06-11',
'2021-06-11',
'2021-08-13',
'2020-11-27',
'2021-04-09',
'2021-08-13',
'2021-05-21',
'2021-06-11',
'2020-05-22',
'2021-06-04',
'2021-07-16',
'2021-06-04',
'2021-05-31',
'2021-05-14',
'2021-05-06',
'2021-04-16',
'2020-10-08',
'2019-12-13',
'2021-03-19',
'2021-08-13',
'2021-01-15',
'2021-05-28',
'2020-12-18',
'2018-12-10',
'2021-05-28',
'2019-04-12',
'2021-01-22',
'2020-01-17',
'2020-02-07',
'2011-01-01',
'2020-09-18',
'2021-04-02',
'2021-05-14',
'2021-06-04',
'2021-04-01',
'2020-11-06',
'2021-05-21',
'2021-05-21',
'2021-05-14',
'2021-05-14',
'2021-01-21',
'2021-03-10',
'2021-04-30',
'2021-05-14',
'2021-03-05',
'2021-04-30',
'2020-02-14',
'2021-03-10',
'2021-04-02',
'2021-05-21',
'2021-03-13',
'2021-05-21',
'2021-05-14',
'2021-05-14',
'2021-05-14',
'2021-05-14',
'2021-05-14',
'2021-05-14',
'2021-04-30',
'2021-04-28',
'2021-03-11',
'2012-04-30',
'2019-12-13',
'2020-12-15',
'2020-11-20',
'2021-08-09',
'2020-10-16',
'2005-02-22',
'2019-10-17',
'2021-04-30',
'2021-04-22',
'2021-08-13',
'2021-05-04',
'2021-03-10',
'2020-11-13',
'2020-07-17',
'2020-01-10',
'2021-05-13',
'2020-07-10',
'2021-02-08',
'2021-01-22',
'2021-03-01',
'2020-07-10',
'2019-03-08',
'2021-04-22',
'2020-08-25',
'2021-04-30',
'2021-04-29',
'2020-07-17',
'2021-04-30',
'2021-05-21',
'2021-03-12',
'2021-04-09',
'2021-08-13',
'2021-04-09',
'2021-04-02',
'2020-11-06',
'2021-04-02',
'2020-09-23',
'2021-03-19',
'2021-04-23',
'2021-01-15',
'2020-04-24',
'2020-10-15',
'2016-10-28',
'2020-11-20',
'2020-03-20',
'2018-03-16',
'2021-03-26',
'2020-08-21',
'2012-12-07',
'2021-08-12',
'2015-05-15',
'2021-04-16',
'2021-04-16',
'2021-04-09',
'2020-11-12',
'2021-02-19',
'2020-12-11',
'2013-09-09',
'2021-08-13',
'2019-06-28',
'2021-04-09',
'2021-04-09',
'2007-06-12',
'2021-04-09',
'2021-04-09',
'2021-03-26',
'2021-04-09',
'2021-04-09',
'2021-04-09',
'2021-04-09',
'1998-05-01',
'2021-04-09',
'2021-04-09',
'2021-04-09',
'2021-08-09',
'2020-09-04',
'2021-04-09',
'2021-04-09',
'2021-04-01',
'2021-03-26',
'2021-03-12',
'2021-02-04',
'2020-07-20',
'2019-10-04',
'2020-07-22',
'2021-03-19',
'2020-05-08',
'2008-09-23',
'2020-11-26',
'2021-03-04',
'2012-01-01',
'2021-03-19',
'2021-03-19',
'2020-12-11',
'1985-06-01',
'2020-11-25',
'2021-03-19',
'2021-03-19',
'2021-03-19',
'2019-11-07',
'2020-09-04',
'2021-03-19',
'2021-01-12',
'2021-03-05',
'2020-11-26',
'2020-11-20',
'2021-03-19',
'2016-11-25',
'2020-07-24',
'2020-11-13',
'2021-03-19',
'2010-01-01',
'2021-03-19',
'2020-11-20',
'2021-03-19',
'2020-08-21',
'2021-01-18',
'2020-09-21',
'2016-02-05',
'2020-11-20',
'2021-03-12',
'2021-03-12',
'2021-03-04',
'2018-11-01',
'2021-03-12',
'2020-06-12',
'2020-10-16',
'2019-11-07',
'2018-03-16',
'2015-07-17',
'2016-11-25',
'2020-07-03',
'2019-10-04',
'2019-06-14',
'2020-10-02',
'2021-03-11',
'2021-08-13',
'2021-02-26',
'2021-03-05',
'2021-03-03',
'2021-01-22',
'2021-03-04',
'2021-03-04',
'2020-09-02',
'2020-05-08',
'2015',
'2015-10-16',
'2020-03-20',
'2021-02-25',
'2021-02-19',
'2021-02-26',
'2020-10-23',
'2020-08-14',
'2019-10-10',
'2018-04-27',
'2021-02-26',
'2018-10-04',
'2020-11-27',
'2020-02-14',
'2019-08-02',
'2014-06-21',
'2021-02-26',
'2021-02-18',
'2021-08-13',
'2021-01-22',
'2020-10-22',
'2020-11-06',
'2021-01-21',
'2020-11-13',
'2020-09-04',
'2020-10-30',
'2019-05-10',
'2021-08-12',
'2020-11-27',
'2020-11-05',
'2020-07-09',
'2020-12-17',
'2018-03-16',
'2014-06-21',
'2020-04-03',
'2015-08-28',
'2019-11-26',
'1995',
'2017-08-25',
'2021-01-15',
'2021-01-21',
'2021-02-04',
'2015-08-28',
'2021-01-29',
'2018-03-30',
'2020-08-14',
'2021-02-05',
'2013-04-01',
'2021-01-28',
'2020-05-19',
'2020-03-20',
'2020-10-30',
'2018-06-29',
'2020-11-20',
'2020-04-03',
'2021-01-14',
'2021-01-29',
'2021-01-08',
'2019-11-27',
'2020-07-09',
'2021-01-08',
'2020-04-24',
'2021-01-22',
'2020-11-07',
'2020-11-20',
'2021-03-12',
'2020-12-09',
'2020-09-17',
'2020-10-26',
'2020-11-13',
'2020-12-11',
'2020-05-08',
'2020-12-08',
'2021-01-15',
'2021-08-11',
'2020-06-19',
'2020-09-18',
'2020-10-02',
'2020-06-11',
'2020-06-05',
'2021-01-08',
'2021-01-08',
'2021-01-08',
'2020-11-27',
'2020-10-31',
'2020-07-03',
'2021-01-08',
'2021-01-08',
'2020-04-17',
'2020-10-02',
'2020-05-01',
'2020-10-29',
'2021-01-01',
'2020-12-11',
'2019-01-06',
'2020-10-16',
'2018-08-17',
'1978-11-10',
'2020-03-19',
'2020-05-29',
'2020-12-11',
'2019-10-11',
'2020-10-26',
'2020-09-25',
'1994-11-01',
'1984-11-29',
'2014-11-24',
'1964-10-19',
'1957-12-02',
'2012-11-09',
'1963-11-24',
'2013-10-29',
'2002-09-24',
'1959-01-01',
'1994-11-08',
'2011-01-01',
'1963',
'2004-01-01',
'1980-05-16',
'2012-11-09',
'2010-10-05',
'1962',
'1942-01-01',
'2020-12-25',
'1965-01-01',
'1988-09-05',
'1991-11-25',
'1957-10-15',
'2021-08-06',
'2021-08-06',
'2010-11-30',
'1999-08-20',
'1963',
'1973-10-05',
'2020-12-11',
'2019-12-13',
'2020-11-13',
'1983-11-01',
'2016-12-16',
'2014-12-05',
'2018-11-01',
'2000-11-14',
'2016-10-28',
'2019-11-08',
'2020-10-15',
'2017-10-06',
'2020-12-18',
'2021-08-06',
'2012-11-09',
'2014-10-10',
'2020-12-11',
'1970-10-15',
'2002',
'1990',
'2020-12-11',
'2021-08-06',
'2020-12-07',
'1959-01-01',
'1981',
'2013-12-02',
'2021-08-06',
'2020-10-30',
'2015-01-09',
'1959-01-01',
'1957-09',
'1957-09',
'2020-12-04',
'2006-01-01',
'2017-12-06',
'1990-04-03',
'2014-12-09',
'2019-12-06',
'2018-11-15',
'1998-10-30',
'2020-12-11',
'1994',
'2020-10-30',
'2020-12-11',
'2020-12-11',
'2020-12-11',
'2020-12-11',
'2020-12-11',
'2020-12-11',
'2020-12-11',
'2020-12-11',
'2020-12-11',
'2020-12-11',
'2020-12-11',
'2020-12-11',
'2020-12-11',
'2020-12-11',
'2020-12-11',
'2020-12-11',
'2020-11-27',
'2020-12-11',
'2020-12-11',
'2020-07-24',
'2020-12-02',
'2020-12-04',
'2020-12-04',
'2020-06-24',
'2020-11-27',
'2020-12-03',
'2020-11-27',
'2020-11-20',
'2020-07-24',
'2020-02-29',
'2020-11-27',
'2020-07-24',
'2020-12-04',
'2020-11-06',
'2020-11-27',
'2020-11-27',
'2020-11-27',
'2020-11-27',
'2020-11-27',
'2020-11-27',
'2020-11-27',
'1980-05-16',
'2020-11-20',
'2020-11-20',
'2020-11-20',
'2020-11-20',
'2020-09-29',
'2020-11-27',
'2020-11-27',
'2020-05-09',
'2020-11-20',
'2020-12-18',
'2020-08-21',
'2020-11-13',
'2020-12-18',
'2020-01-24',
'2020-12-18',
'2020-06-12',
'2020-11-06',
'2020-07-10',
'2020-10-02',
'2020-10-02',
'2013-03-15',
'2020-12-18',
'2020-10-02',
'2019-03-22',
'2019-10-04',
'2020-03-27',
'2020-05-01',
'2020-04-21',
'2020-11-13',
'2020-07-10',
'2020-10-28',
'2020-09-18',
'2020-11-13',
'2020-03-13',
'2020-10-30',
'2020-10-23',
'2020-11-06',
'2020-09-15',
'2020-11-13',
'2006-01-01',
'2020-07-16',
'2020-10-30',
'2019-10-04',
'2020-05-29',
'2018-10-08',
'2020-12-02',
'2020-10-23',
'2018-10-26',
'2020-09-11',
'2021-07-30',
'2020-08-14',
'2020-10-30',
'2020-10-30',
'2020-10-30',
'2020-11-06',
'2020-08-20',
'2020-08-21',
'2020-08-17',
'2020-10-02',
'2020-07-03',
'2020-10-30',
'2020-10-30',
'2019-06-28',
'2020-03-26',
'2020-03-13',
'2021-08-09',
'2020-10-30',
'2020-10-30',
'2020-10-30',
'2017-09-27',
'2019-01-11',
'2020-10-29',
'2020-10-30',
'1979-07-27',
'2020-09-18',
'2020-08-21',
'2020-09-04',
'2020-09-25',
'2020-10-02',
'2020-09-04',
'2019-02-08',
'2020-04-10',
'2020-10-13',
'2021-08-13',
'2020-04-24',
'2020-07-23',
'2013-02-17',
'2015-08-14',
'2020-05-29',
'2011-01-24',
'2019-11-08',
'2020-10-09',
'2020-10-02',
'2020-10-16',
'2020-08-15',
'2020-10-02',
'2020-10-09',
'2020-05-28',
'2020-06-10',
'2020-02-29',
'2021-03-19',
'2020-05-22',
'2020-03-26',
'2020-05-15',
'2019-05-17',
'2021-08-03',
'2019-11-15',
'2020-10-02',
'2020-09-25',
'2020-09-25',
'2020-06-30',
'2020-06-26',
'2019-06-28',
'2020-10-02',
'2020-10-02',
'2020-10-02',
'2020-10-02',
'2020-10-02',
'2020-10-02',
'2020-10-02',
'2020-09-30',
'2020-09-24',
'2020-10-02',
'2020-08-20',
'2020-10-02',
'2020-10-02',
'2020-04-16',
'2020-09-25',
'2020-08-06',
'2020-09-25',
'2020-09-25',
'2020-09-25',
'2020-09-25',
'2019-06-14',
'2019-06-21',
'2020-06-17',
'2020-09-25',
'2020-09-25',
'2020-08-28',
'2020-07-10',
'2020-09-25',
'2020-07-10',
'2020-06-24',
'2012',
'2020-08-29',
'1978-11-23',
'2020-04-24',
'2020-05-15',
'2021-04-28',
'2021-04-28',
'2020-07-24',
'2020-07-10',
'2020-07-10',
'2020-09-18',
'2020-05-01',
'2020-05-08',
'2020-03-27',
'2020-09-18',
'2020-09-18',
'2020-09-18',
'2020-09-04',
'2020-09-04',
'2021-06-11',
'2020-09-10',
'2020-09-10',
'2019-11-08',
'2020-05-01',
'2020-09-10',
'2009-01-01',
'2020-01-23',
'2020-08-20',
'2020-01-10',
'2020-04-29',
'2020-09-04',
'2020-08-28',
'2020-08-17',
'2020-08-22',
'2020-04-23',
'2020-06-19',
'2020-08-27',
'2020-06-26',
'2019-08-28',
'2020-07-24',
'2019-10-12',
'2020-08-07',
'2020-07-17',
'2021-04-23',
'2020-05-15',
'2020-08-27',
'2019-12-27',
'2020-07-16',
'2020-05-29',
'2020-05-28',
'2020-08-07',
'2019-10-04',
'2020-05-01',
'2020-08-14',
'2020-09-17',
'2020-02-21',
'2020-07-30',
'2020-07-10',
'2021-08-09',
'2020-08-21',
'2020-07-24',
'2020-08-19',
'2019-06-28',
'2020-07-14',
'2008-05-12',
'2018-06-15',
'2020-07-24',
'2020-08-13',
'2020-07-24',
'2020-04-03',
'2020-05-15',
'2020-03-04',
'2020-07-20',
'2020-03-27',
'2019-03-15',
'2020-07-03',
'2020-11-13',
'2020-05-08',
'2020-07-24',
'2019-06-28',
'2020-02-07',
'2020-03-19',
'2019-09-06',
'2020-07-24',
'2020-07-24',
'2020-07-24',
'2020-07-24',
'2020-08-04',
'2020-07-24',
'2020-08-07',
'2020-08-02',
'2020-07-31',
'2020-04-24',
'2020-03-06',
'2020-07-24',
'2020-07-24',
'2020-07-24',
'2020-09-25',
'2020-07-31',
'2020-09-10',
'2020-07-31',
'2020-07-24',
'2020-07-10',
'2020-07-30',
'2020-03-19',
'2020-05-15',
'2020-07-09',
'2020-10-22',
'2020-01-09',
'2021-08-13',
'2020-07-22',
'2020-07-23',
'2020-07-24',
'2020-07-10',
'2012-05-25',
'2020-10-02',
'2020-07-10',
'2020-07-10',
'2020-07-10',
'2020-01-12',
'2020-06-12',
'2013-11-25',
'2020-07-24',
'2020-07-10',
'2020-07-10',
'2020-07-10',
'2020-07-10',
...]
badlist
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
Function practice¶
Write a function
f(s)
which takes as input a string of the form “????-04-??” and returns “April”, a string of the form “????-05-??” and returns “May”, and for all other input strings returns “Neither”. (Hint. First check if the length of the string is incorrect. Try to make your code DRY.)Apply this function
f
to every entry in the “Release Date” column usingmap
.Put the resulting pandas Series in a new column called “SpringMonth”. This is an example of Feature Engineering.
def f(s):
sub = s[4:8]
if len(s) != 10:
return "Neither"
elif sub == "-04-":
return "April"
elif sub == "-05-":
return "May"
else:
return "Neither"
f("2022-04-13")
'April'
df["Release Date"].map(f)
0 Neither
1 Neither
2 May
3 Neither
4 Neither
...
1551 Neither
1552 Neither
1553 Neither
1554 Neither
1555 Neither
Name: Release Date, Length: 1556, dtype: object
Creating a new column like this is what’s called feature engineering. The hope is that the new column will be easier to work with. The new column appears all the way at the right of the DataFrame.
df["SpringMonth"] = df["Release Date"].map(f)
df
Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord | SpringMonth | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 8 | 2021-07-23--2021-07-30 | Beggin' | 48,633,449 | Måneskin | 3377762 | 3Wrjm47oTz2sjIgck11l5e | ['indie rock italiano', 'italian pop'] | ... | 0.8 | -4.808 | 0.0504 | 0.127 | 0.359 | 134.002 | 211560 | 0.589 | B | Neither |
1 | 2 | 2 | 3 | 2021-07-23--2021-07-30 | STAY (with Justin Bieber) | 47,248,719 | The Kid LAROI | 2230022 | 5HCyWlXZPP0y6Gqq8TgA20 | ['australian hip hop'] | ... | 0.764 | -5.484 | 0.0483 | 0.0383 | 0.103 | 169.928 | 141806 | 0.478 | C#/Db | Neither |
2 | 3 | 1 | 11 | 2021-06-25--2021-07-02 | good 4 u | 40,162,559 | Olivia Rodrigo | 6266514 | 4ZtFanR9U6ndgddUvNcjcG | ['pop'] | ... | 0.664 | -5.044 | 0.154 | 0.335 | 0.0849 | 166.928 | 178147 | 0.688 | A | May |
3 | 4 | 3 | 5 | 2021-07-02--2021-07-09 | Bad Habits | 37,799,456 | Ed Sheeran | 83293380 | 6PQ88X9TkUIAUIZJHW2upE | ['pop', 'uk pop'] | ... | 0.897 | -3.712 | 0.0348 | 0.0469 | 0.364 | 126.026 | 231041 | 0.591 | B | Neither |
4 | 5 | 5 | 1 | 2021-07-23--2021-07-30 | INDUSTRY BABY (feat. Jack Harlow) | 33,948,454 | Lil Nas X | 5473565 | 27NovPIUIRrOZoCHxABJwK | ['lgbtq+ hip hop', 'pop rap'] | ... | 0.704 | -7.409 | 0.0615 | 0.0203 | 0.0501 | 149.995 | 212000 | 0.894 | D#/Eb | Neither |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1551 | 1552 | 195 | 1 | 2019-12-27--2020-01-03 | New Rules | 4,630,675 | Dua Lipa | 27167675 | 2ekn2ttSfGqwhhate0LSR0 | ['dance pop', 'pop', 'uk pop'] | ... | 0.7 | -6.021 | 0.0694 | 0.00261 | 0.153 | 116.073 | 209320 | 0.608 | A | Neither |
1552 | 1553 | 196 | 1 | 2019-12-27--2020-01-03 | Cheirosa - Ao Vivo | 4,623,030 | Jorge & Mateus | 15019109 | 2PWjKmjyTZeDpmOUa3a5da | ['sertanejo', 'sertanejo universitario'] | ... | 0.87 | -3.123 | 0.0851 | 0.24 | 0.333 | 152.37 | 181930 | 0.714 | B | Neither |
1553 | 1554 | 197 | 1 | 2019-12-27--2020-01-03 | Havana (feat. Young Thug) | 4,620,876 | Camila Cabello | 22698747 | 1rfofaqEpACxVEHIZBJe6W | ['dance pop', 'electropop', 'pop', 'post-teen ... | ... | 0.523 | -4.333 | 0.03 | 0.184 | 0.132 | 104.988 | 217307 | 0.394 | D | Neither |
1554 | 1555 | 198 | 1 | 2019-12-27--2020-01-03 | Surtada - Remix Brega Funk | 4,607,385 | Dadá Boladão, Tati Zaqui, OIK | 208630 | 5F8ffc8KWKNawllr5WsW0r | ['brega funk', 'funk carioca'] | ... | 0.55 | -7.026 | 0.0587 | 0.249 | 0.182 | 154.064 | 152784 | 0.881 | F | Neither |
1555 | 1556 | 199 | 1 | 2019-12-27--2020-01-03 | Lover (Remix) [feat. Shawn Mendes] | 4,595,450 | Taylor Swift | 42227614 | 3i9UVldZOE0aD0JnyfAZZ0 | ['pop', 'post-teen pop'] | ... | 0.603 | -7.176 | 0.064 | 0.433 | 0.0862 | 205.272 | 221307 | 0.422 | G | Neither |
1556 rows × 24 columns
Release months¶
Import the Spotify dataset again, this time specifying the missing values using the
na_values
keyword argument.Drop the rows which have missing data. A quick way to do this is using
df = df.dropna()
.Check:
df
should have 1545 rows and 23 columns.
df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
df = df.dropna()
df.shape
(1545, 23)
Convert the “Release Date” column to datetime values, using
pd.to_datetime
. Put the result in the same column (i.e., overwrite the current column).Check: if you evaluate
df.dtypes
, you should see that the “Release Date” column hasdtype
asdatetime64[ns]
.
df["Release Date"] = pd.to_datetime(df["Release Date"])
If
s
is a pandas Series with some datetime as itsdtype
, you can get the corresponding month by using the codes.dt.month_name()
. Use this method to insert a new column indf
called “Release Month”.
df["Release Date"].dt.month_name()
0 December
1 July
2 May
3 June
4 July
...
1551 June
1552 October
1553 January
1554 September
1555 November
Name: Release Date, Length: 1545, dtype: object