Week 0 Friday
Starting in Week 1, we will have more specific topics. Today, we went through a variety of basic topics, especially related to the type of an object in Python.
We start by importing pandas. The pandas library is the most important library for Math 10. (The second-most important library is probably scikit-learn, which we will use extensively in the Machine Learning portion of Math 10.)
In theory, you could give pandas an abbreviation other than pd, or not use any abbreviation at all, but in practice everyone uses pd, and we will also always use pd.
import pandas as pd
Just to emphasize: pd is now defined but pandas is not, because of how we wrote our import statement.
pandas
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In [4], line 1
----> 1 pandas
NameError: name 'pandas' is not defined
I use the terms “module” and “library” more or less interchangeably. I usually refer to pandas as a “library”; here Python refers to it as a “module”.
pd
<module 'pandas' from '/shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/__init__.py'>
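For completeness, here is roughly what the alternatives mentioned above would look like (we won’t use these in Math 10; also note that running the first line would make the name pandas defined after all).
import pandas  # no abbreviation: you would then write pandas.read_csv(...)
import pandas as p  # a legal but non-standard abbreviation: p.read_csv(...)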
One of the most important concepts at the beginning of Math 10 is the concept of the type of an object in Python. Different types of objects have different functionality associated with them. To use the read_csv function defined by pandas, we need to pass it an argument whose type is string. Here we forgot to use quotation marks (i.e., we forgot to turn vend.csv into a string), so that’s why we get an error.
pd.read_csv(vend.csv)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In [6], line 1
----> 1 pd.read_csv(vend.csv)
NameError: name 'vend' is not defined
Instead of using vend.csv as our argument, we use "vend.csv". This "vend.csv" is a string.
pd.read_csv("vend.csv")
 | Status | Device ID | Location | Machine | Product | Category | Transaction | TransDate | Type | RCoil | RPrice | RQty | MCoil | MPrice | MQty | LineTotal | TransTotal | Prcd Date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14515778905 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
1 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516018629 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 5.0 | 1/1/2022 |
2 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516018629 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 5.0 | 1/1/2022 |
3 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516020373 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 1.5 | 1/1/2022 |
4 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516021756 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6440 | Processed | VJ300320692 | EB Public Library | EB Public Library x1380 | Lindens - Chocolate Chippers | Food | 15603201222 | Wednesday, August 31, 2022 | Credit | 122 | 2.0 | 1 | 122 | 2.0 | 1 | 2.0 | 6.0 | 8/31/2022 |
6441 | Processed | VJ300320692 | EB Public Library | EB Public Library x1380 | Wonderful Pistachios - Variety | Food | 15603201222 | Wednesday, August 31, 2022 | Credit | 131 | 2.0 | 1 | 131 | 2.0 | 1 | 2.0 | 6.0 | 8/31/2022 |
6442 | Processed | VJ300320692 | EB Public Library | EB Public Library x1380 | Hungry Buddha - Chocolate Chip | Food | 15603201222 | Wednesday, August 31, 2022 | Credit | 137 | 2.0 | 1 | 137 | 2.0 | 1 | 2.0 | 6.0 | 8/31/2022 |
6443 | Processed | VJ300320609 | GuttenPlans | GuttenPlans x1367 | Snapple Tea - Lemon | Non Carbonated | 15603853105 | Wednesday, August 31, 2022 | Credit | 145 | 2.5 | 1 | 145 | 2.5 | 1 | 2.5 | 2.5 | 8/31/2022 |
6444 | Processed | VJ300320692 | EB Public Library | EB Public Library x1380 | Goldfish Baked - Cheddar | Food | 15603921383 | Wednesday, August 31, 2022 | Cash | 125 | 1.5 | 1 | 125 | 1.5 | 1 | 1.5 | 1.5 | 8/31/2022 |
6445 rows × 18 columns
To later access the contents of this dataset, we should store it in a variable. A good default choice of name is df.
df = pd.read_csv("vend.csv")
In the worksheet from yesterday, we only did a few things with this dataset. One thing we did was to look at its first 10 rows using the head method.
df.head(10)
 | Status | Device ID | Location | Machine | Product | Category | Transaction | TransDate | Type | RCoil | RPrice | RQty | MCoil | MPrice | MQty | LineTotal | TransTotal | Prcd Date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14515778905 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
1 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516018629 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 5.0 | 1/1/2022 |
2 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516018629 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 5.0 | 1/1/2022 |
3 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516020373 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 1.5 | 1/1/2022 |
4 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516021756 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
5 | Processed | VJ300205292 | Brunswick Sq Mall | BSQ Mall x1364 - Zales | Bai Antioxidant - Brasilia BB | Non Carbonated | 14517568743 | Sunday, January 2, 2022 | Cash | 146 | 2.5 | 1 | 146 | 2.5 | 1 | 2.5 | 2.5 | 1/2/2022 |
6 | Processed | VJ300205292 | Brunswick Sq Mall | BSQ Mall x1364 - Zales | Miss Vickie's Potato Chip - Sea Salt & Vinega | Food | 14518731524 | Monday, January 3, 2022 | Cash | 114 | 1.5 | 1 | 114 | 1.5 | 1 | 1.5 | 1.5 | 1/2/2022 |
7 | Processed | VJ300320686 | Earle Asphalt | Earle Asphalt x1371 | Miss Vickie's Potato Chip - Lime & Cracked Pe | Food | 14519162059 | Monday, January 3, 2022 | Credit | 110 | 1.5 | 1 | 110 | 1.5 | 1 | 1.5 | 1.5 | 1/3/2022 |
8 | Processed | VJ300320609 | GuttenPlans | GuttenPlans x1367 | Monster Energy Original | Carbonated | 14519670154 | Monday, January 3, 2022 | Credit | 144 | 3.0 | 1 | 144 | 3.0 | 1 | 3.0 | 3.0 | 1/3/2022 |
9 | Processed | VJ300320686 | Earle Asphalt | Earle Asphalt x1371 | Seapoint Farms Dry Roasted Edamame - Wasabi | Food | 14520315330 | Monday, January 3, 2022 | Credit | 134 | 2.5 | 1 | 134 | 2.5 | 1 | 2.5 | 2.5 | 1/3/2022 |
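As a quick aside (not something we did in class), head can also be called with no argument, in which case it shows the first five rows, and there is an analogous tail method for viewing the last rows.
df.head()   # the first 5 rows (5 is the default)
df.tail(3)  # the last 3 rows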
Here is an example of how different data types have different functionality. We can check an element’s data type by using the built-in Python function type. (Python has relatively few built-in functions, far fewer than Mathematica, for example. Often the functions we use in Math 10 will come from an external library; for example, the pd.read_csv function is defined in the pandas library.)
name = "Chris"
type(name)
str
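The same built-in type function works on any Python object. Here is a quick sketch (these particular calls were not in the lecture).
type(3)        # int
type(3.5)      # float
type([1, 2])   # list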
The data type of df is a type defined in pandas. The following says pandas.core.frame.DataFrame. I usually ignore the middle terms and only focus on the first and last terms. The first term is telling us that this is defined in pandas, and the last term is telling us that the type of df is DataFrame.
type(df)
pandas.core.frame.DataFrame
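Relatedly, if you just want a True/False answer to the question “is this object a DataFrame?”, the built-in isinstance function works as well (a small sketch, not from the lecture).
isinstance(df, pd.DataFrame)    # True
isinstance(name, pd.DataFrame)  # False; name is a string, not a DataFrame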
As mentioned above, different types of objects have different functionality. The following lists all of the attributes and methods of strings. (Technically, methods are themselves a type of attribute.) We’ll see examples of how to use these attributes and methods soon. I recommend mostly ignoring the ones that begin with two underscores, like __add__; those are mostly just used in the background by Python.
dir(name)
['__add__',
'__class__',
'__contains__',
'__delattr__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__getitem__',
'__getnewargs__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__iter__',
'__le__',
'__len__',
'__lt__',
'__mod__',
'__mul__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__rmod__',
'__rmul__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'capitalize',
'casefold',
'center',
'count',
'encode',
'endswith',
'expandtabs',
'find',
'format',
'format_map',
'index',
'isalnum',
'isalpha',
'isascii',
'isdecimal',
'isdigit',
'isidentifier',
'islower',
'isnumeric',
'isprintable',
'isspace',
'istitle',
'isupper',
'join',
'ljust',
'lower',
'lstrip',
'maketrans',
'partition',
'removeprefix',
'removesuffix',
'replace',
'rfind',
'rindex',
'rjust',
'rpartition',
'rsplit',
'rstrip',
'split',
'splitlines',
'startswith',
'strip',
'swapcase',
'title',
'translate',
'upper',
'zfill']
In retrospect, the following was a bad example, because "Chris" was already capitalized.
# using the capitalize method
name.capitalize()
'Chris'
name
'Chris'
Here is a better example. We call the upper method of name to convert the string to all upper-case letters.
name.upper()
'CHRIS'
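Many of the other string methods listed by dir above are used the same way; some of them take arguments. A quick sketch (these particular calls were not in the lecture):
name.startswith("Ch")    # True
name.replace("Ch", "K")  # 'Kris'; note that name itself is unchanged
name.count("i")          # 1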
If we try to call the upper method on df, we get an error, because DataFrames do not have an upper method. This is an example of why knowing the data type you’re working with is important: once you know the data type, you also know what special functionality you can access.
df.upper()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In [17], line 1
----> 1 df.upper()
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/generic.py:5465, in NDFrame.__getattr__(self, name)
5463 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5464 return self[name]
-> 5465 return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'upper'
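If you want to check in advance whether an object has a particular attribute or method, one option is the built-in hasattr function (a quick sketch).
hasattr(name, "upper")  # True: strings have an upper method
hasattr(df, "upper")    # False: DataFrames do not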
Here is an example of an attribute, as opposed to a method. Methods are like functions, and attributes are like variables. That’s not a perfect description; you will get more familiar with attributes and methods, and more intuition about whether something is a method or an attribute, as you work more in Python. This particular attribute records the number of rows and the number of columns of df.
# attribute, not a method
df.shape
(6445, 18)
The value corresponding to df.shape is what is called a tuple. At first glance, tuples are very similar to lists (which showed up in Worksheet 0). We will discuss some differences later.
type(df.shape)
tuple
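Because df.shape is a tuple, we can get at the individual numbers by indexing, or by “unpacking” the tuple into two variables. A small sketch:
df.shape[0]  # the number of rows, 6445
df.shape[1]  # the number of columns, 18
num_rows, num_cols = df.shape  # tuple unpacking gives the same two numbers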
Here is another very useful attribute of a DataFrame, the columns attribute. This tells us all the different column names. (I’m being careful not to say it’s a list of the column names, because it is not a list, nor is it a tuple…)
df.columns
Index(['Status', 'Device ID', 'Location', 'Machine', 'Product', 'Category',
'Transaction', 'TransDate', 'Type', 'RCoil', 'RPrice', 'RQty', 'MCoil',
'MPrice', 'MQty', 'LineTotal', 'TransTotal', 'Prcd Date'],
dtype='object')
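If you do want an actual Python list of the column names, you can convert this object, for example as follows.
list(df.columns)       # a Python list of the column names
# df.columns.tolist()  # an equivalent alternative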
To access a specific column from a pandas DataFrame, we can use the following syntax.
df["RPrice"]
0 3.5
1 3.5
2 1.5
3 1.5
4 3.5
...
6440 2.0
6441 2.0
6442 2.0
6443 2.5
6444 1.5
Name: RPrice, Length: 6445, dtype: float64
Columns in pandas DataFrames are represented by the type Series. The two most important data types in pandas are DataFrames and Series.
type(df["RPrice"])
pandas.core.series.Series
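The same bracket syntax works for any name appearing in df.columns, including names that contain spaces, such as "Device ID" (a quick example, not from the lecture).
df["Device ID"]  # another column of df; also a pandas Series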
Indexing in Python starts at 0, so if we want to get the first element of something, like a pandas Series, I will usually call it the “zeroth” element instead of the “first” element. This syntax looks a little strange at first, but to get the zeroth element in the pandas Series df["RPrice"], we use .iloc[0].
df["RPrice"].iloc[0]
3.5
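The same .iloc syntax works for other positions, and even for slices, as in the following sketch.
df["RPrice"].iloc[4]   # the element at position 4 (so the fifth element)
df["RPrice"].iloc[:3]  # a shorter Series containing positions 0, 1, 2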
The type of this value is given as numpy.float64. NumPy is another Python library (like pandas), and because NumPy is a dependency of pandas, any system where pandas works should also have NumPy installed. This data type is defined in NumPy. The “float” in numpy.float64 tells us that these are decimals (as opposed to integers). The “64” specifies how much space the numbers take up; the “64” won’t be very important for us in Math 10.
The following code is longer than what we were writing above. If it doesn’t make sense, try breaking it up into separate pieces. In this case, to understand the full code, you should first understand df["RPrice"], then df["RPrice"].iloc[0], and once you understand that, the full expression type(df["RPrice"].iloc[0]) should make sense.
type(df["RPrice"].iloc[0])
numpy.float64
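Here is the same computation broken into pieces, using intermediate variable names (the names col and entry are just for illustration).
col = df["RPrice"]   # a pandas Series
entry = col.iloc[0]  # the zeroth entry of that Series, 3.5
type(entry)          # numpy.float64, the same answer as above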
The list of attributes and methods of a pandas Series is quite a bit longer than the list for strings that we saw above.
dir(df["RPrice"])
['T',
'_AXIS_LEN',
'_AXIS_ORDERS',
'_AXIS_REVERSED',
'_AXIS_TO_AXIS_NUMBER',
'_HANDLED_TYPES',
'__abs__',
'__add__',
'__and__',
'__annotations__',
'__array__',
'__array_priority__',
'__array_ufunc__',
'__array_wrap__',
'__bool__',
'__class__',
'__contains__',
'__copy__',
'__deepcopy__',
'__delattr__',
'__delitem__',
'__dict__',
'__dir__',
'__divmod__',
'__doc__',
'__eq__',
'__finalize__',
'__float__',
'__floordiv__',
'__format__',
'__ge__',
'__getattr__',
'__getattribute__',
'__getitem__',
'__getstate__',
'__gt__',
'__hash__',
'__iadd__',
'__iand__',
'__ifloordiv__',
'__imod__',
'__imul__',
'__init__',
'__init_subclass__',
'__int__',
'__invert__',
'__ior__',
'__ipow__',
'__isub__',
'__iter__',
'__itruediv__',
'__ixor__',
'__le__',
'__len__',
'__long__',
'__lt__',
'__matmul__',
'__mod__',
'__module__',
'__mul__',
'__ne__',
'__neg__',
'__new__',
'__nonzero__',
'__or__',
'__pos__',
'__pow__',
'__radd__',
'__rand__',
'__rdivmod__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__rfloordiv__',
'__rmatmul__',
'__rmod__',
'__rmul__',
'__ror__',
'__round__',
'__rpow__',
'__rsub__',
'__rtruediv__',
'__rxor__',
'__setattr__',
'__setitem__',
'__setstate__',
'__sizeof__',
'__str__',
'__sub__',
'__subclasshook__',
'__truediv__',
'__weakref__',
'__xor__',
'_accessors',
'_accum_func',
'_add_numeric_operations',
'_agg_by_level',
'_agg_examples_doc',
'_agg_see_also_doc',
'_align_frame',
'_align_series',
'_arith_method',
'_attrs',
'_binop',
'_builtin_table',
'_cacher',
'_can_hold_na',
'_check_inplace_and_allows_duplicate_labels',
'_check_inplace_setting',
'_check_is_chained_assignment_possible',
'_check_label_or_level_ambiguity',
'_check_setitem_copy',
'_clear_item_cache',
'_clip_with_one_bound',
'_clip_with_scalar',
'_cmp_method',
'_consolidate',
'_consolidate_inplace',
'_construct_axes_dict',
'_construct_axes_from_arguments',
'_construct_result',
'_constructor',
'_constructor_expanddim',
'_constructor_sliced',
'_convert',
'_convert_dtypes',
'_cython_table',
'_data',
'_dir_additions',
'_dir_deletions',
'_drop_axis',
'_drop_labels_or_levels',
'_find_valid_index',
'_flags',
'_get_axis',
'_get_axis_name',
'_get_axis_number',
'_get_axis_resolvers',
'_get_block_manager_axis',
'_get_bool_data',
'_get_cacher',
'_get_cleaned_column_resolvers',
'_get_cython_func',
'_get_index_resolvers',
'_get_item_cache',
'_get_label_or_level_values',
'_get_numeric_data',
'_get_value',
'_get_values',
'_get_values_tuple',
'_get_with',
'_gotitem',
'_hidden_attrs',
'_index',
'_indexed_same',
'_info_axis',
'_info_axis_name',
'_info_axis_number',
'_init_dict',
'_init_mgr',
'_inplace_method',
'_internal_names',
'_internal_names_set',
'_is_builtin_func',
'_is_cached',
'_is_copy',
'_is_label_or_level_reference',
'_is_label_reference',
'_is_level_reference',
'_is_mixed_type',
'_is_view',
'_iset_item',
'_item_cache',
'_ix',
'_ixs',
'_logical_func',
'_logical_method',
'_map_values',
'_maybe_cache_changed',
'_maybe_update_cacher',
'_metadata',
'_mgr',
'_min_count_stat_function',
'_name',
'_needs_reindex_multi',
'_obj_with_exclusions',
'_protect_consolidate',
'_reduce',
'_reindex_axes',
'_reindex_indexer',
'_reindex_multi',
'_reindex_with_indexers',
'_replace_single',
'_repr_data_resource_',
'_repr_latex_',
'_reset_cache',
'_reset_cacher',
'_selected_obj',
'_selection',
'_selection_list',
'_selection_name',
'_set_as_cached',
'_set_axis',
'_set_axis_name',
'_set_axis_nocheck',
'_set_is_copy',
'_set_item',
'_set_labels',
'_set_name',
'_set_value',
'_set_values',
'_set_with',
'_set_with_engine',
'_slice',
'_stat_axis',
'_stat_axis_name',
'_stat_axis_number',
'_stat_function',
'_stat_function_ddof',
'_take_with_is_copy',
'_to_dict_of_blocks',
'_try_aggregate_string_function',
'_typ',
'_update_inplace',
'_validate_dtype',
'_values',
'_where',
'abs',
'add',
'add_prefix',
'add_suffix',
'agg',
'aggregate',
'align',
'all',
'any',
'append',
'apply',
'argmax',
'argmin',
'argsort',
'array',
'asfreq',
'asof',
'astype',
'at',
'at_time',
'attrs',
'autocorr',
'axes',
'backfill',
'between',
'between_time',
'bfill',
'bool',
'clip',
'combine',
'combine_first',
'compare',
'convert_dtypes',
'copy',
'corr',
'count',
'cov',
'cummax',
'cummin',
'cumprod',
'cumsum',
'describe',
'diff',
'div',
'divide',
'divmod',
'dot',
'drop',
'drop_duplicates',
'droplevel',
'dropna',
'dtype',
'dtypes',
'duplicated',
'empty',
'eq',
'equals',
'ewm',
'expanding',
'explode',
'factorize',
'ffill',
'fillna',
'filter',
'first',
'first_valid_index',
'flags',
'floordiv',
'ge',
'get',
'groupby',
'gt',
'hasnans',
'head',
'hist',
'iat',
'idxmax',
'idxmin',
'iloc',
'index',
'infer_objects',
'interpolate',
'is_monotonic',
'is_monotonic_decreasing',
'is_monotonic_increasing',
'is_unique',
'isin',
'isna',
'isnull',
'item',
'items',
'iteritems',
'keys',
'kurt',
'kurtosis',
'last',
'last_valid_index',
'le',
'loc',
'lt',
'mad',
'map',
'mask',
'max',
'mean',
'median',
'memory_usage',
'min',
'mod',
'mode',
'mul',
'multiply',
'name',
'nbytes',
'ndim',
'ne',
'nlargest',
'notna',
'notnull',
'nsmallest',
'nunique',
'pad',
'pct_change',
'pipe',
'plot',
'pop',
'pow',
'prod',
'product',
'quantile',
'radd',
'rank',
'ravel',
'rdiv',
'rdivmod',
'reindex',
'reindex_like',
'rename',
'rename_axis',
'reorder_levels',
'repeat',
'replace',
'resample',
'reset_index',
'rfloordiv',
'rmod',
'rmul',
'rolling',
'round',
'rpow',
'rsub',
'rtruediv',
'sample',
'searchsorted',
'sem',
'set_axis',
'set_flags',
'shape',
'shift',
'size',
'skew',
'slice_shift',
'sort_index',
'sort_values',
'squeeze',
'std',
'sub',
'subtract',
'sum',
'swapaxes',
'swaplevel',
'tail',
'take',
'to_clipboard',
'to_csv',
'to_dict',
'to_excel',
'to_frame',
'to_hdf',
'to_json',
'to_latex',
'to_list',
'to_markdown',
'to_numpy',
'to_period',
'to_pickle',
'to_sql',
'to_string',
'to_timestamp',
'to_xarray',
'transform',
'transpose',
'truediv',
'truncate',
'tz_convert',
'tz_localize',
'unique',
'unstack',
'update',
'value_counts',
'values',
'var',
'view',
'where',
'xs']
Notice that there is a dtype attribute. That gives us easier access to the data type of the contents of the pandas Series.
df["RPrice"].dtype
dtype('float64')
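For a whole DataFrame, there is a similar dtypes attribute (note the extra “s”) which reports the data type of every column at once.
df.dtypes  # a pandas Series, indexed by the column names, giving each column's dtype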