Week 0 Friday

Starting in Week 1, we will have more specific topics. Today, we went through a variety of basic topics, especially related to the type of an object in Python.

We start by importing pandas. The pandas library is the most important library for Math 10. (The second-most important library is probably scikit-learn, which we will use extensively in the Machine Learning portion of Math 10.)

In theory, you could give pandas an abbreviation other than pd, or not use any abbreviation at all, but in practice, everyone uses pd, and we will also always use pd.

import pandas as pd

Because of how we wrote our import statement, pd is now defined but pandas is not. Evaluating pandas by itself raises an error.

pandas
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In [4], line 1
----> 1 pandas

NameError: name 'pandas' is not defined
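
Here is a minimal sketch of the same behavior that does not need pandas. It uses the standard-library fractions module as a stand-in, and the alias fr is chosen only for illustration. Only the name after "as" gets defined.

```python
import fractions as fr  # same pattern as `import pandas as pd`

# The alias works like the module itself.
alias_result = fr.Fraction(1, 2) + fr.Fraction(1, 4)

# The full module name was never bound, so using it raises NameError.
try:
    fractions
    full_name_defined = True
except NameError:
    full_name_defined = False

print(alias_result)       # 3/4
print(full_name_defined)  # False
```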

I use the terms “module” and “library” more or less interchangeably. I usually refer to pandas as a “library”; here Python is referring to it as a “module”.

pd
<module 'pandas' from '/shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/__init__.py'>

One of the most important concepts at the beginning of Math 10 is the concept of the type of an object in Python. Different types of objects have different functionality associated with them. To use the read_csv function defined by pandas, we need to pass the file name as a string. Here we forgot the quotation marks (i.e., we forgot to turn vend.csv into a string), so Python tries to look up a variable named vend, and that's why we get an error.

pd.read_csv(vend.csv)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In [6], line 1
----> 1 pd.read_csv(vend.csv)

NameError: name 'vend' is not defined

Instead of using vend.csv as our argument, we use "vend.csv". This "vend.csv" is a string.

pd.read_csv("vend.csv")
Status Device ID Location Machine Product Category Transaction TransDate Type RCoil RPrice RQty MCoil MPrice MQty LineTotal TransTotal Prcd Date
0 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Red Bull - Energy Drink - Sugar Free Carbonated 14515778905 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5 1 3.5 3.5 1/1/2022
1 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Red Bull - Energy Drink - Sugar Free Carbonated 14516018629 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5 1 3.5 5.0 1/1/2022
2 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Takis - Hot Chilli Pepper & Lime Food 14516018629 Saturday, January 1, 2022 Credit 123 1.5 1 123 1.5 1 1.5 5.0 1/1/2022
3 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Takis - Hot Chilli Pepper & Lime Food 14516020373 Saturday, January 1, 2022 Credit 123 1.5 1 123 1.5 1 1.5 1.5 1/1/2022
4 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Red Bull - Energy Drink - Sugar Free Carbonated 14516021756 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5 1 3.5 3.5 1/1/2022
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6440 Processed VJ300320692 EB Public Library EB Public Library x1380 Lindens - Chocolate Chippers Food 15603201222 Wednesday, August 31, 2022 Credit 122 2.0 1 122 2.0 1 2.0 6.0 8/31/2022
6441 Processed VJ300320692 EB Public Library EB Public Library x1380 Wonderful Pistachios - Variety Food 15603201222 Wednesday, August 31, 2022 Credit 131 2.0 1 131 2.0 1 2.0 6.0 8/31/2022
6442 Processed VJ300320692 EB Public Library EB Public Library x1380 Hungry Buddha - Chocolate Chip Food 15603201222 Wednesday, August 31, 2022 Credit 137 2.0 1 137 2.0 1 2.0 6.0 8/31/2022
6443 Processed VJ300320609 GuttenPlans GuttenPlans x1367 Snapple Tea - Lemon Non Carbonated 15603853105 Wednesday, August 31, 2022 Credit 145 2.5 1 145 2.5 1 2.5 2.5 8/31/2022
6444 Processed VJ300320692 EB Public Library EB Public Library x1380 Goldfish Baked - Cheddar Food 15603921383 Wednesday, August 31, 2022 Cash 125 1.5 1 125 1.5 1 1.5 1.5 8/31/2022

6445 rows × 18 columns
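
To see why the quotation marks matter, here is a small sketch that reproduces the NameError without pandas. The variable names error_message and filename are just for illustration.

```python
# Without quotation marks, Python reads vend.csv as
# "the csv attribute of a variable named vend".
try:
    vend.csv
    error_message = None
except NameError as err:
    error_message = str(err)

print(error_message)  # name 'vend' is not defined

# With quotation marks, we get a string, which is what read_csv expects.
filename = "vend.csv"
print(type(filename))  # <class 'str'>
```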

To access the contents of this dataset later, we should store it in a variable. A good default name is df.

df = pd.read_csv("vend.csv")

In the worksheet from yesterday, we only did a few things with this dataset. One thing we did was to look at its first 10 rows using the head method.

df.head(10)
Status Device ID Location Machine Product Category Transaction TransDate Type RCoil RPrice RQty MCoil MPrice MQty LineTotal TransTotal Prcd Date
0 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Red Bull - Energy Drink - Sugar Free Carbonated 14515778905 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5 1 3.5 3.5 1/1/2022
1 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Red Bull - Energy Drink - Sugar Free Carbonated 14516018629 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5 1 3.5 5.0 1/1/2022
2 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Takis - Hot Chilli Pepper & Lime Food 14516018629 Saturday, January 1, 2022 Credit 123 1.5 1 123 1.5 1 1.5 5.0 1/1/2022
3 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Takis - Hot Chilli Pepper & Lime Food 14516020373 Saturday, January 1, 2022 Credit 123 1.5 1 123 1.5 1 1.5 1.5 1/1/2022
4 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Red Bull - Energy Drink - Sugar Free Carbonated 14516021756 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5 1 3.5 3.5 1/1/2022
5 Processed VJ300205292 Brunswick Sq Mall BSQ Mall x1364 - Zales Bai Antioxidant - Brasilia BB Non Carbonated 14517568743 Sunday, January 2, 2022 Cash 146 2.5 1 146 2.5 1 2.5 2.5 1/2/2022
6 Processed VJ300205292 Brunswick Sq Mall BSQ Mall x1364 - Zales Miss Vickie's Potato Chip - Sea Salt & Vinega Food 14518731524 Monday, January 3, 2022 Cash 114 1.5 1 114 1.5 1 1.5 1.5 1/2/2022
7 Processed VJ300320686 Earle Asphalt Earle Asphalt x1371 Miss Vickie's Potato Chip - Lime & Cracked Pe Food 14519162059 Monday, January 3, 2022 Credit 110 1.5 1 110 1.5 1 1.5 1.5 1/3/2022
8 Processed VJ300320609 GuttenPlans GuttenPlans x1367 Monster Energy Original Carbonated 14519670154 Monday, January 3, 2022 Credit 144 3.0 1 144 3.0 1 3.0 3.0 1/3/2022
9 Processed VJ300320686 Earle Asphalt Earle Asphalt x1371 Seapoint Farms Dry Roasted Edamame - Wasabi Food 14520315330 Monday, January 3, 2022 Credit 134 2.5 1 134 2.5 1 2.5 2.5 1/3/2022
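
If you do not have vend.csv available, the head method can be tried on any DataFrame. The following sketch uses a made-up 12-row DataFrame (the name small_df and its contents are assumptions, not part of the vending machine data).

```python
import pandas as pd

# A made-up stand-in for the vending machine data (12 rows, 2 columns).
small_df = pd.DataFrame({
    "Product": [f"item{i}" for i in range(12)],
    "RPrice": [1.5, 2.0, 2.5, 3.0, 3.5, 1.5, 2.0, 2.5, 3.0, 3.5, 1.5, 2.0],
})

first_ten = small_df.head(10)   # the first 10 rows
default_head = small_df.head()  # with no argument, head returns 5 rows

print(len(first_ten))     # 10
print(len(default_head))  # 5
```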

Here is an example of how different data types have different functionality. We can check an object's data type by using the built-in Python function type. (Python has relatively few built-in functions, definitely many fewer than Mathematica, for example. Often the functions we use in Math 10 will come from an external library. For example, the pd.read_csv function is a function defined in the pandas library.)

name = "Chris"
type(name)
str

The data type of df is a type defined in pandas. The following says pandas.core.frame.DataFrame. I usually ignore the middle terms and only focus on the first and last term. The first term is telling us that this is defined in pandas, and the last term is telling us that the type of df is DataFrame.

type(df)
pandas.core.frame.DataFrame
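
The "first term" and "last term" can also be read off directly from a type. As a sketch, here is the same idea using Fraction from the standard-library fractions module as a stand-in for a type defined in a library (the variable names are just for illustration).

```python
from fractions import Fraction

half = Fraction(1, 2)
print(type(half))  # <class 'fractions.Fraction'>

# Where the type is defined.
module_name = type(half).__module__
# The name of the type itself.
type_name = type(half).__name__
print(module_name, type_name)  # fractions Fraction

# Built-in types have names too.
builtin_names = [type(3).__name__, type(3.5).__name__, type([1, 2]).__name__]
print(builtin_names)  # ['int', 'float', 'list']
```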

As mentioned above, different types of objects have different functionality. The following is the list of all the attributes and methods of a string. (Technically methods are themselves a type of attribute.) We’ll see examples of how to use these attributes and methods soon. I recommend mostly ignoring the ones that begin with two underscores, like __add__; those are mostly just used in the background by Python.

dir(name)
['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'removeprefix',
 'removesuffix',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']
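
If the names beginning with two underscores are distracting, one possible way to hide them is sketched below. It uses a list comprehension, which we have not covered yet, so treat it as a preview rather than something you need to understand now.

```python
name = "Chris"

# Keep only the names that do not begin with two underscores.
public_names = [attr for attr in dir(name) if not attr.startswith("__")]

print(public_names[:5])
print("upper" in public_names)    # True
print("__add__" in public_names)  # False
```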

In retrospect, the following was a bad example, because "Chris" was already capitalized.

# using the capitalize method
name.capitalize()
'Chris'
name
'Chris'

Here is a better example. We call the upper method of name to convert the string to all upper-case letters.

name.upper()
'CHRIS'
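
Here is a short sketch of what the capitalize example would have looked like with a lower-case string. Notice that neither method changes the original string; each one returns a new string.

```python
name = "chris"  # lower-case this time, so capitalize has a visible effect

capitalized = name.capitalize()
shouted = name.upper()

print(capitalized)  # Chris
print(shouted)      # CHRIS
print(name)         # chris (the original string is unchanged)
```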

If we try to call the upper method on df, we get an error, because DataFrames do not have an upper method. This is an example of how knowing the data type that you’re working with is important, because once you know what the data type is, you also know the special functionality you can access.

df.upper()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In [17], line 1
----> 1 df.upper()

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/generic.py:5465, in NDFrame.__getattr__(self, name)
   5463 if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5464     return self[name]
-> 5465 return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'upper'
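
The same kind of error happens with any object that lacks the attribute. As a sketch, here a list stands in for the DataFrame, and the built-in function hasattr lets us check for a method before calling it.

```python
name = "Chris"
numbers = [3, 1, 2]  # a list, standing in for "some object that is not a string"

print(hasattr(name, "upper"))     # True
print(hasattr(numbers, "upper"))  # False

try:
    numbers.upper()
    error_message = None
except AttributeError as err:
    error_message = str(err)

print(error_message)  # 'list' object has no attribute 'upper'
```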

Here is an example of an attribute, as opposed to a method. Methods are like functions, and attributes are like variables. That’s not a perfect description; you will get more familiar with attributes and methods, and get more intuition about whether something is a method or attribute, as you work more in Python. This particular attribute records the number of rows and the number of columns of df.

# attribute, not a method
df.shape
(6445, 18)
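
To make the attribute versus method distinction concrete without pandas, here is a small sketch using Python's built-in complex numbers, which have both. The built-in function callable reports whether something can be called with parentheses.

```python
z = 3 + 4j

# real is an attribute: we access it without parentheses.
real_part = z.real
print(real_part)  # 3.0

# conjugate is a method: we call it with parentheses.
conj = z.conjugate()
print(conj)  # (3-4j)

print(callable(z.conjugate))  # True
print(callable(z.real))       # False
```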

The value corresponding to df.shape is what is called a tuple. At first glance, tuples are very similar to lists (which showed up in Worksheet 0). We will discuss some differences later.

type(df.shape)
tuple
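
As a quick sketch of how to work with such a tuple, here the value of df.shape from above is typed in by hand so the cell runs on its own.

```python
shape = (6445, 18)  # the value of df.shape shown above

# Access the entries by position, just like with a list.
print(shape[0])  # 6445
print(shape[1])  # 18

# Or unpack both entries at once.
n_rows, n_cols = shape
print(n_rows, n_cols)  # 6445 18
```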

Here is another very useful attribute of a DataFrame, the columns attribute. This tells us all the different column names. (I’m being careful to not say it’s a list of the column names, because it is not a list, nor is it a tuple…)

df.columns
Index(['Status', 'Device ID', 'Location', 'Machine', 'Product', 'Category',
       'Transaction', 'TransDate', 'Type', 'RCoil', 'RPrice', 'RQty', 'MCoil',
       'MPrice', 'MQty', 'LineTotal', 'TransTotal', 'Prcd Date'],
      dtype='object')
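
Here is a sketch of checking what df.columns is, and of converting it to a list when a list is really needed. The DataFrame small_df is made up for illustration and only borrows three of the column names from above.

```python
import pandas as pd

# A small made-up DataFrame with three of the column names from above.
small_df = pd.DataFrame({
    "Location": ["GuttenPlans", "Earle Asphalt"],
    "RPrice": [2.5, 1.5],
    "RQty": [1, 1],
})

cols = small_df.columns
print(isinstance(cols, pd.Index))  # True
print(isinstance(cols, list))      # False

# If an actual list is needed, convert explicitly.
col_list = list(cols)
print(col_list)  # ['Location', 'RPrice', 'RQty']
```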

To access a specific column from a pandas DataFrame, we can use the following syntax.

df["RPrice"]
0       3.5
1       3.5
2       1.5
3       1.5
4       3.5
       ... 
6440    2.0
6441    2.0
6442    2.0
6443    2.5
6444    1.5
Name: RPrice, Length: 6445, dtype: float64

Columns in pandas DataFrames are represented by the type Series. The two most important data types in pandas are DataFrames and Series.

type(df["RPrice"])
pandas.core.series.Series
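
The relationship between the two types can be checked with the built-in function isinstance. This is only a sketch, using a made-up three-row DataFrame in place of the vending machine data.

```python
import pandas as pd

small_df = pd.DataFrame({
    "Product": ["Red Bull", "Takis", "Snapple Tea"],
    "RPrice": [3.5, 1.5, 2.5],
})

prices = small_df["RPrice"]

print(isinstance(small_df, pd.DataFrame))  # True
print(isinstance(prices, pd.Series))       # True
print(prices.name)  # RPrice
print(len(prices))  # 3
```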

Indexing in Python starts at 0, so if we want to get the first element of something, like a pandas Series, I will usually call it the “zeroth” element instead of the “first” element. This syntax looks a little strange at first, but to get the zeroth element in the pandas Series df["RPrice"], we use .iloc[0].

df["RPrice"].iloc[0]
3.5
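
Here is a short sketch comparing zero-based indexing for a list and for a pandas Series. The values are made up; the point is only that position 0 is the start in both cases.

```python
import pandas as pd

price_list = [3.5, 1.5, 2.5]
price_series = pd.Series(price_list)

# Lists use square brackets directly.
zeroth_from_list = price_list[0]

# A pandas Series uses .iloc for access by position.
zeroth_from_series = price_series.iloc[0]
last_from_series = price_series.iloc[2]  # three elements, so positions 0, 1, 2

print(zeroth_from_list)    # 3.5
print(zeroth_from_series)  # 3.5
print(last_from_series)    # 2.5
```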

The type of the value df["RPrice"].iloc[0] is given as numpy.float64. NumPy is another Python library (like pandas), and because NumPy is a dependency of pandas, any system where pandas works should also have NumPy installed. This data type is defined in NumPy. The “float” in numpy.float64 is telling us that these are floating-point numbers, which can have a decimal part (as opposed to integers). The “64” is specifying how many bits each number takes up; the “64” won’t be very important for us in Math 10.

The following code is longer than what we were writing above. If it doesn’t make sense, try to break it up into separate pieces. In this case, to understand the full code, you should first understand df["RPrice"], then you should understand df["RPrice"].iloc[0], and once you understand that, the full code should make sense, type(df["RPrice"].iloc[0]).

type(df["RPrice"].iloc[0])
numpy.float64
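
In practice, a numpy.float64 can be used much like an ordinary Python float. The following sketch creates one directly with NumPy (the usual abbreviation np is used) and checks a few facts about it.

```python
import numpy as np

x = np.float64(3.5)

print(type(x))               # <class 'numpy.float64'>
print(isinstance(x, float))  # True, it also counts as a Python float

# The "64" refers to 64 bits, which is 8 bytes.
bytes_per_number = np.dtype("float64").itemsize
print(bytes_per_number)  # 8

total = x + 1
print(total)  # 4.5
```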

The list of attributes and methods of a pandas Series is quite a bit longer than the list for strings that we saw above.

dir(df["RPrice"])
['T',
 '_AXIS_LEN',
 '_AXIS_ORDERS',
 '_AXIS_REVERSED',
 '_AXIS_TO_AXIS_NUMBER',
 '_HANDLED_TYPES',
 '__abs__',
 '__add__',
 '__and__',
 '__annotations__',
 '__array__',
 '__array_priority__',
 '__array_ufunc__',
 '__array_wrap__',
 '__bool__',
 '__class__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__finalize__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__imod__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__long__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdivmod__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rfloordiv__',
 '__rmatmul__',
 '__rmod__',
 '__rmul__',
 '__ror__',
 '__round__',
 '__rpow__',
 '__rsub__',
 '__rtruediv__',
 '__rxor__',
 '__setattr__',
 '__setitem__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__weakref__',
 '__xor__',
 '_accessors',
 '_accum_func',
 '_add_numeric_operations',
 '_agg_by_level',
 '_agg_examples_doc',
 '_agg_see_also_doc',
 '_align_frame',
 '_align_series',
 '_arith_method',
 '_attrs',
 '_binop',
 '_builtin_table',
 '_cacher',
 '_can_hold_na',
 '_check_inplace_and_allows_duplicate_labels',
 '_check_inplace_setting',
 '_check_is_chained_assignment_possible',
 '_check_label_or_level_ambiguity',
 '_check_setitem_copy',
 '_clear_item_cache',
 '_clip_with_one_bound',
 '_clip_with_scalar',
 '_cmp_method',
 '_consolidate',
 '_consolidate_inplace',
 '_construct_axes_dict',
 '_construct_axes_from_arguments',
 '_construct_result',
 '_constructor',
 '_constructor_expanddim',
 '_constructor_sliced',
 '_convert',
 '_convert_dtypes',
 '_cython_table',
 '_data',
 '_dir_additions',
 '_dir_deletions',
 '_drop_axis',
 '_drop_labels_or_levels',
 '_find_valid_index',
 '_flags',
 '_get_axis',
 '_get_axis_name',
 '_get_axis_number',
 '_get_axis_resolvers',
 '_get_block_manager_axis',
 '_get_bool_data',
 '_get_cacher',
 '_get_cleaned_column_resolvers',
 '_get_cython_func',
 '_get_index_resolvers',
 '_get_item_cache',
 '_get_label_or_level_values',
 '_get_numeric_data',
 '_get_value',
 '_get_values',
 '_get_values_tuple',
 '_get_with',
 '_gotitem',
 '_hidden_attrs',
 '_index',
 '_indexed_same',
 '_info_axis',
 '_info_axis_name',
 '_info_axis_number',
 '_init_dict',
 '_init_mgr',
 '_inplace_method',
 '_internal_names',
 '_internal_names_set',
 '_is_builtin_func',
 '_is_cached',
 '_is_copy',
 '_is_label_or_level_reference',
 '_is_label_reference',
 '_is_level_reference',
 '_is_mixed_type',
 '_is_view',
 '_iset_item',
 '_item_cache',
 '_ix',
 '_ixs',
 '_logical_func',
 '_logical_method',
 '_map_values',
 '_maybe_cache_changed',
 '_maybe_update_cacher',
 '_metadata',
 '_mgr',
 '_min_count_stat_function',
 '_name',
 '_needs_reindex_multi',
 '_obj_with_exclusions',
 '_protect_consolidate',
 '_reduce',
 '_reindex_axes',
 '_reindex_indexer',
 '_reindex_multi',
 '_reindex_with_indexers',
 '_replace_single',
 '_repr_data_resource_',
 '_repr_latex_',
 '_reset_cache',
 '_reset_cacher',
 '_selected_obj',
 '_selection',
 '_selection_list',
 '_selection_name',
 '_set_as_cached',
 '_set_axis',
 '_set_axis_name',
 '_set_axis_nocheck',
 '_set_is_copy',
 '_set_item',
 '_set_labels',
 '_set_name',
 '_set_value',
 '_set_values',
 '_set_with',
 '_set_with_engine',
 '_slice',
 '_stat_axis',
 '_stat_axis_name',
 '_stat_axis_number',
 '_stat_function',
 '_stat_function_ddof',
 '_take_with_is_copy',
 '_to_dict_of_blocks',
 '_try_aggregate_string_function',
 '_typ',
 '_update_inplace',
 '_validate_dtype',
 '_values',
 '_where',
 'abs',
 'add',
 'add_prefix',
 'add_suffix',
 'agg',
 'aggregate',
 'align',
 'all',
 'any',
 'append',
 'apply',
 'argmax',
 'argmin',
 'argsort',
 'array',
 'asfreq',
 'asof',
 'astype',
 'at',
 'at_time',
 'attrs',
 'autocorr',
 'axes',
 'backfill',
 'between',
 'between_time',
 'bfill',
 'bool',
 'clip',
 'combine',
 'combine_first',
 'compare',
 'convert_dtypes',
 'copy',
 'corr',
 'count',
 'cov',
 'cummax',
 'cummin',
 'cumprod',
 'cumsum',
 'describe',
 'diff',
 'div',
 'divide',
 'divmod',
 'dot',
 'drop',
 'drop_duplicates',
 'droplevel',
 'dropna',
 'dtype',
 'dtypes',
 'duplicated',
 'empty',
 'eq',
 'equals',
 'ewm',
 'expanding',
 'explode',
 'factorize',
 'ffill',
 'fillna',
 'filter',
 'first',
 'first_valid_index',
 'flags',
 'floordiv',
 'ge',
 'get',
 'groupby',
 'gt',
 'hasnans',
 'head',
 'hist',
 'iat',
 'idxmax',
 'idxmin',
 'iloc',
 'index',
 'infer_objects',
 'interpolate',
 'is_monotonic',
 'is_monotonic_decreasing',
 'is_monotonic_increasing',
 'is_unique',
 'isin',
 'isna',
 'isnull',
 'item',
 'items',
 'iteritems',
 'keys',
 'kurt',
 'kurtosis',
 'last',
 'last_valid_index',
 'le',
 'loc',
 'lt',
 'mad',
 'map',
 'mask',
 'max',
 'mean',
 'median',
 'memory_usage',
 'min',
 'mod',
 'mode',
 'mul',
 'multiply',
 'name',
 'nbytes',
 'ndim',
 'ne',
 'nlargest',
 'notna',
 'notnull',
 'nsmallest',
 'nunique',
 'pad',
 'pct_change',
 'pipe',
 'plot',
 'pop',
 'pow',
 'prod',
 'product',
 'quantile',
 'radd',
 'rank',
 'ravel',
 'rdiv',
 'rdivmod',
 'reindex',
 'reindex_like',
 'rename',
 'rename_axis',
 'reorder_levels',
 'repeat',
 'replace',
 'resample',
 'reset_index',
 'rfloordiv',
 'rmod',
 'rmul',
 'rolling',
 'round',
 'rpow',
 'rsub',
 'rtruediv',
 'sample',
 'searchsorted',
 'sem',
 'set_axis',
 'set_flags',
 'shape',
 'shift',
 'size',
 'skew',
 'slice_shift',
 'sort_index',
 'sort_values',
 'squeeze',
 'std',
 'sub',
 'subtract',
 'sum',
 'swapaxes',
 'swaplevel',
 'tail',
 'take',
 'to_clipboard',
 'to_csv',
 'to_dict',
 'to_excel',
 'to_frame',
 'to_hdf',
 'to_json',
 'to_latex',
 'to_list',
 'to_markdown',
 'to_numpy',
 'to_period',
 'to_pickle',
 'to_sql',
 'to_string',
 'to_timestamp',
 'to_xarray',
 'transform',
 'transpose',
 'truediv',
 'truncate',
 'tz_convert',
 'tz_localize',
 'unique',
 'unstack',
 'update',
 'value_counts',
 'values',
 'var',
 'view',
 'where',
 'xs']

Notice that there is a dtype attribute. That gives us easier access to the data type of the contents of the pandas Series.

df["RPrice"].dtype
dtype('float64')
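
As a sketch of how dtype relates to the type of an individual entry, here is a made-up Series of three prices. The dtype describes the whole Series, while type describes a single value pulled out of it.

```python
import pandas as pd

prices = pd.Series([3.5, 1.5, 2.5])

series_dtype = prices.dtype
entry_type = type(prices.iloc[0])

print(series_dtype)               # float64
print(entry_type)                 # <class 'numpy.float64'>
print(series_dtype == "float64")  # True
```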