Week 0 Friday
Starting in Week 1, we will have more specific topics. Today, we went through a variety of basic topics, especially related to the type of an object in Python.
We start by importing pandas. The pandas library is the most important library for Math 10. (The second-most important library is probably scikit-learn, which we will use extensively in the Machine Learning portion of Math 10.)
In theory, you could give pandas an abbreviation other than pd, or not use any abbreviation at all, but in practice everyone uses pd, and we will also always use pd.
import pandas as pd
Just to emphasize: pd is now defined but pandas is not, because of how we wrote our import statement.
pandas
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In [4], line 1
----> 1 pandas
NameError: name 'pandas' is not defined
I use the terms “module” and “library” more or less interchangeably. I usually refer to pandas as a “library”; here Python refers to it as a “module”.
pd
<module 'pandas' from '/shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/__init__.py'>
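For completeness, here is roughly what the alternatives mentioned above would look like (we won’t use these in Math 10; also note that running the first line would make the name pandas defined after all).
import pandas  # no abbreviation: you would then write pandas.read_csv(...)
import pandas as p  # a legal but non-standard abbreviation: p.read_csv(...)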
One of the most important concepts at the beginning of Math 10 is the concept of the type of an object in Python. Different types of objects have different functionality associated with them. To use the read_csv function defined by pandas, we need to pass it an argument whose type is string. Here we forgot to use quotation marks (i.e., we forgot to turn vend.csv into a string), so that’s why we get an error.
pd.read_csv(vend.csv)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In [6], line 1
----> 1 pd.read_csv(vend.csv)
NameError: name 'vend' is not defined
Instead of using vend.csv as our argument, we use "vend.csv". This "vend.csv" is a string.
pd.read_csv("vend.csv")
 | Status | Device ID | Location | Machine | Product | Category | Transaction | TransDate | Type | RCoil | RPrice | RQty | MCoil | MPrice | MQty | LineTotal | TransTotal | Prcd Date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14515778905 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
1 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516018629 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 5.0 | 1/1/2022 |
2 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516018629 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 5.0 | 1/1/2022 |
3 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516020373 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 1.5 | 1/1/2022 |
4 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516021756 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6440 | Processed | VJ300320692 | EB Public Library | EB Public Library x1380 | Lindens - Chocolate Chippers | Food | 15603201222 | Wednesday, August 31, 2022 | Credit | 122 | 2.0 | 1 | 122 | 2.0 | 1 | 2.0 | 6.0 | 8/31/2022 |
6441 | Processed | VJ300320692 | EB Public Library | EB Public Library x1380 | Wonderful Pistachios - Variety | Food | 15603201222 | Wednesday, August 31, 2022 | Credit | 131 | 2.0 | 1 | 131 | 2.0 | 1 | 2.0 | 6.0 | 8/31/2022 |
6442 | Processed | VJ300320692 | EB Public Library | EB Public Library x1380 | Hungry Buddha - Chocolate Chip | Food | 15603201222 | Wednesday, August 31, 2022 | Credit | 137 | 2.0 | 1 | 137 | 2.0 | 1 | 2.0 | 6.0 | 8/31/2022 |
6443 | Processed | VJ300320609 | GuttenPlans | GuttenPlans x1367 | Snapple Tea - Lemon | Non Carbonated | 15603853105 | Wednesday, August 31, 2022 | Credit | 145 | 2.5 | 1 | 145 | 2.5 | 1 | 2.5 | 2.5 | 8/31/2022 |
6444 | Processed | VJ300320692 | EB Public Library | EB Public Library x1380 | Goldfish Baked - Cheddar | Food | 15603921383 | Wednesday, August 31, 2022 | Cash | 125 | 1.5 | 1 | 125 | 1.5 | 1 | 1.5 | 1.5 | 8/31/2022 |
6445 rows × 18 columns
To later access the contents of this dataset, we should store it in a variable. A good default choice of name is df.
df = pd.read_csv("vend.csv")
In the worksheet from yesterday, we only did a few things with this dataset. One thing we did was to look at its first 10 rows using the head method.
df.head(10)
 | Status | Device ID | Location | Machine | Product | Category | Transaction | TransDate | Type | RCoil | RPrice | RQty | MCoil | MPrice | MQty | LineTotal | TransTotal | Prcd Date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14515778905 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
1 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516018629 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 5.0 | 1/1/2022 |
2 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516018629 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 5.0 | 1/1/2022 |
3 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516020373 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 1.5 | 1/1/2022 |
4 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516021756 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
5 | Processed | VJ300205292 | Brunswick Sq Mall | BSQ Mall x1364 - Zales | Bai Antioxidant - Brasilia BB | Non Carbonated | 14517568743 | Sunday, January 2, 2022 | Cash | 146 | 2.5 | 1 | 146 | 2.5 | 1 | 2.5 | 2.5 | 1/2/2022 |
6 | Processed | VJ300205292 | Brunswick Sq Mall | BSQ Mall x1364 - Zales | Miss Vickie's Potato Chip - Sea Salt & Vinega | Food | 14518731524 | Monday, January 3, 2022 | Cash | 114 | 1.5 | 1 | 114 | 1.5 | 1 | 1.5 | 1.5 | 1/2/2022 |
7 | Processed | VJ300320686 | Earle Asphalt | Earle Asphalt x1371 | Miss Vickie's Potato Chip - Lime & Cracked Pe | Food | 14519162059 | Monday, January 3, 2022 | Credit | 110 | 1.5 | 1 | 110 | 1.5 | 1 | 1.5 | 1.5 | 1/3/2022 |
8 | Processed | VJ300320609 | GuttenPlans | GuttenPlans x1367 | Monster Energy Original | Carbonated | 14519670154 | Monday, January 3, 2022 | Credit | 144 | 3.0 | 1 | 144 | 3.0 | 1 | 3.0 | 3.0 | 1/3/2022 |
9 | Processed | VJ300320686 | Earle Asphalt | Earle Asphalt x1371 | Seapoint Farms Dry Roasted Edamame - Wasabi | Food | 14520315330 | Monday, January 3, 2022 | Credit | 134 | 2.5 | 1 | 134 | 2.5 | 1 | 2.5 | 2.5 | 1/3/2022 |
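As a quick aside (not something we did in class), head can also be called with no argument, in which case it shows the first five rows, and there is an analogous tail method for viewing the last rows.
df.head()   # the first 5 rows (5 is the default)
df.tail(3)  # the last 3 rows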
Here is an example of how different data types have different functionality. We can check an element’s data type by using the built-in Python function type. (Python has relatively few built-in functions, far fewer than Mathematica, for example. Often the functions we use in Math 10 will come from an external library; for example, the pd.read_csv function is defined in the pandas library.)
name = "Chris"
type(name)
str
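The same built-in type function works on any Python object. Here is a quick sketch (these particular calls were not in the lecture).
type(3)        # int
type(3.5)      # float
type([1, 2])   # list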
The data type of df is a type defined in pandas. The following says pandas.core.frame.DataFrame. I usually ignore the middle terms and only focus on the first and last terms. The first term is telling us that this is defined in pandas, and the last term is telling us that the type of df is DataFrame.
type(df)
pandas.core.frame.DataFrame
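Relatedly, if you just want a True/False answer to the question “is this object a DataFrame?”, the built-in isinstance function works as well (a small sketch, not from the lecture).
isinstance(df, pd.DataFrame)    # True
isinstance(name, pd.DataFrame)  # False; name is a string, not a DataFrame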
As mentioned above, different types of objects have different functionality. The following lists all of the attributes and methods of strings. (Technically, methods are themselves a type of attribute.) We’ll see examples of how to use these attributes and methods soon. I recommend mostly ignoring the ones that begin with two underscores, like __add__; those are mostly just used in the background by Python.
dir(name)
['__add__',
'__class__',
'__contains__',
'__delattr__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__getitem__',
'__getnewargs__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__iter__',
'__le__',
'__len__',
'__lt__',
'__mod__',
'__mul__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__rmod__',
'__rmul__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'capitalize',
'casefold',
'center',
'count',
'encode',
'endswith',
'expandtabs',
'find',
'format',
'format_map',
'index',
'isalnum',
'isalpha',
'isascii',
'isdecimal',
'isdigit',
'isidentifier',
'islower',
'isnumeric',
'isprintable',
'isspace',
'istitle',
'isupper',
'join',
'ljust',
'lower',
'lstrip',
'maketrans',
'partition',
'removeprefix',
'removesuffix',
'replace',
'rfind',
'rindex',
'rjust',
'rpartition',
'rsplit',
'rstrip',
'split',
'splitlines',
'startswith',
'strip',
'swapcase',
'title',
'translate',
'upper',
'zfill']
In retrospect, the following was a bad example, because "Chris" was already capitalized.
# using the capitalize method
name.capitalize()
'Chris'
name
'Chris'
Here is a better example. We call the upper method of name to convert the string to all upper-case letters.
name.upper()
'CHRIS'
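Many of the other string methods listed by dir above are used the same way; some of them take arguments. A quick sketch (these particular calls were not in the lecture):
name.startswith("Ch")    # True
name.replace("Ch", "K")  # 'Kris'; note that name itself is unchanged
name.count("i")          # 1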
If we try to call the upper method on df, we get an error, because DataFrames do not have an upper method. This is an example of why knowing the data type you’re working with is important: once you know the data type, you also know what special functionality you can access.
df.upper()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In [17], line 1
----> 1 df.upper()
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/generic.py:5465, in NDFrame.__getattr__(self, name)
5463 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5464 return self[name]
-> 5465 return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'upper'
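If you want to check in advance whether an object has a particular attribute or method, one option is the built-in hasattr function (a quick sketch).
hasattr(name, "upper")  # True: strings have an upper method
hasattr(df, "upper")    # False: DataFrames do not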
Here is an example of an attribute, as opposed to a method. Methods are like functions, and attributes are like variables. That’s not a perfect description; you will get more familiar with attributes and methods, and more intuition about whether something is a method or an attribute, as you work more in Python. This particular attribute records the number of rows and the number of columns of df.
# attribute, not a method
df.shape
(6445, 18)
The value corresponding to df.shape is what is called a tuple. At first glance, tuples are very similar to lists (which showed up in Worksheet 0). We will discuss some differences later.
type(df.shape)
tuple
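Because df.shape is a tuple, we can get at the individual numbers by indexing, or by “unpacking” the tuple into two variables. A small sketch:
df.shape[0]  # the number of rows, 6445
df.shape[1]  # the number of columns, 18
num_rows, num_cols = df.shape  # tuple unpacking gives the same two numbers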
Here is another very useful attribute of a DataFrame, the columns attribute. This tells us all the different column names. (I’m being careful not to say it’s a list of the column names, because it is not a list, nor is it a tuple…)
df.columns
Index(['Status', 'Device ID', 'Location', 'Machine', 'Product', 'Category',
'Transaction', 'TransDate', 'Type', 'RCoil', 'RPrice', 'RQty', 'MCoil',
'MPrice', 'MQty', 'LineTotal', 'TransTotal', 'Prcd Date'],
dtype='object')
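If you do want an actual Python list of the column names, you can convert this object, for example as follows.
list(df.columns)       # a Python list of the column names
# df.columns.tolist()  # an equivalent alternative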
To access a specific column from a pandas DataFrame, we can use the following syntax.
df["RPrice"]
0 3.5
1 3.5
2 1.5
3 1.5
4 3.5
...
6440 2.0
6441 2.0
6442 2.0
6443 2.5
6444 1.5
Name: RPrice, Length: 6445, dtype: float64
Columns in pandas DataFrames are represented by the type Series. The two most important data types in pandas are DataFrames and Series.
type(df["RPrice"])
pandas.core.series.Series
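The same bracket syntax works for any name appearing in df.columns, including names that contain spaces, such as "Device ID" (a quick example, not from the lecture).
df["Device ID"]  # another column of df; also a pandas Series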
Indexing in Python starts at 0, so if we want to get the first element of something, like a pandas Series, I will usually call it the “zeroth” element instead of the “first” element. This syntax looks a little strange at first, but to get the zeroth element in the pandas Series df["RPrice"], we use .iloc[0].
df["RPrice"].iloc[0]
3.5
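The same .iloc syntax works for other positions, and even for slices, as in the following sketch.
df["RPrice"].iloc[4]   # the element at position 4 (so the fifth element)
df["RPrice"].iloc[:3]  # a shorter Series containing positions 0, 1, 2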
The type of this value is given as numpy.float64. NumPy is another Python library (like pandas), and because NumPy is a dependency of pandas, any system where pandas works should also have NumPy installed. This data type is defined in NumPy. The “float” in numpy.float64 tells us that these are decimals (as opposed to integers). The “64” specifies how much space the numbers take up; the “64” won’t be very important for us in Math 10.
The following code is longer than what we were writing above. If it doesn’t make sense, try breaking it up into separate pieces. In this case, to understand the full code, you should first understand df["RPrice"], then df["RPrice"].iloc[0], and once you understand that, the full expression type(df["RPrice"].iloc[0]) should make sense.
type(df["RPrice"].iloc[0])
numpy.float64
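Here is the same computation broken into pieces, using intermediate variable names (the names col and entry are just for illustration).
col = df["RPrice"]   # a pandas Series
entry = col.iloc[0]  # the zeroth entry of that Series, 3.5
type(entry)          # numpy.float64, the same answer as above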
The list of attributes and methods of a pandas Series is quite a bit longer than the list for strings that we saw above.
dir(df["RPrice"])
['T',
'_AXIS_LEN',
'_AXIS_ORDERS',
'_AXIS_REVERSED',
'_AXIS_TO_AXIS_NUMBER',
'_HANDLED_TYPES',
'__abs__',
'__add__',
'__and__',
'__annotations__',
'__array__',
'__array_priority__',
'__array_ufunc__',
'__array_wrap__',
'__bool__',
'__class__',
'__contains__',
'__copy__',
'__deepcopy__',
'__delattr__',
'__delitem__',
'__dict__',
'__dir__',
'__divmod__',
'__doc__',
'__eq__',
'__finalize__',
'__float__',
'__floordiv__',
'__format__',
'__ge__',
'__getattr__',
'__getattribute__',
'__getitem__',
'__getstate__',
'__gt__',
'__hash__',
'__iadd__',
'__iand__',
'__ifloordiv__',
'__imod__',
'__imul__',
'__init__',
'__init_subclass__',
'__int__',
'__invert__',
'__ior__',
'__ipow__',
'__isub__',
'__iter__',
'__itruediv__',
'__ixor__',
'__le__',
'__len__',
'__long__',
'__lt__',
'__matmul__',
'__mod__',
'__module__',
'__mul__',
'__ne__',
'__neg__',
'__new__',
'__nonzero__',
'__or__',
'__pos__',
'__pow__',
'__radd__',
'__rand__',
'__rdivmod__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__rfloordiv__',
'__rmatmul__',
'__rmod__',
'__rmul__',
'__ror__',
'__round__',
'__rpow__',
'__rsub__',
'__rtruediv__',
'__rxor__',
'__setattr__',
'__setitem__',
'__setstate__',
'__sizeof__',
'__str__',
'__sub__',
'__subclasshook__',
'__truediv__',
'__weakref__',
'__xor__',
'_accessors',
'_accum_func',
'_add_numeric_operations',
'_agg_by_level',
'_agg_examples_doc',
'_agg_see_also_doc',
'_align_frame',
'_align_series',
'_arith_method',
'_attrs',
'_binop',
'_builtin_table',
'_cacher',
'_can_hold_na',
'_check_inplace_and_allows_duplicate_labels',
'_check_inplace_setting',
'_check_is_chained_assignment_possible',
'_check_label_or_level_ambiguity',
'_check_setitem_copy',
'_clear_item_cache',
'_clip_with_one_bound',
'_clip_with_scalar',
'_cmp_method',
'_consolidate',
'_consolidate_inplace',
'_construct_axes_dict',
'_construct_axes_from_arguments',
'_construct_result',
'_constructor',
'_constructor_expanddim',
'_constructor_sliced',
'_convert',
'_convert_dtypes',
'_cython_table',
'_data',
'_dir_additions',
'_dir_deletions',
'_drop_axis',
'_drop_labels_or_levels',
'_find_valid_index',
'_flags',
'_get_axis',
'_get_axis_name',
'_get_axis_number',
'_get_axis_resolvers',
'_get_block_manager_axis',
'_get_bool_data',
'_get_cacher',
'_get_cleaned_column_resolvers',
'_get_cython_func',
'_get_index_resolvers',
'_get_item_cache',
'_get_label_or_level_values',
'_get_numeric_data',
'_get_value',
'_get_values',
'_get_values_tuple',
'_get_with',
'_gotitem',
'_hidden_attrs',
'_index',
'_indexed_same',
'_info_axis',
'_info_axis_name',
'_info_axis_number',
'_init_dict',
'_init_mgr',
'_inplace_method',
'_internal_names',
'_internal_names_set',
'_is_builtin_func',
'_is_cached',
'_is_copy',
'_is_label_or_level_reference',
'_is_label_reference',
'_is_level_reference',
'_is_mixed_type',
'_is_view',
'_iset_item',
'_item_cache',
'_ix',
'_ixs',
'_logical_func',
'_logical_method',
'_map_values',
'_maybe_cache_changed',
'_maybe_update_cacher',
'_metadata',
'_mgr',
'_min_count_stat_function',
'_name',
'_needs_reindex_multi',
'_obj_with_exclusions',
'_protect_consolidate',
'_reduce',
'_reindex_axes',
'_reindex_indexer',
'_reindex_multi',
'_reindex_with_indexers',
'_replace_single',
'_repr_data_resource_',
'_repr_latex_',
'_reset_cache',
'_reset_cacher',
'_selected_obj',
'_selection',
'_selection_list',
'_selection_name',
'_set_as_cached',
'_set_axis',
'_set_axis_name',
'_set_axis_nocheck',
'_set_is_copy',
'_set_item',
'_set_labels',
'_set_name',
'_set_value',
'_set_values',
'_set_with',
'_set_with_engine',
'_slice',
'_stat_axis',
'_stat_axis_name',
'_stat_axis_number',
'_stat_function',
'_stat_function_ddof',
'_take_with_is_copy',
'_to_dict_of_blocks',
'_try_aggregate_string_function',
'_typ',
'_update_inplace',
'_validate_dtype',
'_values',
'_where',
'abs',
'add',
'add_prefix',
'add_suffix',
'agg',
'aggregate',
'align',
'all',
'any',
'append',
'apply',
'argmax',
'argmin',
'argsort',
'array',
'asfreq',
'asof',
'astype',
'at',
'at_time',
'attrs',
'autocorr',
'axes',
'backfill',
'between',
'between_time',
'bfill',
'bool',
'clip',
'combine',
'combine_first',
'compare',
'convert_dtypes',
'copy',
'corr',
'count',
'cov',
'cummax',
'cummin',
'cumprod',
'cumsum',
'describe',
'diff',
'div',
'divide',
'divmod',
'dot',
'drop',
'drop_duplicates',
'droplevel',
'dropna',
'dtype',
'dtypes',
'duplicated',
'empty',
'eq',
'equals',
'ewm',
'expanding',
'explode',
'factorize',
'ffill',
'fillna',
'filter',
'first',
'first_valid_index',
'flags',
'floordiv',
'ge',
'get',
'groupby',
'gt',
'hasnans',
'head',
'hist',
'iat',
'idxmax',
'idxmin',
'iloc',
'index',
'infer_objects',
'interpolate',
'is_monotonic',
'is_monotonic_decreasing',
'is_monotonic_increasing',
'is_unique',
'isin',
'isna',
'isnull',
'item',
'items',
'iteritems',
'keys',
'kurt',
'kurtosis',
'last',
'last_valid_index',
'le',
'loc',
'lt',
'mad',
'map',
'mask',
'max',
'mean',
'median',
'memory_usage',
'min',
'mod',
'mode',
'mul',
'multiply',
'name',
'nbytes',
'ndim',
'ne',
'nlargest',
'notna',
'notnull',
'nsmallest',
'nunique',
'pad',
'pct_change',
'pipe',
'plot',
'pop',
'pow',
'prod',
'product',
'quantile',
'radd',
'rank',
'ravel',
'rdiv',
'rdivmod',
'reindex',
'reindex_like',
'rename',
'rename_axis',
'reorder_levels',
'repeat',
'replace',
'resample',
'reset_index',
'rfloordiv',
'rmod',
'rmul',
'rolling',
'round',
'rpow',
'rsub',
'rtruediv',
'sample',
'searchsorted',
'sem',
'set_axis',
'set_flags',
'shape',
'shift',
'size',
'skew',
'slice_shift',
'sort_index',
'sort_values',
'squeeze',
'std',
'sub',
'subtract',
'sum',
'swapaxes',
'swaplevel',
'tail',
'take',
'to_clipboard',
'to_csv',
'to_dict',
'to_excel',
'to_frame',
'to_hdf',
'to_json',
'to_latex',
'to_list',
'to_markdown',
'to_numpy',
'to_period',
'to_pickle',
'to_sql',
'to_string',
'to_timestamp',
'to_xarray',
'transform',
'transpose',
'truediv',
'truncate',
'tz_convert',
'tz_localize',
'unique',
'unstack',
'update',
'value_counts',
'values',
'var',
'view',
'where',
'xs']
Notice that there is a dtype attribute. That gives us easier access to the data type of the contents of the pandas Series.
df["RPrice"].dtype
dtype('float64')
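For a whole DataFrame, there is a similar dtypes attribute (note the extra “s”) which reports the data type of every column at once.
df.dtypes  # a pandas Series, indexed by the column names, giving each column's dtype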