Error on ages
Contents
Error on ages¶
Author: Remy Clement
Course Project, UC Irvine, Math 10, W22
Introduction¶
First I am going to clean up a dataframe about artists so that it is usable for scikit-learn. Then I will use a k-nearest-neighbors regressor to predict artists' birth dates from their death dates and find the error of the predictions. I will also see how changing k affects the train and test error.
Error¶
# Route notebook-style tqdm to the plain-text version (works without ipywidgets)
from tqdm.std import tqdm, trange
from tqdm import notebook
notebook.tqdm = tqdm
notebook.trange = trange

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# torch / torchvision are imported here but not used in the analysis below
import torch
from torch import nn
from torchvision import datasets
from torchvision.transforms import ToTensor
from torch.utils.data import DataLoader

import altair as alt
First I need to load the dataset.
df=pd.read_csv('Artworks.csv')
df
[DataFrame preview: 132403 rows × 29 columns. Columns include Title, Artist, ConstituentID, ArtistBio, Nationality, BeginDate, EndDate, Gender, Date, Medium, ThumbnailURL, Circumference (cm), Depth (cm), Diameter (cm), Height (cm), Length (cm), Weight (kg), Width (cm), Seat Height (cm), and Duration (sec.). BeginDate and EndDate hold parenthesized years such as (1841) and (1918), with (0) when no death year applies.]
Here I am dropping all the columns that I do not want or need.
df = df.drop(df.columns[[2,9,10,11,12,13,14,16,17,18,19,20,21,22,24,25,27,28]], axis=1)
Here I am going to drop the last n rows, since Altair charts cannot work with large amounts of data by default (here n = 130,000, leaving about 2,400 rows, which is under Altair's default 5,000-row limit).
n = 130000
df.drop(df.tail(n).index, inplace=True)
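As an aside, instead of discarding rows one could lift Altair's default 5,000-row limit; a minimal sketch (not what this notebook does):

alt.data_transformers.disable_max_rows()  # allow charts with more than 5,000 rows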
Here we are dropping NA values from the whole dataframe.
df=df.dropna()
At first I thought that I needed to drop the decimals, or keep at most one decimal place, but I realized that was not really needed; although I still decided to keep the rounding.
df['Width (cm)'] = df['Width (cm)'].round(decimals=1)
df['Height (cm)'] = df['Height (cm)'].round(decimals=1)
In the next four cells I take each value from the BeginDate and EndDate columns and remove its first and last characters, keeping positions 1 through 4. I am doing this because the years are trapped in parentheses, e.g. (1999), and I need only the numerical value inside them. So first I slice off the parentheses, and then I convert the value to a numeric type, since even after removing the parentheses the column still has an object dtype.
df['BeginDate'] = df['BeginDate'].map(lambda n: n[1:5])
df['BeginDate'] = pd.to_numeric(df['BeginDate'], errors='coerce')
df['EndDate'] = df['EndDate'].map(lambda n: n[1:5])
df['EndDate'] = pd.to_numeric(df['EndDate'], errors='coerce')
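As an alternative to the four cells above (a sketch, not what the notebook runs, assuming the raw values always look like "(1999)"), pandas string methods can strip the parentheses in one step per column without hard-coding a four-digit slice:

df['BeginDate'] = pd.to_numeric(df['BeginDate'].str.strip('()'), errors='coerce')
df['EndDate'] = pd.to_numeric(df['EndDate'].str.strip('()'), errors='coerce')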
Importing libraries for regression
from numpy.random import default_rng
rng = default_rng()
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
Here I am going to plot the year an artist was born on the x-axis against the year they died on the y-axis. Points further up and to the right are artists who lived closer to the present. It also makes sense that there are no outliers in the top-left or bottom-right corners: a point in the top left would mean someone lived an implausibly long time, and a point in the bottom right would mean a person was born after they died, which does not make sense.
alt.Chart(df).mark_circle().encode(
    x=alt.X("BeginDate", scale=alt.Scale(domain=(1730, 2040))),
    y=alt.Y("EndDate", scale=alt.Scale(domain=(1730, 2040))),
    tooltip='Artist'
).properties(title='Lifetime')
Here I am dropping NaN values from the 'BeginDate' and 'EndDate' columns.
df.dropna(subset=['BeginDate'], inplace=True)
df.dropna(subset=['EndDate'], inplace=True)
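Equivalently, both columns can be handled in a single call (a small sketch of the same operation):

df.dropna(subset=['BeginDate', 'EndDate'], inplace=True)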
# df2 = df
# i = 700
# df.drop(df.tail(i).index, inplace=True)
I was planning on removing a lot of rows from the dataframe so the graphs below would be much easier to interpret, since there wouldn't be nearly as many artists; with this many, all their names overlap, so we cannot really pick out a single artist of our choice. I was planning on doing this with the commented-out code above, but it was removing 700 rows from my original dataframe as well. The reason is that df2 = df does not make a copy: both names refer to the same DataFrame, so dropping rows through either name changes the original.
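A minimal sketch of the fix, using an independent copy (df2 is the hypothetical name from the commented cell above):

df2 = df.copy()                              # independent copy of the data
df2.drop(df2.tail(700).index, inplace=True)  # shrinks df2 only; df is untouched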
In the graphs below we can see the birth dates and death dates of the various artists and how they bounce around from artist to artist.
df.plot(kind='line',x='Artist',y='BeginDate')
df.plot(kind='line',x='Artist',y='EndDate', color='red')
<AxesSubplot:xlabel='Artist'>
It is not so important to rescale the data here, since the columns from the dataframe that I am using are on approximately the same scale; comparing them gives only a trivial difference. It is important to remember to rescale if the difference between feature scales becomes very large.
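If the scales did differ a lot, one standard approach (a sketch, not run in this notebook) is to put a scaler in front of the distance-based regressor:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Scale features to zero mean / unit variance before KNN computes distances
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=8))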
It is important to have a test set here: holding out a percentage of the data and evaluating on it serves as a way to assess the model's performance.
X_train, X_test, y_train, y_test = train_test_split(
df[['EndDate']], df["BeginDate"], test_size = 0.4)
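The split is random, so the error values below change from run to run; fixing random_state (an addition of mine, not in the original) would make them reproducible:

X_train, X_test, y_train, y_test = train_test_split(
    df[['EndDate']], df["BeginDate"], test_size=0.4, random_state=0)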
Here we are using regression instead of classification because the values we are predicting are numerical, not categorical.
reg = KNeighborsRegressor(n_neighbors=8)
Below I am just making sure that each set has no null values, because previously one set had an additional null value compared to the other. After dropping that null value, one set had one more row than the other, making their lengths different, so the fit failed and I had to manually drop a row from the larger set. This was because I wasn't able to drop the value in one set that corresponded to the NaN value in the other set.
y_test.isnull().value_counts()
False 302
Name: BeginDate, dtype: int64
X_test.isnull().value_counts()
EndDate
False 302
dtype: int64
X_train.isnull().value_counts()
EndDate
False 453
dtype: int64
y_train.isnull().value_counts()
False 453
Name: BeginDate, dtype: int64
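The same check can be done more compactly (a sketch; the asserts pass silently when everything lines up):

# No nulls anywhere, and features/targets have matching lengths
assert not X_train.isnull().any().any() and not y_train.isnull().any()
assert not X_test.isnull().any().any() and not y_test.isnull().any()
assert len(X_train) == len(y_train) and len(X_test) == len(y_test)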
Here I am fitting the model; it is important to remember to fit on the training set, not the test set.
reg.fit(X_train, y_train)
KNeighborsRegressor(n_neighbors=8)
Now it is time to see the error in the predictions on the test and train sets, essentially taking the mean of the absolute differences between the actual values and the predicted ones. Since the error on the training data is smaller (better performance on data the model has seen is expected) and the gap to the test error is modest, we should not worry much about overfitting.
mean_absolute_error(reg.predict(X_test), y_test)
4.961920529801325
mean_absolute_error(reg.predict(X_train), y_train)
3.8614790286975715
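As a sanity check, the same test number can be recomputed by hand with numpy (already imported above):

# Mean absolute error, spelled out: average of |prediction - truth|
np.mean(np.abs(reg.predict(X_test) - y_test))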
Here I am using a for loop to see how the number of neighbors chosen affects the mean absolute error for the train and test sets. I will then graph the values to see how we are doing.
First I need to create a dataframe in which to store the error values.
dfs = pd.DataFrame({"k":range(0,70),"train_error":1,"test_error":1})
Here I am computing the actual error values, putting each test error into one column of the new dataframe and each train error into another.
for k in range(1, 70):
    reg = KNeighborsRegressor(n_neighbors=k)
    reg.fit(X_train, y_train)
    train_error = mean_absolute_error(reg.predict(X_train), y_train)
    test_error = mean_absolute_error(reg.predict(X_test), y_test)
    print(f"using {k} nearest neighbors")
    print((train_error, test_error))
    print(' ')
    dfs.loc[k, 'test'] = test_error
    dfs.loc[k, 'train'] = train_error
using 1 nearest neighbors
(2.5805739514348787, 4.182119205298013)
using 2 nearest neighbors
(2.8311258278145695, 4.130794701986755)
using 3 nearest neighbors
(2.9896983075791, 4.0916114790286935)
using 4 nearest neighbors
(3.05794701986755, 4.408112582781457)
using 5 nearest neighbors
(3.268432671081676, 4.536423841059596)
using 6 nearest neighbors
(3.557395143487858, 4.6578366445916135)
using 7 nearest neighbors
(3.688741721854302, 4.797067171239353)
using 8 nearest neighbors
(3.8614790286975715, 4.961920529801325)
using 9 nearest neighbors
(4.0924699533970985, 5.127667402501834)
using 10 nearest neighbors
(4.466445916114788, 5.420529801324503)
using 11 nearest neighbors
(4.729480232791484, 5.720349187236595)
using 12 nearest neighbors
(4.984363502575416, 5.89183222958057)
using 13 nearest neighbors
(5.140601120733577, 5.9949057564951636)
using 14 nearest neighbors
(5.2945443077893515, 6.137180700094621)
using 15 nearest neighbors
(5.491096394407634, 6.245695364238396)
using 16 nearest neighbors
(5.574503311258278, 6.2365480132450335)
using 17 nearest neighbors
(5.726399168939098, 6.3591741332294465)
using 18 nearest neighbors
(5.813220505273495, 6.4203458425312805)
using 19 nearest neighbors
(5.862437550830697, 6.452596723597049)
using 20 nearest neighbors
(6.05165562913907, 6.609105960264897)
using 21 nearest neighbors
(6.113949332492374, 6.6125827814569496)
using 22 nearest neighbors
(6.2445314067830635, 6.70650210716436)
using 23 nearest neighbors
(6.345426624436115, 6.7539591131586425)
using 24 nearest neighbors
(6.515452538631356, 6.893625827814581)
using 25 nearest neighbors
(6.498454746136858, 6.844768211920527)
using 26 nearest neighbors
(6.52784853115979, 6.863474274070294)
using 27 nearest neighbors
(6.641075954541759, 6.931567328918341)
using 28 nearest neighbors
(6.61439608956165, 6.905392620624408)
using 29 nearest neighbors
(6.663774073228301, 6.959237268782835)
using 30 nearest neighbors
(6.720456217807207, 6.992935982339962)
using 31 nearest neighbors
(6.7894324574521265, 7.030762657551817)
using 32 nearest neighbors
(6.850027593818985, 7.068501655629139)
using 33 nearest neighbors
(6.884139407318208, 7.097330925145488)
using 34 nearest neighbors
(6.939942864563058, 7.138001558239202)
using 35 nearest neighbors
(6.993377483443719, 7.174645222327351)
using 36 nearest neighbors
(6.982953151827335, 7.151306107431945)
using 37 nearest neighbors
(7.035797386790757, 7.136030069804896)
using 38 nearest neighbors
(7.076739862902291, 7.162774485883585)
using 39 nearest neighbors
(7.184921039225692, 7.1719307182883485)
using 40 nearest neighbors
(7.271026490066238, 7.226986754966897)
using 41 nearest neighbors
(7.3235341624939565, 7.235987724115675)
using 42 nearest neighbors
(7.371964679911698, 7.281535793125193)
using 43 nearest neighbors
(7.404435545972565, 7.291313722470332)
using 44 nearest neighbors
(7.480433473810976, 7.345499698976538)
using 45 nearest neighbors
(7.508314937454034, 7.372406181015473)
using 46 nearest neighbors
(7.582445532200798, 7.4105240426144565)
using 47 nearest neighbors
(7.654830679629863, 7.476257573622637)
using 48 nearest neighbors
(7.686442236938938, 7.494757174392944)
using 49 nearest neighbors
(7.690904176240043, 7.485471009595897)
using 50 nearest neighbors
(7.703752759381887, 7.469470198675481)
using 51 nearest neighbors
(7.733454529714762, 7.482469809115699)
using 52 nearest neighbors
(7.7379011716760004, 7.453451349974531)
using 53 nearest neighbors
(7.742304968969973, 7.442459077845812)
using 54 nearest neighbors
(7.749775161474952, 7.424270296786868)
using 55 nearest neighbors
(7.7558900260886885, 7.416556291390717)
using 56 nearest neighbors
(7.761510564490704, 7.401194418164624)
using 57 nearest neighbors
(7.7595755392897185, 7.392761705588466)
using 58 nearest neighbors
(7.758810991855061, 7.361897693537336)
using 59 nearest neighbors
(7.7687731507464415, 7.348692333595244)
using 60 nearest neighbors
(7.778182487122872, 7.343377483443705)
using 61 nearest neighbors
(7.750805196685118, 7.309304092932353)
using 62 nearest neighbors
(7.729901018300927, 7.271523178807939)
using 63 nearest neighbors
(7.76719576719577, 7.283611899505942)
using 64 nearest neighbors
(7.768970750551876, 7.319691639072848)
using 65 nearest neighbors
(7.808796060451697, 7.349465104431993)
using 66 nearest neighbors
(7.895578299551806, 7.419325707405179)
using 67 nearest neighbors
(7.9770683008797, 7.491845408718001)
using 68 nearest neighbors
(7.985813530710276, 7.497175691468632)
using 69 nearest neighbors
(8.020955306011455, 7.540742873596327)
Here we would want to use the number of neighbors where the test error is lowest.
Next I drop the first and second columns of the new dataframe: these are the placeholder train_error and test_error columns, which still hold the initial value of 1 because the loop wrote its results into new test and train columns instead.
dfs = dfs.drop(dfs.columns[[1,2]], axis=1)
And now remove row 0, since it contains NaN in the test and train columns (the loop started at k = 1, while the dataframe's k column started at 0).
dfs = dfs.iloc[1:, :]
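An alternative sketch that avoids the placeholder columns and the NaN row entirely (dfs_alt is a hypothetical name): collect the errors as records and build the dataframe in one step.

records = []
for k in range(1, 70):
    reg = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    records.append({"k": k,
                    "train": mean_absolute_error(reg.predict(X_train), y_train),
                    "test": mean_absolute_error(reg.predict(X_test), y_test)})
dfs_alt = pd.DataFrame(records)  # columns: k, train, test -- no cleanup needed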
Now make a graph showing the train error as the number of neighbors k increases.
ctrain = alt.Chart(dfs).mark_line().encode(
    x='k',
    y='train'
)
Similarly, make a graph showing the test error as the number of neighbors k increases, as a different colored line to make comparison easier.
ctest = alt.Chart(dfs).mark_line(color="orange").encode(
    x='k',
    y='test'
)
Now combine the two charts to compare them more easily.
Here we should be wary of using more than around 37 neighbors: not only have both errors been going up, but past that point the test error curve goes under the train error curve. Rising error at large k is a sign of underfitting (the prediction averages over too many neighbors) rather than overfitting, which would show up at small k as a train error far below the test error. The chart also makes it easy to see at what number of neighbors the test error is smallest.
ctest+ctrain
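We can also read the best k off programmatically (a small sketch using the dfs table built above):

# k with the smallest test error; idxmin ignores NaN entries if any remain
best_k = int(dfs.loc[dfs['test'].idxmin(), 'k'])
best_k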
Originally I was getting a graph where the test error curve was below the train error curve from the start. I think that problem appeared because I was removing a row from one set that didn't correspond to the NaN value dropped from the other set. Fortunately this was resolved, as discussed higher up.
The dataframe used was found here: https://www.kaggle.com/rishidamarla/art-and-artists-from-the-museum-of-modern-art?select=Artworks.csv