Evalml: Poor performance on diamond dataset

Created on 5 Oct 2020  ·  3 Comments  ·  Source: alteryx/evalml

Problem
AutoML yields models with negative R2.

Repro
Dataset here.

import evalml
import pandas as pd
import numpy as np
df = pd.read_csv('stones_encoded_small.csv')
y = df.pop('total_sales_price')
automl = evalml.automl.AutoMLSearch(problem_type='regression')
automl.search(df, y)

Data checks will fail due to highly null / single-value columns. You can disable them with data_checks='disabled'. Or, to address them and continue:

cols_to_drop = ['culet_condition', 'fancy_color_dominant_color', 'fancy_color_intensity', 'fancy_color_overtone', 'fancy_color_secondary_color', 'fluor_color', 'image_file_url', 'diamond_id', 'currency_code', 'currency_symbol', 'has_sarineloupe']
df.drop(columns=cols_to_drop, inplace=True)
automl = evalml.automl.AutoMLSearch(problem_type='regression')
automl.search(df, y)

The results are much the same either way: negative R2 values for all models, i.e. every model performs worse than simply predicting the mean of the target.

Switching the objective to MSE or MAE yields similarly poor models.

Discussion
My first suspicion was that the features weren't getting the right types. Many of the dtypes pandas infers are float64 but have only a few unique values, i.e. they should arguably be categorical. I gave that a shot, but it didn't seem to change the model results, so there's more to the story.
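For reference, a sketch of that conversion attempt (the helper name and the threshold of 10 unique values are arbitrary choices of mine for illustration, not anything evalml does):

```python
import pandas as pd

def lowcard_floats_to_categorical(df, max_unique=10):
    """Convert float64 columns with few unique values to pandas 'category' dtype."""
    cols = [c for c in df.select_dtypes(include='float64').columns
            if df[c].nunique() <= max_unique]
    return df.astype({c: 'category' for c in cols})

# e.g. df = lowcard_floats_to_categorical(pd.read_csv('stones_encoded_small.csv'))
```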

bug

All 3 comments

Hi Team,

I believe this is related to the input data set being sorted by the target variable, combined with the sampling method used for the 3-fold cross validation. This data set is sorted by price from lowest to highest. I suspect the cross validation splits the records in order, so the splits are tied to the target variable: each fold is scored against target values that were never seen in training, which drives the R2 scores very low. Shuffling the full data set before feeding it into the search resolves the behavior.
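This effect is easy to reproduce outside evalml. A minimal sketch using scikit-learn directly, with synthetic data standing in for the diamond CSV: an unshuffled 3-fold split on data sorted by the target pushes the test folds outside the training target range (most sharply on the extreme folds), while a shuffled split scores normally.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, sorted by target to mimic the diamond CSV,
# which is ordered by price from lowest to highest.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=300)
order = np.argsort(y)
X, y = X[order], y[order]

model = DecisionTreeRegressor(random_state=0)

# Without shuffling, the extreme folds are scored on target values entirely
# outside the training range, so their R2 goes negative.
unshuffled = cross_val_score(model, X, y, cv=KFold(n_splits=3), scoring='r2')

# Shuffling the split restores sensible scores.
shuffled = cross_val_score(
    model, X, y,
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    scoring='r2')

print('unshuffled:', unshuffled, 'shuffled:', shuffled)
```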

  • As @SydneyAyx mentioned, you get better R2 scores once you shuffle the dataset.
import evalml
import pandas as pd
import numpy as np
from evalml.data_checks import EmptyDataChecks

df = pd.read_csv('stones_encoded_small.csv')

# shuffles data
df = df.sample(frac=1)

y = df.pop('total_sales_price')
automl = evalml.automl.AutoMLSearch(problem_type='regression')
automl.search(df, y, data_checks=EmptyDataChecks())

Thank you @SydneyAyx @gsheni! Great detective work there, genius :)

Yes, confirmed. It appears our default data splitters in automl don't currently set shuffle=True.

@SydneyAyx @gsheni one workaround is to shuffle before running automl as @gsheni showed above. Another workaround is to set your own data splitter, like so:

import evalml
import pandas as pd
import numpy as np
import sklearn.model_selection
df = pd.read_csv('stones_encoded_small.csv')
y = df.pop('total_sales_price')

data_splitter = sklearn.model_selection.KFold(n_splits=3, random_state=0, shuffle=True)
automl = evalml.automl.AutoMLSearch(problem_type='regression', data_split=data_splitter)
automl.search(df, y, data_checks='disabled')

I'll get a PR up with the evalml fix.
