Evalml: AutoMLSearch: calling search twice on same instance doesn't work

Created on 11 Nov 2020  ·  5 Comments  ·  Source: alteryx/evalml

Not sure if this is intended behavior, but when I call automl.search(X, y) a second time on the same automl object, the second search runs the baseline round and then quits, because the number of iterations starts at 6 (i.e., 1 + the number of iterations from the first search):

(screenshot omitted)
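For reference, a minimal repro sketch of the behavior described above. The demo loader and parameter names (e.g. max_iterations) vary across evalml versions, so treat this as illustrative rather than the exact original invocation:

```python
import evalml
from evalml.automl import AutoMLSearch

# Any small classification dataset works; the demo loader is just for convenience.
X, y = evalml.demos.load_breast_cancer()

automl = AutoMLSearch(problem_type="binary", max_iterations=5)
automl.search(X, y)   # first call: baseline + 4 pipelines, as expected
automl.search(X, y)   # second call: scores the baseline again, then stops immediately
```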

Label: bug

All 5 comments

@angela97lin I think the bug is that we score the baseline a second time; I believe it's intended that no non-baseline pipelines are searched over the second time.

I think the second call to search should be a no-op, so maybe we should refactor search to only score the baseline on the first "round". What do you think?
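For illustration, a toy sketch of the kind of guard being proposed; the class and attribute names below are invented and are not evalml internals:

```python
# Toy model of the proposal: score the baseline exactly once, and let later calls
# become a no-op once the iteration budget has been used up.
class ToySearch:
    def __init__(self, max_iterations=5):
        self.max_iterations = max_iterations
        self.results = []              # stands in for the rankings/results table
        self._baseline_scored = False

    def search(self):
        if not self._baseline_scored:  # baseline is scored only on the first call
            self.results.append("baseline")
            self._baseline_scored = True
        while len(self.results) < self.max_iterations:
            self.results.append(f"pipeline_{len(self.results)}")


s = ToySearch(max_iterations=3)
s.search()
print(s.results)   # ['baseline', 'pipeline_1', 'pipeline_2']
s.search()
print(s.results)   # unchanged: the second call is effectively a no-op
```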

@freddyaboulton Hm, I like your proposal! One consideration, though: if the user passes in a different X, y for the same automl object, we'd still want to recalculate the baseline scores, right? But we don't have a way to keep track of a session yet, so I wonder if it's best to recalculate the baseline just in case?

On a larger scale, is it intended that a user is able to reuse the same automl object? Or should they create a new object every time they want to run search?
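For illustration, one way a "session" of training data could be tracked is by fingerprinting X and y and recomputing the baseline only when the fingerprint changes. This is only a sketch; evalml does not do this, and data_fingerprint is an invented helper:

```python
import hashlib

import pandas as pd


def data_fingerprint(X: pd.DataFrame, y: pd.Series) -> str:
    """Hash the contents of X and y so a changed dataset can be detected."""
    h = hashlib.sha256()
    h.update(pd.util.hash_pandas_object(X, index=True).values.tobytes())
    h.update(pd.util.hash_pandas_object(y, index=True).values.tobytes())
    return h.hexdigest()


X = pd.DataFrame({"a": [1, 2, 3]})
y = pd.Series([0, 1, 0])
first = data_fingerprint(X, y)
print(first == data_fingerprint(X, y))                  # True: same data
print(first == data_fingerprint(X, y.replace({0: 1})))  # False: data changed
```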

Great point @angela97lin! I think our current AutoMLSearch design lends itself to the "one object per search" paradigm you mention. My reasoning is that many of the configuration parameters the user specifies when creating the search are specific to the problem/dataset at hand (problem_type, allowed_pipelines, objective). The values for these parameters may be reasonable for one X, y but not for another X, y of the same problem type. Moreover, the rankings table would be misleading if we let the user call search on separate datasets, since the CV scores would not be directly comparable.

So I think if we want to follow what we have so far, we should do the "no-op" solution: make sure no pipelines are scored on subsequent calls to search if the stopping criteria have been met. I think we should only recalculate the baselines if we change our design to allow reusing the same search object.

Happy to talk about whether we should refactor/redesign AutoMLSearch to allow calling search multiple times. I don't have a strong opinion yet of what would be most useful for our end users!

@angela97lin @freddyaboulton thanks for filing and great discussion. I agree!

As you both alluded to, we've been talking about this bottom-up, i.e. "given our current API what should the behavior be?", but we should also consider this top-down, i.e. "what are things users want to do with automl search?" I think that if we decide we want to support behavior like pausing and resuming searches, we should consider building a different API for that (#1047) before we invest time into updating AutoMLSearch further.

With that in mind, options for what should happen when we call search again on an AutoMLSearch instance after the first call (not necessarily mutually exclusive):

  0. Current buggy behavior
  1. Error: "running search more than once on an AutoMLSearch instance is not allowed"
  2. No-op: nothing happens.
  3. Rerun the entire search from scratch. All state created during previous calls to search gets overwritten
  4. "Continue" or "resume" the search from where it left off

I agree that option 0 (current behavior) is buggy and we should change it. For the time being I like option 2 or 3 the most. I think we should go for option 3 for now, and if that feels too complicated to build, we can fall back to option 2. Long-term I actually like option 4 (continuing) the most but I think we should punt on that for now.
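For concreteness, a toy sketch of what options 1 (error) and 2 (no-op) could look like; the flag, exception class, and log message are invented for illustration and are not evalml's actual behavior:

```python
import logging

logger = logging.getLogger(__name__)


class SearchAlreadyRunError(Exception):
    """Hypothetical exception for option 1."""


class ToyAutoML:
    def __init__(self, on_repeat="noop"):   # "error" -> option 1, "noop" -> option 2
        self.on_repeat = on_repeat
        self._has_run = False

    def search(self):
        if self._has_run:
            if self.on_repeat == "error":
                raise SearchAlreadyRunError(
                    "running search more than once on an AutoMLSearch instance is not allowed"
                )
            logger.warning("search() already ran on this instance; doing nothing.")
            return
        self._has_run = True
        print("running baseline + automl rounds")


automl = ToyAutoML(on_repeat="noop")
automl.search()   # runs the search
automl.search()   # second call: warning only, nothing is rerun
```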

@dsherry @angela97lin @freddyaboulton Now that AutoMLSearch takes X_train and y_train at initialization rather than in the search function, is option 3 still the best way to proceed? Since the X and y data would be the same if they simply called search again, we could just make the call a no-op rather than rerunning the search from scratch. But other than that, I don't think there's much more to do on this issue. Let me know your thoughts!
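For reference, a sketch of what the proposed no-op would look like with the newer constructor-based API; the exact constructor arguments may differ slightly between evalml versions:

```python
import evalml
from evalml.automl import AutoMLSearch

X, y = evalml.demos.load_breast_cancer()

# Newer API: training data is passed to the constructor, not to search().
automl = AutoMLSearch(X_train=X, y_train=y, problem_type="binary", max_iterations=5)
automl.search()   # runs baseline + pipelines
automl.search()   # proposed behavior: no-op, since the data cannot have changed
```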
