Evalml: Error handling for code with third-party dependencies

Created on 22 Jan 2020  ·  10 comments  ·  Source: alteryx/evalml

We currently have at least one third-party library dependency (xgboost) and another on the way (catboost #247). This issue tracks figuring out how we present that to users.

Questions:

  • What happens when a user tries to run a pipeline with a third-party dependency which isn't installed?
  • How can we allow users to opt in to or out of pipelines with third-party dependencies during automl?

My current thought is that the code @angela97lin is adding in the catboost PR is a good start, i.e. we have each estimator or component throw an error when the underlying library is missing. But I think we can do more than that too. We could catch that error in the automl, skip the pipeline in question, perhaps print a warning, and continue the search. We could also have users install third-party deps by default, by including them in requirements.txt, which should cover the majority of cases.
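
For example, a rough sketch of what that catch-and-skip behavior could look like on the automl side (the function and class names here are illustrative, not evalml's actual internals):

```python
import logging

logger = logging.getLogger(__name__)

def search(pipeline_classes, X, y):
    """Illustrative automl loop: skip pipelines whose third-party deps are missing."""
    fitted = []
    for pipeline_class in pipeline_classes:
        try:
            # component __init__ is expected to raise ImportError when its library is missing
            pipeline = pipeline_class()
        except ImportError as err:
            logger.warning("Skipping %s: %s", pipeline_class.__name__, err)
            continue
        pipeline.fit(X, y)
        fitted.append(pipeline)
    return fitted
```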

Adding related comments made in #247 to keep things in one place:

  • Do we include pipelines with third-party dependencies in the automl search? (I think the fact that we're adding them to the codebase implies "yes" unless we discover they're not useful for some reason)
  • What should the behavior of third-party pipelines and components be for users who don't have the third-party dependency installed? There are several options: error, warning, silent failure/skip, or do a library check before the automl search starts and don't include that pipeline to begin with (this last is my favorite)
  • How do we add test coverage of our answers to the above?
Labels: enhancement, needs design, refactor

All 10 comments

Note @jeremyliweishih: we should decide if this sort of thing is in scope for the pipeline project or if we want to punt on it

I think it should be its own one-off issue and not part of the requirements for the pipeline project, since it doesn't tie in much with the rest of that project. We can always add the pipeline project as a blocker for a long-term fix and ship a simple patch (include all third-party deps so far) first.

That sounds good to me. In that case I think we should consider this blocked on the pipeline project, because the solution could change depending on what we change about pipelines.

@kmax12 :

My instinct here is that we should complete the phase 1 pipeline project @jeremyliweishih is working on, then circle back to this issue and build a sustainable way to make third-party dependencies optional.

And until then, we may continue to merge things like #247 (which adds new third-party deps to requirements.txt but also adds rudimentary error handling as a fallback), with the understanding that we'll circle back and make them optional when we address this issue.

Do you have an opinion on that plan?

@kmax12 said in meeting today:

  • He's on board with this plan. Separating third-party deps from the pipeline project
  • Two categories of third-party deps: 1) libs for modeling pipelines, 2) feature-specific libraries like S3
  • Look at how featuretools handles this; evalml could do something similar, e.g. pip install evalml[complete]
  • It's good to have a bare-bones installation, which doesn't require everything
  • There are some deps in there which should be removed altogether, like plotting libs
  • Next step for this feature is a design doc on how we'd get this done
  • For now, import_or_raise is helpful to use, as Angela did in the catboost PR

Adding additional notes from comments made in #247:

@dsherry mentioned that cb_error_msg (the error message raised when catboost is not installed) and other similar messages could be an attribute of Estimator. For example, we could do the following:

  • Define a libs variable in Estimator which is an empty list by default
  • Define class method Estimator.libs_err_msg(lib), which by default takes a format string "{lib} is not installed. Please install using pip install {lib}." and applies the lib argument to it
  • Have Estimator.__init__ do something like for lib in self.libs: import_or_raise(lib, self.libs_err_msg(lib))
  • Have each class define libs to be a list with one or more third-party library names
  • And optionally, each class could define an override of libs_err_msg(lib) if needed

I think it could even be worth having this at the ComponentBase level, so that any component requiring a third-party library could use this framework.
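
A minimal sketch of that framework, with import_or_raise stubbed in for illustration (evalml's actual helper and base classes differ in detail):

```python
import importlib

def import_or_raise(library, error_msg):
    """Stubbed helper: import a library by name or raise with a friendly message."""
    try:
        return importlib.import_module(library)
    except ImportError:
        raise ImportError(error_msg)

class Estimator:
    libs = []  # subclasses list their third-party library names here

    @classmethod
    def libs_err_msg(cls, lib):
        # subclasses can override this for a more specific message
        return "{lib} is not installed. Please install using pip install {lib}.".format(lib=lib)

    def __init__(self):
        for lib in self.libs:
            import_or_raise(lib, self.libs_err_msg(lib))

class CatBoostClassifier(Estimator):
    libs = ["catboost"]  # raises at construction time if catboost isn't installed
```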

I'll try to knock this out this week. Next step is to list out reqs and desired behavior, and decide on an API.

@rwedge and I discussed this yesterday.

Background on configuring installation via setuptools
Setuptools supports "extras", which we could use to offer something like pip install evalml[complete], configured to install extra packages on top of the base pip install evalml.

Note it appears this mechanism supports installing additional packages, and potentially updating the versions of previously installed packages, but not uninstalling packages. That's fine.
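
For reference, a setup.py sketch of how that could look for evalml (dependency lists are abbreviated and unpinned, purely for illustration):

```python
from setuptools import setup, find_packages

setup(
    name="evalml",
    packages=find_packages(),
    # core dependencies installed by a plain `pip install evalml` (abbreviated)
    install_requires=["numpy", "pandas", "scipy", "scikit-learn"],
    extras_require={
        # `pip install evalml[complete]` adds these on top of the core deps
        "complete": ["xgboost", "catboost"],
    },
)
```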

Installation options
Here are some ways we could use pip extras to rearrange our required dependencies to optionally exclude third-party libs:

  1. evalml: only install sklearn. evalml[complete]: also include third-party (xgboost/catboost)
  2. evalml: do nothing / error. evalml[minimal]: only install sklearn. evalml[complete]: also include third-party.
  3. There may be a way to rig things up such that plain evalml includes third-party deps but evalml[minimal] does not, but I'm not sure.

Option 1 feels the most appealing. I'm unaware of any good options for handling this stuff beyond sticking with our current setuptools setup.

Code support options
Not necessarily mutually exclusive:

  1. Run import_or_raise at fit-time; have automl skip on failure
  2. Have each pipeline list third-party deps as metadata; run import_or_raise at init-time; have automl exclude pipelines whose imports fail
  3. Introspection: write some code which scans through the packages used by each pipeline class at init-time
  4. Registry: define a singleton class which could provide a central listing of pipelines and could encapsulate some of this functionality. We've discussed this in the design doc/notes for #345

Option 1 feels best to me.
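
For contrast, option 2 could look roughly like the sketch below: pipelines declare their third-party deps as metadata, and automl filters out any whose imports fail before the search starts. The class names and the third_party_libs attribute are illustrative, not an existing evalml API.

```python
import importlib

class XGBoostPipeline:
    third_party_libs = ["xgboost"]

class RFClassificationPipeline:
    third_party_libs = []  # sklearn-only, always available

def available_pipelines(pipeline_classes):
    """Return only the pipelines whose declared third-party imports succeed."""
    available = []
    for cls in pipeline_classes:
        try:
            for lib in getattr(cls, "third_party_libs", []):
                importlib.import_module(lib)
        except ImportError:
            continue  # silently exclude; could log a warning instead
        available.append(cls)
    return available
```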

Questions for consideration

  • If we use setuptools extras, are we comfortable supporting more than one pip install target? We'll need to have test coverage for both.
  • What about preprocessing components which use third-party libraries? How does our strategy need to change, if at all?
  • When do we want checks to happen, at init-time or fit-time?

Proposal

  • Move third-party dependencies to extras, meaning pip install evalml won't install them but pip install evalml[complete] will.
  • Have all third-party pipelines (xgboost/catboost) run import_or_raise (already done)
  • Ensure automl skips third-party pipelines if import_or_raise fails.
  • Testing: update current tests to install the complete version. Add an integration test which installs the minimal version and checks key functionality (see the test sketch after this list). Run that on check-ins to master.
  • Update documentation.
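
One possible shape for that integration-style check, as a pytest sketch; the import path and constructor arguments here are assumptions rather than evalml's exact API:

```python
import sys
import pytest

def test_missing_xgboost_raises(monkeypatch):
    # A sys.modules entry of None makes any subsequent import of xgboost raise ImportError,
    # simulating the minimal install without actually uninstalling the package.
    monkeypatch.setitem(sys.modules, "xgboost", None)
    from evalml.pipelines import XGBoostPipeline  # assumed import path
    with pytest.raises(ImportError):
        XGBoostPipeline(objective="recall")  # assumed constructor signature
```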

Advantages: Allows users to install evalml without third-party libraries.
Disadvantages: The default install isn't as powerful. We'd have to support two install targets. Setting up automl to skip failed pipelines could be ugly.

Long-term, it would be neat if we could find a solution which raises an error or warning earlier than fit-time.

@kmax12 do you have an opinion on this issue?

I did a review of all the dependencies in requirements.txt. I included the size of each library's lib dir in the virtualenv on my mac, to gauge importance.

Packages which are required by evalml and would be quite difficult to make optional:

  • numpy (86MB): required
  • pandas (47MB): required
  • scipy (126MB): required
  • scikit-learn (29MB): required
  • scikit-optimize[plots] (572KB): required for automl tuners
  • category_encoders (776KB): required for one-hot encoder
  • cloudpickle (132KB): required for saving and loading pipelines

Packages which could potentially be removed:

  • joblib (1.9MB): unsure; no direct references. I'll try removing it and report back.
  • plotly (51MB): used to generate pipeline plots in the documentation. We could possibly make it disabled by default.
  • ipywidgets (824KB): used to generate pipeline plots in the documentation. We could possibly make it disabled by default, in conjunction with plotly.
  • dask[complete] (50MB): currently only referenced in a utility used in demo code. When we add distributed support, it will become required for that. We could use import_or_raise there for now though
  • colorama (72KB): used in logging for color definitions. We could probably remove it, but maybe not worth it since it’s under 100KB.
  • tqdm (316KB): powers the console output in automl search. We may be able to update it to simply be disabled by default. But it's a small package.

Conclusion: we could reduce the installed size by a lot if we make more things optional.

Does this affect our plan for handling third-party pipelines? We could move towards supporting multiple extras:

  • pip install evalml: minimal only, no third-party blueprints, no plotting, no distributed support
  • pip install evalml[thirdparty]: include third-party blueprints
  • pip install evalml[plotting]: minimal, plus plotting
  • pip install evalml[distributed]: minimal, plus distributed support
  • pip install evalml[plotting,distributed,thirdparty] or pip install evalml[complete]: include everything

This would work fine. It feels overly complicated though. And it raises concerns about testing: we'd at least need an integration test for each to ensure the library works with the subset of deps.
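
If we did go the multi-extras route, the setup.py config could look roughly like this (package lists illustrative), with complete simply aggregating the others:

```python
# Illustrative extras layout for setup(extras_require=...)
extras_require = {
    "thirdparty": ["xgboost", "catboost"],
    "plotting": ["plotly", "ipywidgets"],
    "distributed": ["dask[complete]"],
}
# "complete" is just the union of every other extra
extras_require["complete"] = sorted({dep for deps in extras_require.values() for dep in deps})
```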

An alternative would be to only support two targets, the minimal default and complete which includes everything else. Users could then install specific dependencies by hand if they wanted a subset of complete.

Solution discussed with @kmax12:

  • To avoid deps: pip install --no-dependencies evalml then install specific deps manually
  • We can even make a Makefile command for that
  • Update documentation to describe how to do this
  • Our code doesn't explode if a package is missing
  • By default: include xgboost/catboost. Include plotting. Exclude dask/distributed (update code to not use that)

An option to consider for the long term: build two packages with separate setup.py configs, a minimal evalml-base and a full evalml with everything.
