Evalml: Docs take a long time to build

Created on 10 Nov 2020  ·  8 Comments  ·  Source: alteryx/evalml

Lately our docs take ~14 minutes to build on CircleCI, whereas they took about 6 minutes in the previous release. The root cause of the slowdown seems to be that woodwork is inferring some categorical variables as text, which then causes AutoML to use the TextFeaturizer. However, even if woodwork (ww) fixes the categorical vs. text inference, the build time will inevitably increase as we write more documentation. This makes it hard for developers to iterate on the docs locally.

Possible solutions:

  • Add some hidden code in the notebook that would skip long-running computations.
  • Have nbsphinx or Read the Docs cache long-running computations.
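
A minimal sketch of the first idea, assuming the docs build sets an environment variable that a hidden notebook cell can check. The variable name DOCS_FAST_BUILD and both helper functions are hypothetical and not part of evalml or nbsphinx. If I recall correctly, nbsphinx can omit a cell from the rendered page when its metadata contains "nbsphinx": "hidden", which is one way such a cell could stay out of the published docs.

```python
import os

# DOCS_FAST_BUILD is a hypothetical variable name chosen for this sketch.
FAST_BUILD_ENV_VAR = "DOCS_FAST_BUILD"


def fast_docs_build() -> bool:
    """Return True when the fast-build variable is set to a truthy value."""
    value = os.environ.get(FAST_BUILD_ENV_VAR, "")
    return value.strip().lower() in {"1", "true", "yes"}


def run_or_skip(compute, placeholder):
    """Call compute() in a normal build; return placeholder in a fast build."""
    if fast_docs_build():
        return placeholder
    return compute()
```

A notebook cell could then wrap an expensive call as run_or_skip(lambda: expensive_step(), None), where expensive_step stands in for whatever the notebook actually runs. The trade-off is that a fast build no longer shows the real output.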

All 8 comments

Yep. I also changed the default automl stopping criterion to max_batches=1 a couple of weeks back, which didn't help.

I like the solutions you listed! Plus one of my own:

  1. Add some hidden code in the notebook that would skip long-running computations. This could be code which mocks pipeline fit/predict. Advantage: works. Disadvantage: may not match what users get when they run by hand, plus hidden code is confusing.
  2. For long-running notebooks, pre-run locally one time and save the output in the notebook. nbsphinx will use a saved execution if one exists instead of rerunning. Advantage: works. Disadvantage: we may forget to periodically update the output.
  3. Simplify / delete some of the notebook content. For example, consider lowering data size, stopping criterion etc. if possible. Advantage: speedups. Disadvantage: can't show full output for some examples, like text.

I recommend we go with option 2, but with option 3 in mind.
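
For option 2, a rough sketch of the relevant setting and a way to see which notebooks would still be executed. As far as I know, nbsphinx reads nbsphinx_execute from conf.py, and "auto" runs only notebooks that have no stored outputs. The has_stored_outputs helper is hypothetical, written for this comment, and only approximates that check by reading the notebook JSON.

```python
import json

# Sphinx conf.py setting: with "auto", nbsphinx executes only notebooks
# that have no stored outputs and reuses saved outputs otherwise.
nbsphinx_execute = "auto"


def has_stored_outputs(notebook_json: str) -> bool:
    """Return True if any code cell in the notebook JSON has saved outputs."""
    notebook = json.loads(notebook_json)
    return any(
        cell.get("cell_type") == "code" and cell.get("outputs")
        for cell in notebook.get("cells", [])
    )
```

Running a check like this over the docs notebooks would show which ones are pre-run and which ones will still execute during the build.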

#1627 was closed as a duplicate, but I think there could still be something there that wasn't covered in this issue, so posting here:

I noticed that docs have been taking much longer to build. I think this is likely because the automl docs were changed in c871f3b to use the fraud dataset instead of the breast cancer dataset (+ elsewhere?) to showcase infer_problem_types, since the breast cancer dataset only has numeric columns.

I suspect this is a different issue / reason for the even-longer build time of docs, from the previous 20 minutes to now >30 minutes, and could be worth mentioning!

@dsherry FYI

Another possible solution is to use multiple processors to build the docs:

https://www.sphinx-doc.org/en/master/man/sphinx-build.html#cmdoption-sphinx-build-j
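
A small sketch of what that invocation could look like, assuming an HTML build. The -j option and its "auto" value come from the sphinx-build docs linked above. The function name and the directory paths are placeholders, not evalml's actual Makefile contents.

```python
def sphinx_build_command(source_dir: str, build_dir: str, jobs="auto") -> list:
    """Build the argument list for a parallel sphinx-build HTML run.

    jobs is passed to -j: an integer process count, or "auto" to use
    the number of available CPUs.
    """
    return ["sphinx-build", "-b", "html", "-j", str(jobs), source_dir, build_dir]


# Example usage (paths are placeholders):
#   import subprocess
#   subprocess.run(sphinx_build_command("docs/source", "docs/build/html"), check=True)
```

Parallel builds only help when the Sphinx extensions in use declare themselves safe for parallel reading, so the speedup is worth measuring before relying on it.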

Update following discussion with @dsherry.

Adding the -j flag to our Makefile allows the build docs test on CircleCI to finish faster, as seen here. Unfortunately, Read the Docs doesn't run this command, which means that the actual generation of the published documentation still takes a while and often errors out.

This is what a successful build looks like on Read the Docs, taking a little over 20 minutes to complete. The difference between the HTML and LaTeX build times suggests that building the Jupyter notebooks themselves does not take a lot of time, which is good.

However, we're also finding instances where the build fails like this. We noticed that, for some reason, Read the Docs is running the full sequence of commands twice, which makes the build take much longer (well over 30 minutes each to create the HTML and LaTeX files) and causes the doc build to fail. I'll follow up with the Read the Docs support team to see why this is happening and how we can fix it, and I'll post the results here when I get feedback.

@bchen1116 contacted support and they said:

It looks like the underlying cause of this bug is the number of active versions that you have. I see a few errors in our logs related to this.
To work around this for now, you might reduce the number of active versions that you keep. It looks like you are building versions for individual branches or pull requests, have you tried our pull request building feature? This would help remove the unneeded versions after building, while still keeping the built content.

I believe the "pull request building feature" referenced here is this, confirming.

Update:
We've updated RTD to build from pull requests only, removing the unnecessary builds to different versions (branches) that we push. Additionally, we've deleted all unnecessary (untagged) versions from RTD (miscellaneous branches that we use for PRs), which seems to have helped the doc builds. We don't notice any docs timing out on builds, so we will close this issue tomorrow unless we begin seeing timeouts again.

@bchen1116 is this closeable now?

Closing now, as there's been no issue with slow doc builds.
