Evalml: Unit Test Timeouts (Dask Instability)

Created on 9 Jun 2021  ·  11 comments  ·  Source: alteryx/evalml

We're currently seeing unit test jobs run until they hit the GH Actions limit of 6 hours. This is not good for obvious reasons.

  • 3.8 core deps: 6 hr timeout (in progress)
  • build_conda_pkg, 3.8 core deps, 3.7 non-core deps: 6 hr timeout (in progress)
  • 3.7 non-core deps: 6 hr timeout
  • 3.8 non-core deps: 6 hr timeout
  • 3.7 non-core deps: 1.5 hrs
  • build_conda_pkg
  • 3.7 non-core deps
  • 3.8

Labels: blocker, bug, testing


All 11 comments

Just adding some data from this 3.8 core deps run's series of checks. Attaching the logs from that run.

github_unittests.txt

One thing I've noticed is that they're all pausing around the 91-93% completion mark. I doubt there's much value in figuring out exactly which tests those are, but it might be a route to pursue.

Here's another for 3.9 non-core deps.

github_unittests_2.txt

Thanks for filing @chukarsten

Thankfully we can rule out conda as a cause, since this happens for our normal unit test builds and not just for build_conda_pkg.

Is there any other info we should collect that could help us figure this out? A few ideas below:

  • How reliably can we repro the timeout? Is it happening 50% of the time we run the unit test job, more, or less?
  • Which test or tests are not completing properly? If we can get pytest to log the start and end of each test, we can look at the logs and deduce which test has not ended when the hang occurs (see the sketch at the end of this comment). This looked potentially useful.
  • Do we still see these timeouts if we run tests without any pytest parallelization?
  • This is just a hunch, but what happens if we disable the dask engine tests? I know we've seen some flakes with those recently #2341
  • What does CPU and memory utilization look like while the tests are running?

(@freddyaboulton I added you on here since this connects to #2298 and #1815)
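
To make the per-test logging idea concrete, here is a minimal sketch, assuming a hypothetical conftest.py (not the project's actual configuration), of how pytest hooks could record the start and end of every test so the hanging test shows up as the last "STARTING" entry without a matching "FINISHED":

```python
# conftest.py -- hypothetical sketch: log when each test starts and finishes.
import logging

logger = logging.getLogger("test-progress")


def pytest_runtest_logstart(nodeid, location):
    # Called by pytest right before a test item runs.
    logger.warning("STARTING %s", nodeid)


def pytest_runtest_logfinish(nodeid, location):
    # Called by pytest after the test item finishes (pass, fail, or error).
    logger.warning("FINISHED %s", nodeid)
```

Combined with something like `-o log_cli=true`, these lines would stream live into the Actions log.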

After changing the Makefile to do verbose logging with pytest, we get the following log. It shows the last executed test to be "evalml/tuners/random_search_tuner.py::evalml.tuners.random_search_tuner.RandomSearchTuner".
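
For reference, a rough sketch of the kind of verbose invocation the Makefile change amounts to (the flags shown are illustrative, not the exact target):

```python
# Hypothetical illustration of a verbose pytest run; the real Makefile target
# passes equivalent flags on the command line.
import pytest

pytest.main([
    "evalml/tests",
    "-v",                  # print each test id as it runs
    "-o", "log_cli=true",  # stream log records to the console in real time
    "--durations=0",       # report the runtime of every test at the end
])
```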

I think @freddyaboulton is certainly onto something here, and we're pointing firmly at Dask. I made this PR to separate out the dask unit tests, which gives us the option of not blocking merge when they fail. That PR failed on test_automl_immediate_quit, which is still in the set of dask tests.

Looking into the root cause of the dask unit test failures is puzzling. The logs contain a lot of this:

distributed.worker - WARNING - Could not find data: {'Series-32a3ef2ca4739b46a6acc2ac58638b32': ['tcp://127.0.0.1:45587']} on workers: [] (who_has: {'Series-32a3ef2ca4739b46a6acc2ac58638b32': ['tcp://127.0.0.1:45587']})
distributed.scheduler - WARNING - Communication failed during replication: {'status': 'missing-data', 'keys'

Why does this happen? It seems that whatever holds the data being acted upon is losing its reference to it. Additionally, the 'workers: []' suggests that the nanny process may be killing the workers. I suspect there's something going on with how the data is scattered, but I'm also suspicious of what's happening under the covers with these four jobs running together in pseudo parallel/series.
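
For context, a minimal, hypothetical sketch of the scatter pattern under suspicion (this is illustrative, not evalml's DaskEngine code): data is scattered to a single worker, and any task that depends on that key can come back cancelled if the worker holding it dies.

```python
from distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=2, threads_per_worker=1)
client = Client(cluster)

big_list = list(range(100_000))
# Wrapping in a list makes scatter treat it as one object, held on a single
# worker (broadcast=False). If the nanny restarts that worker before the task
# runs, the key goes missing ("Could not find data ... on workers: []") and
# dependent futures can be cancelled.
[data_future] = client.scatter([big_list], broadcast=False)

total_future = client.submit(sum, data_future)
print(total_future.result())

client.close()
cluster.close()
```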

This dask distributed issue suggests disabling adaptive scaling for the cluster. Unfortunately, we don't use adaptive clusters, just regular, static local clusters, so that's not the issue. This other issue points at the scattering of the data as a potential cause, with workers being abandoned, but we're not getting the same connection errors.
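
To make the distinction explicit, here is a hypothetical illustration (values are made up): the test fixtures use a fixed-size LocalCluster rather than an adaptive one, so the adaptive-scaling workaround from the linked issue doesn't apply.

```python
from distributed import LocalCluster

# Fixed-size cluster, the kind the dask test fixtures use (illustrative sizes).
static_cluster = LocalCluster(n_workers=2, threads_per_worker=1)

# Adaptive cluster, the configuration the linked issue suggests avoiding:
# workers are added and removed on demand between the given bounds.
adaptive_cluster = LocalCluster(n_workers=0, threads_per_worker=1)
adaptive_cluster.adapt(minimum=0, maximum=4)

static_cluster.close()
adaptive_cluster.close()
```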

After trying #2376 to separate out the dask jobs and setting broadcast=False by default for the DaskEngine's client, I got a flaky test failure with test_automl_immediate_quit. Documented here.

I am now seeing the following stacktrace in build_conda_pkg

[gw3] linux -- Python 3.7.10 $PREFIX/bin/python

X_y_binary_cls = (          0         1         2   ...        17        18        19
0  -0.039268  0.131912 -0.211206  ...  1.976989  ...ns], 0     0
1     0
2     1
3     1
4     1
     ..
95    1
96    1
97    1
98    1
99    0
Length: 100, dtype: int64)
cluster = LocalCluster(15c4b3ad, 'tcp://127.0.0.1:45201', workers=0, threads=0, memory=0 B)

    def test_submit_training_jobs_multiple(X_y_binary_cls, cluster):
        """Test that training multiple pipelines using the parallel engine produces the
        same results as the sequential engine."""
        X, y = X_y_binary_cls
        with Client(cluster) as client:
            pipelines = [
                BinaryClassificationPipeline(
                    component_graph=["Logistic Regression Classifier"],
                    parameters={"Logistic Regression Classifier": {"n_jobs": 1}},
                ),
                BinaryClassificationPipeline(component_graph=["Baseline Classifier"]),
                BinaryClassificationPipeline(component_graph=["SVM Classifier"]),
            ]

            def fit_pipelines(pipelines, engine):
                futures = []
                for pipeline in pipelines:
                    futures.append(
                        engine.submit_training_job(
                            X=X, y=y, automl_config=automl_data, pipeline=pipeline
                        )
                    )
                results = [f.get_result() for f in futures]
                return results

            # Verify all pipelines are trained and fitted.
            seq_pipelines = fit_pipelines(pipelines, SequentialEngine())
            for pipeline in seq_pipelines:
                assert pipeline._is_fitted

            # Verify all pipelines are trained and fitted.
>           par_pipelines = fit_pipelines(pipelines, DaskEngine(client=client))

evalml/tests/automl_tests/dask_tests/test_dask_engine.py:103: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
evalml/tests/automl_tests/dask_tests/test_dask_engine.py:94: in fit_pipelines
    results = [f.get_result() for f in futures]
evalml/tests/automl_tests/dask_tests/test_dask_engine.py:94: in <listcomp>
    results = [f.get_result() for f in futures]
evalml/automl/engine/dask_engine.py:30: in get_result
    return self.work.result()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <Future: cancelled, key: train_pipeline-4bd4a99325cd3cc91144f86b64d6503c>
timeout = None

    def result(self, timeout=None):
        """Wait until computation completes, gather result to local process.

        If *timeout* seconds are elapsed before returning, a
        ``dask.distributed.TimeoutError`` is raised.
        """
        if self.client.asynchronous:
            return self.client.sync(self._result, callback_timeout=timeout)

        # shorten error traceback
        result = self.client.sync(self._result, callback_timeout=timeout, raiseit=False)
        if self.status == "error":
            typ, exc, tb = result
            raise exc.with_traceback(tb)
        elif self.status == "cancelled":
>           raise result
E           concurrent.futures._base.CancelledError: train_pipeline-4bd4a99325cd3cc91144f86b64d6503c

This seems to be a known issue in dask: https://github.com/dask/distributed/issues/4612
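
For reference, a minimal, hypothetical illustration of how that CancelledError surfaces from the distributed Future API (here the future is cancelled explicitly; in the CI failure the scheduler cancels it after the underlying data goes missing):

```python
import time
from concurrent.futures import CancelledError

from distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=1, threads_per_worker=1)
client = Client(cluster)

future = client.submit(time.sleep, 10)
future.cancel()  # once cancelled, .result() raises instead of returning

try:
    future.result()
except CancelledError as err:
    print("cancelled:", err)

client.close()
cluster.close()
```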

Deleted my old post, but here's a red one: https://github.com/alteryx/evalml/actions/runs/939673304. It seems to be the same stack trace @freddyaboulton posted above.

I believe this issue no longer blocks, per this PR to separate out the dask jobs (https://github.com/alteryx/evalml/pull/2376), this PR to refactor the dask jobs to cut down on flakes, this PR to make the separate dask jobs non-blocking for merge to main, and this PR to add a timeout so pathological dask tests fail fast instead of taking 6 hours before GH Actions cancels them.
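
As a sketch of that timeout approach (assuming the pytest-timeout plugin is available; the actual mechanism and value in the PR may differ), a per-test timeout makes a hung dask test fail quickly instead of running into the 6-hour job limit:

```python
import pytest


# Hypothetical example, not an actual evalml test: with pytest-timeout installed,
# the test is aborted and reported as failed once it exceeds the limit.
@pytest.mark.timeout(450)  # seconds; illustrative value
def test_dask_job_finishes_quickly():
    assert 1 + 1 == 2
```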

Moving this to closed because the dask-related timeouts are no longer an issue and shouldn't be for the foreseeable future. However, the underlying cause is still unknown.
