Evalml: Unit tests time out (Dask instability)

Created on 2021-06-09 · 11 comments · Source: alteryx/evalml

We're currently seeing unit tests hit the 6-hour GH Actions limit. This is not good, for obvious reasons.

3.8 core deps 6 hr timeout (in progress)
build_conda_pkg, 3.8 core deps, 3.7 non-core deps 6 hr timeout (in progress)
3.7 non-core deps 6 hr timeout
3.8 non-core deps 6 hr timeout
3.7 non-core deps 1.5 hrs
build_conda_pkg
3.7 non-core deps
3.8

blocker bug testing

Source

chukarsten

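All 11 comments

Just adding some data from this 3.8 core deps run:

github_unittests.txt

One thing I think I've noticed is that the runs all stall around the 91-93% completion mark. I doubt there's much value in figuring out which tests those are, but it might be an avenue to pursue.

chukarsten on 2021-06-10

Here's the 3.9 non-core deps run: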

github_unittests_2.txt

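chukarsten on 2021-06-10

Thanks for filing, @chukarsten

Thankfully, we can rule out conda as the cause, since this happens in our normal unit test builds and not just in build_conda_pkg.

Is there any other information we should collect that could help us figure this out? Some ideas below:

How reliably can we reproduce the timeout? Does it happen 50% of the time we run the unit test jobs, more, less?
Which test or tests are not completing properly? If we can get pytest to log the start and end of each test, we can infer from the logs which test had not finished when the hang occurred. This looks like it could be useful.
Do we still see these timeouts if we run the tests without any pytest parallelization?
This is just a hunch, but what happens if we disable the dask engine tests? I know we've seen some flaky tests there recently (#2341).
What do CPU and memory utilization look like while the tests are running?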
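One way to get the per-test start/end logging suggested above is a pair of pytest hooks in conftest.py. The sketch below is only an illustration, not what the evalml Makefile or CI actually does: the log file name and the find_unfinished helper are hypothetical, while pytest_runtest_logstart and pytest_runtest_logfinish are standard pytest hook names.

```python
# conftest.py (sketch). LOG_PATH and find_unfinished are hypothetical names.
LOG_PATH = "test_progress.log"


def _append(line, path=LOG_PATH):
    # Append and close immediately so the entry survives a later hang.
    with open(path, "a") as fh:
        fh.write(line + "\n")


def pytest_runtest_logstart(nodeid, location):
    # Standard pytest hook: called when a test item starts running.
    _append(f"START {nodeid}")


def pytest_runtest_logfinish(nodeid, location):
    # Standard pytest hook: called after setup, call and teardown of an item.
    _append(f"END {nodeid}")


def find_unfinished(lines):
    """Return node ids that have a START entry but no END entry."""
    started = {}      # dict keeps first-seen order and drops duplicates
    finished = set()
    for line in lines:
        kind, _, nodeid = line.strip().partition(" ")
        if kind == "START":
            started.setdefault(nodeid, None)
        elif kind == "END":
            finished.add(nodeid)
    return [nodeid for nodeid in started if nodeid not in finished]
```

After a hung run, feeding the lines of the log file to find_unfinished lists the candidates for the hanging test. Entries are matched by node id, so interleaved output from parallel workers should not confuse the result.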

dsherry on 2021-06-10

(@freddyaboulton I added you here since this connects to #2298 and #1815)

dsherry on 2021-06-10

After changing the Makefile to run pytest with verbose logging, we get the following logs. They show that the last test executed is "evalml/tuners/random_search_tuner.py::evalml.tuners.random_search_tuner.RandomSearchTuner".

chukarsten on 2021-06-11

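After adding a timeout, I've seen the same timeout on test_dask_sends_woodwork_schema at least 3 times:

freddyaboulton on 2021-06-11

I think @freddyaboulton is definitely onto something here, and we're pointing firmly at Dask. I made this PR to separate out the dask unit tests. I think we have the option of not blocking merges when they fail. This PR fails on test_automl_immediate_quit, which is still in the dask test array.

Investigating the root cause of the dask unit test failures has been puzzling. The logs produce a lot of this: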

distributed.worker - WARNING - Could not find data: {'Series-32a3ef2ca4739b46a6acc2ac58638b32': ['tcp://127.0.0.1:45587']} on workers: [] (who_has: {'Series-32a3ef2ca4739b46a6acc2ac58638b32': ['tcp://127.0.0.1:45587']})
distributed.scheduler - WARNING - Communication failed during replication: {'status': 'missing-data', 'keys'

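Why does this happen? Well, it seems that wherever the data is being processed has lost its reference to that data. Additionally, "workers: []" suggests that the nanny process may be killing the workers. I'm suspicious of how the data is scattered, but I'm also suspicious of what happens when these four jobs run together in pseudo-parallel/serial.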
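To see whether the "Could not find data" warnings cluster on particular worker addresses, and how often the workers list is empty, the log can be tallied with a short script. This is a sketch that assumes the warning format shown in the two log lines above; the summarize_missing helper is a hypothetical name and is not part of dask or evalml.

```python
import ast
import re
from collections import Counter

# Matches the who-has dict and the workers list in a "Could not find data" warning.
PATTERN = re.compile(r"Could not find data: (\{.*?\}) on workers: (\[.*?\])")


def summarize_missing(lines):
    """Count missing keys per worker address and warnings with no workers."""
    per_address = Counter()
    empty_worker_warnings = 0
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # not a "Could not find data" warning
        who_has = ast.literal_eval(match.group(1))
        workers = ast.literal_eval(match.group(2))
        if not workers:
            empty_worker_warnings += 1
        for addresses in who_has.values():
            per_address.update(addresses)
    return per_address, empty_worker_warnings
```

If nearly every warning reports an empty workers list, that would be consistent with the workers having gone away rather than with a single key being misplaced, though it would not prove it.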

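This dask distributed issue suggests disabling adaptive scaling for the cluster. Unfortunately, we don't use adaptive clusters, just regular local static clusters, so that isn't our problem. This issue points out that the scattering of the data is

chukarsten on 2021-06-14

While trying #2376 to separate out the dask jobs, and with broadcast=False set by default for the DaskEngine's client, the test_automl_immediate_quit test fails. Documented here.

chukarsten on 2021-06-14

I'm now seeing the following stack trace in build_conda_pkg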

[gw3] linux -- Python 3.7.10 $PREFIX/bin/python

X_y_binary_cls = (          0         1         2   ...        17        18        19
0  -0.039268  0.131912 -0.211206  ...  1.976989  ...ns], 0     0
1     0
2     1
3     1
4     1
     ..
95    1
96    1
97    1
98    1
99    0
Length: 100, dtype: int64)
cluster = LocalCluster(15c4b3ad, 'tcp://127.0.0.1:45201', workers=0, threads=0, memory=0 B)

    def test_submit_training_jobs_multiple(X_y_binary_cls, cluster):
        """Test that training multiple pipelines using the parallel engine produces the
        same results as the sequential engine."""
        X, y = X_y_binary_cls
        with Client(cluster) as client:
            pipelines = [
                BinaryClassificationPipeline(
                    component_graph=["Logistic Regression Classifier"],
                    parameters={"Logistic Regression Classifier": {"n_jobs": 1}},
                ),
                BinaryClassificationPipeline(component_graph=["Baseline Classifier"]),
                BinaryClassificationPipeline(component_graph=["SVM Classifier"]),
            ]

            def fit_pipelines(pipelines, engine):
                futures = []
                for pipeline in pipelines:
                    futures.append(
                        engine.submit_training_job(
                            X=X, y=y, automl_config=automl_data, pipeline=pipeline
                        )
                    )
                results = [f.get_result() for f in futures]
                return results

            # Verify all pipelines are trained and fitted.
            seq_pipelines = fit_pipelines(pipelines, SequentialEngine())
            for pipeline in seq_pipelines:
                assert pipeline._is_fitted

            # Verify all pipelines are trained and fitted.
>           par_pipelines = fit_pipelines(pipelines, DaskEngine(client=client))

evalml/tests/automl_tests/dask_tests/test_dask_engine.py:103: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
evalml/tests/automl_tests/dask_tests/test_dask_engine.py:94: in fit_pipelines
    results = [f.get_result() for f in futures]
evalml/tests/automl_tests/dask_tests/test_dask_engine.py:94: in <listcomp>
    results = [f.get_result() for f in futures]
evalml/automl/engine/dask_engine.py:30: in get_result
    return self.work.result()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <Future: cancelled, key: train_pipeline-4bd4a99325cd3cc91144f86b64d6503c>
timeout = None

    def result(self, timeout=None):
        """Wait until computation completes, gather result to local process.

        If *timeout* seconds are elapsed before returning, a
        ``dask.distributed.TimeoutError`` is raised.
        """
        if self.client.asynchronous:
            return self.client.sync(self._result, callback_timeout=timeout)

        # shorten error traceback
        result = self.client.sync(self._result, callback_timeout=timeout, raiseit=False)
        if self.status == "error":
            typ, exc, tb = result
            raise exc.with_traceback(tb)
        elif self.status == "cancelled":
>           raise result
E           concurrent.futures._base.CancelledError: train_pipeline-4bd4a99325cd3cc91144f86b64d6503c

This appears to be a known issue in dask: https://github.com/dask/distributed/issues/4612
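The trace above shows result() being called with timeout=None, so a future that never resolves would block indefinitely, while a cancelled one raises CancelledError. A minimal sketch of waiting with a bound and reporting cancellations instead of crashing is below. It uses the standard library's concurrent.futures as a stand-in, not the dask or evalml API, and gather_results is a hypothetical helper.

```python
import concurrent.futures as cf


def gather_results(futures, timeout_s):
    """Collect (status, value) pairs without blocking forever on any future."""
    results = []
    for future in futures:
        try:
            results.append(("ok", future.result(timeout=timeout_s)))
        except cf.CancelledError:
            # The future was cancelled before it produced a value.
            results.append(("cancelled", None))
        except cf.TimeoutError:
            # The future did not finish within timeout_s seconds.
            results.append(("timeout", None))
    return results
```

A test using something like this could fail quickly with a clear status instead of hanging until the CI job limit. Whether the same handling is appropriate inside the engine's get_result is a separate design question.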

freddyaboulton on 2021-06-15

👍2

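Deleted my old post, but here's a red one: https : same stack trace as @freddyaboulton's.

angela97lin on 2021-06-15

I believe this issue is no longer blocking, per [this PR] to separate out the dask jobs (https://github.com/alteryx/evalml/pull/2376), this PR to refactor the dask jobs to reduce flakiness, this PR to make the separate dask jobs non-blocking for merges to main, and this PR to add a timeout so that pathological dask tests no longer run for 6 hours before being cancelled by GH Actions.

Moving this to closed, since the dask-related timeouts are no longer an issue and shouldn't be for the foreseeable future. However, the root cause is still unknown.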
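The idea of a timeout guard, so that a hung test run fails fast instead of consuming the full 6 hours, can be sketched with the standard library alone. This is an illustration of the general technique, not the mechanism the PR actually uses; run_with_timeout is a hypothetical helper, and the 45-minute budget in the usage comment is an assumed value.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd, returning 'passed', 'failed', or 'timeout'."""
    try:
        # subprocess.run kills the child and raises if timeout_s elapses.
        proc = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "passed" if proc.returncode == 0 else "failed"


# Example usage (assumed budget of 45 minutes for the dask tests):
# status = run_with_timeout(
#     [sys.executable, "-m", "pytest", "evalml/tests/automl_tests/dask_tests"],
#     timeout_s=45 * 60,
# )
```

A wrapper like this bounds the whole run. A per-test limit would localize the hang more precisely, at the cost of adding a plugin or custom hook.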

chukarsten on 2021-06-17
