Evalml: Unit tests time out (Dask Instability)

Created on 9 Jun 2021  ·  11 comments  ·  Source: alteryx/evalml

We're currently seeing unit tests run into the 6-hour GH Actions limit. This is bad for obvious reasons.

3.8 core deps 6 hr timeout (in progress)
build_conda_pkg, 3.8 core deps, 3.7 non-core deps 6 hr timeout (in progress)
3.7 non-core deps 6 hr timeout
3.8 non-core deps 6 hr timeout
3.7 non-core deps 1.5 hrs
build_conda_pkg
3.7 non-core deps
3.8

blocker bug testing

Most helpful comment

I'm now seeing the following stack trace in build_conda_pkg:

[gw3] linux -- Python 3.7.10 $PREFIX/bin/python

X_y_binary_cls = (          0         1         2   ...        17        18        19
0  -0.039268  0.131912 -0.211206  ...  1.976989  ...ns], 0     0
1     0
2     1
3     1
4     1
     ..
95    1
96    1
97    1
98    1
99    0
Length: 100, dtype: int64)
cluster = LocalCluster(15c4b3ad, 'tcp://127.0.0.1:45201', workers=0, threads=0, memory=0 B)

    def test_submit_training_jobs_multiple(X_y_binary_cls, cluster):
        """Test that training multiple pipelines using the parallel engine produces the
        same results as the sequential engine."""
        X, y = X_y_binary_cls
        with Client(cluster) as client:
            pipelines = [
                BinaryClassificationPipeline(
                    component_graph=["Logistic Regression Classifier"],
                    parameters={"Logistic Regression Classifier": {"n_jobs": 1}},
                ),
                BinaryClassificationPipeline(component_graph=["Baseline Classifier"]),
                BinaryClassificationPipeline(component_graph=["SVM Classifier"]),
            ]

            def fit_pipelines(pipelines, engine):
                futures = []
                for pipeline in pipelines:
                    futures.append(
                        engine.submit_training_job(
                            X=X, y=y, automl_config=automl_data, pipeline=pipeline
                        )
                    )
                results = [f.get_result() for f in futures]
                return results

            # Verify all pipelines are trained and fitted.
            seq_pipelines = fit_pipelines(pipelines, SequentialEngine())
            for pipeline in seq_pipelines:
                assert pipeline._is_fitted

            # Verify all pipelines are trained and fitted.
>           par_pipelines = fit_pipelines(pipelines, DaskEngine(client=client))

evalml/tests/automl_tests/dask_tests/test_dask_engine.py:103: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
evalml/tests/automl_tests/dask_tests/test_dask_engine.py:94: in fit_pipelines
    results = [f.get_result() for f in futures]
evalml/tests/automl_tests/dask_tests/test_dask_engine.py:94: in <listcomp>
    results = [f.get_result() for f in futures]
evalml/automl/engine/dask_engine.py:30: in get_result
    return self.work.result()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <Future: cancelled, key: train_pipeline-4bd4a99325cd3cc91144f86b64d6503c>
timeout = None

    def result(self, timeout=None):
        """Wait until computation completes, gather result to local process.

        If *timeout* seconds are elapsed before returning, a
        ``dask.distributed.TimeoutError`` is raised.
        """
        if self.client.asynchronous:
            return self.client.sync(self._result, callback_timeout=timeout)

        # shorten error traceback
        result = self.client.sync(self._result, callback_timeout=timeout, raiseit=False)
        if self.status == "error":
            typ, exc, tb = result
            raise exc.with_traceback(tb)
        elif self.status == "cancelled":
>           raise result
E           concurrent.futures._base.CancelledError: train_pipeline-4bd4a99325cd3cc91144f86b64d6503c

This seems to be a known issue in dask: https://github.com/dask/distributed/issues/4612

All 11 comments

Just adding some data from this 3.8 core deps series of checks

github_unittests.txt

One thing I've noticed is that they all pause around the 91-93% completed mark. I doubt there's much value in figuring out which tests those are, but it might be a route to pursue.

Here's another one for 3.9 non-core deps.

github_unittests_2.txt

Thanks for filing, @chukarsten.

Thankfully we can rule out conda as a cause, since this happens in our normal unit test builds and not just in build_conda_pkg.

Is there any other information we should collect that could help us figure this out? A few ideas below:

  • How reliably can we reproduce the timeout? Does it happen 50% of the time we run the unit test job?
  • Which test or tests aren't completing properly? If we can get pytest to log the start and end of each test, we can look at the logs and deduce which test had not ended when the hang occurred. This looked potentially useful.
  • Do we still see these timeouts if we run the tests without any pytest parallelization?
  • This is just a hunch, but what happens if we disable the dask engine tests? I know we've seen some flakes with those recently (#2341).
  • What does CPU and memory utilization look like while the tests are running?
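
The second idea in the list above (logging the start and end of each test) can be sketched as a pair of pytest hooks in a conftest.py plus a small log parser. This is only a sketch and not what the project ended up using: pytest_runtest_logstart and pytest_runtest_logfinish are standard pytest hooks, but the START/END line format, the TEST_PROGRESS_LOG environment variable, the default file name, and the helper unfinished_tests are all made up for illustration.

```python
# conftest.py (sketch): record when each test starts and ends, so a hung
# run shows which test never finished.
import os
import time

DEFAULT_LOG = "test_progress.log"  # hypothetical file name


def _emit(marker, nodeid):
    # Append one line per event; closing the file flushes it to disk, so
    # the line survives even if the CI job is killed later.
    path = os.environ.get("TEST_PROGRESS_LOG", DEFAULT_LOG)
    with open(path, "a") as fh:
        fh.write(f"{marker} {time.strftime('%H:%M:%S')} {nodeid}\n")


def pytest_runtest_logstart(nodeid, location):
    _emit("START", nodeid)


def pytest_runtest_logfinish(nodeid, location):
    _emit("END", nodeid)


def unfinished_tests(log_lines):
    """Return node ids that have a START line but no END line."""
    started = []
    finished = set()
    for line in log_lines:
        parts = line.split(maxsplit=2)
        if len(parts) != 3:
            continue  # not one of our marker lines
        marker, _timestamp, nodeid = parts
        nodeid = nodeid.strip()
        if marker == "START":
            started.append(nodeid)
        elif marker == "END":
            finished.add(nodeid)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return [n for n in dict.fromkeys(started) if n not in finished]
```

After a hung run, passing the lines of the progress file to unfinished_tests would list the candidates. With parallel test workers several processes may append the same event, which is why the parser deduplicates.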

(@freddyaboulton I added you on here since this connects to #2298 and #1815)

Changing the Makefile to do verbose logging with pytest gives us a log which shows the last executed test to be "evalml/tuners/random_search_tuner.py::evalml.tuners.random_search_tuner.RandomSearchTuner".

After adding timeouts, I've seen the same timeout in test_dask_sends_woodwork_schema at least three times:

  1. https://github.com/alteryx/evalml/pull/2374/checks?check_run_id=2804775673
  2. https://github.com/alteryx/evalml/pull/2374/checks?check_run_id=2804202831#step:9:92
  3. https://github.com/alteryx/evalml/runs/2804668851?check_suite_focus=true
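
The traceback quoted in this thread shows the wait happening in Future.result(timeout=None), which blocks indefinitely if the task never completes. One way to turn such a hang into a quick failure is to pass a timeout to that call. The sketch below shows the idea with the standard library's concurrent.futures rather than dask, so it runs anywhere. The helper name result_or_fail and the time limits are illustrative assumptions, and this is not necessarily how the timeouts were added in the PR linked above.

```python
import concurrent.futures
import time


def result_or_fail(future, timeout_seconds):
    """Wait for a future, but raise instead of hanging forever."""
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        # Best effort only: a task that is already running is not stopped.
        future.cancel()
        raise RuntimeError(f"no result after {timeout_seconds}s") from None


if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        fast = pool.submit(lambda: "done")
        print(result_or_fail(fast, timeout_seconds=1.0))

        slow = pool.submit(time.sleep, 0.5)
        try:
            result_or_fail(slow, timeout_seconds=0.1)
        except RuntimeError as exc:
            print(exc)
```

A failure raised this way names the step that stalled, which is easier to act on than a job that is killed at the 6-hour limit.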

I think @freddyaboulton is definitely onto something here, and we're pointing firmly at Dask. I made this PR to separate out the dask unit tests. I think we have the option of not blocking merges when they fail. This PR still failed on test_automl_immediate_quit, which is still in the array of dask tests.

Looking into the root cause of the dask unit test failures is puzzling. The logs generate a lot of this:

distributed.worker - WARNING - Could not find data: {'Series-32a3ef2ca4739b46a6acc2ac58638b32': ['tcp://127.0.0.1:45587']} on workers: [] (who_has: {'Series-32a3ef2ca4739b46a6acc2ac58638b32': ['tcp://127.0.0.1:45587']})
distributed.scheduler - WARNING - Communication failed during replication: {'status': 'missing-data', 'keys'

Why does this happen? Well, it seems that wherever the data is being acted upon, the reference to that data is being lost. Additionally, 'workers: []' suggests that perhaps the nanny process is killing the workers. I suspect there's something going on with how the data is scattered, but I'm also suspicious of what's happening under the covers with these four jobs running together in pseudo-parallel/series.
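
To check how often the workers list is actually empty, the warnings can be tallied from a saved log. A minimal sketch, assuming the log lines have the shape of the distributed.worker warning quoted above; the regular expression and the function name tally_missing_data are illustrative and not part of dask.

```python
import re
from collections import Counter

# Matches the distributed.worker warning quoted above and captures the
# contents of the "on workers: [...]" list.
MISSING_DATA = re.compile(
    r"distributed\.worker - WARNING - Could not find data: .* on workers: \[(.*?)\]"
)


def tally_missing_data(log_lines):
    """Count the warnings by whether the workers list was empty."""
    counts = Counter()
    for line in log_lines:
        match = MISSING_DATA.search(line)
        if match is None:
            continue  # some other log line
        if match.group(1).strip() == "":
            counts["no_workers"] += 1
        else:
            counts["has_workers"] += 1
    return counts
```

If nearly every warning falls in the no_workers bucket, that would be consistent with the workers having gone away, though it would not by itself show what removed them.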

This dask distributed issue suggests disabling adaptive scaling for the cluster. Unfortunately, we don't use adaptive clusters, just regular local, static clusters, so that isn't the problem. This issue involves the scattering of the data...

After trying #2376 to separate out the dask jobs and setting broadcast=False for the DaskEngine's client by default, I'm seeing a flaky test failure in test_automl_immediate_quit. Documented here

I deleted my old post, but here's a red one: https://github.com/alteryx/evalml/actions/runs/939673304, which seems to be the same stack trace that @freddyaboulton posted above.

I think this issue is no longer blocking, per this PR (https://github.com/alteryx/evalml/pull/2376) to separate out the dask jobs, this PR to refactor the dask jobs to cut down on flakes, this PR to make the separate dask jobs non-blocking for merges to main, and

Moving this to closed because the dask-related timeouts are no longer an issue and shouldn't be for the foreseeable future. However, the root cause is still unknown.
