Evalml: Unit test timeouts (Dask instability)

Created on 9 Jun 2021 · 11 comments · Source: alteryx/evalml

We're currently seeing unit tests run up to the GH Actions limit of 6 hours. This is not good for obvious reasons.

3.8 core deps 6 hr timeout (in progress)
build_conda_pkg, 3.8 core deps, 3.7 non-core deps 6 hr timeout (in progress)
3.7 timeout
3.8 non-core deps 6 hr timeout
3.7 non-core deps 1.5 hrs
build_conda_pkg
3.7 non-core deps
3.8

blocker bug testing

chukarsten

Most helpful comment

I'm now seeing the following stack trace in build_conda_pkg

[gw3] linux -- Python 3.7.10 $PREFIX/bin/python

X_y_binary_cls = (          0         1         2   ...        17        18        19
0  -0.039268  0.131912 -0.211206  ...  1.976989  ...ns], 0     0
1     0
2     1
3     1
4     1
     ..
95    1
96    1
97    1
98    1
99    0
Length: 100, dtype: int64)
cluster = LocalCluster(15c4b3ad, 'tcp://127.0.0.1:45201', workers=0, threads=0, memory=0 B)

    def test_submit_training_jobs_multiple(X_y_binary_cls, cluster):
        """Test that training multiple pipelines using the parallel engine produces the
        same results as the sequential engine."""
        X, y = X_y_binary_cls
        with Client(cluster) as client:
            pipelines = [
                BinaryClassificationPipeline(
                    component_graph=["Logistic Regression Classifier"],
                    parameters={"Logistic Regression Classifier": {"n_jobs": 1}},
                ),
                BinaryClassificationPipeline(component_graph=["Baseline Classifier"]),
                BinaryClassificationPipeline(component_graph=["SVM Classifier"]),
            ]

            def fit_pipelines(pipelines, engine):
                futures = []
                for pipeline in pipelines:
                    futures.append(
                        engine.submit_training_job(
                            X=X, y=y, automl_config=automl_data, pipeline=pipeline
                        )
                    )
                results = [f.get_result() for f in futures]
                return results

            # Verify all pipelines are trained and fitted.
            seq_pipelines = fit_pipelines(pipelines, SequentialEngine())
            for pipeline in seq_pipelines:
                assert pipeline._is_fitted

            # Verify all pipelines are trained and fitted.
>           par_pipelines = fit_pipelines(pipelines, DaskEngine(client=client))

evalml/tests/automl_tests/dask_tests/test_dask_engine.py:103: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
evalml/tests/automl_tests/dask_tests/test_dask_engine.py:94: in fit_pipelines
    results = [f.get_result() for f in futures]
evalml/tests/automl_tests/dask_tests/test_dask_engine.py:94: in <listcomp>
    results = [f.get_result() for f in futures]
evalml/automl/engine/dask_engine.py:30: in get_result
    return self.work.result()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <Future: cancelled, key: train_pipeline-4bd4a99325cd3cc91144f86b64d6503c>
timeout = None

    def result(self, timeout=None):
        """Wait until computation completes, gather result to local process.

        If *timeout* seconds are elapsed before returning, a
        ``dask.distributed.TimeoutError`` is raised.
        """
        if self.client.asynchronous:
            return self.client.sync(self._result, callback_timeout=timeout)

        # shorten error traceback
        result = self.client.sync(self._result, callback_timeout=timeout, raiseit=False)
        if self.status == "error":
            typ, exc, tb = result
            raise exc.with_traceback(tb)
        elif self.status == "cancelled":
>           raise result
E           concurrent.futures._base.CancelledError: train_pipeline-4bd4a99325cd3cc91144f86b64d6503c

This looks like a known issue in dask: https://github.com/dask/distributed/issues/4612
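The trace above ends in a `CancelledError` raised from `result()` with `timeout = None`, so the caller waits indefinitely and has no way to tell a cancelled future from a slow one. Below is a minimal sketch of a defensive wrapper that separates the two cases. It uses plain standard-library futures as a stand-in for dask futures so that it runs without a cluster. The helper name `get_result_status` is hypothetical and is not part of evalml or dask, and the exception classes raised by dask may differ from the standard-library ones caught here.

```python
from concurrent.futures import CancelledError, Future
from concurrent.futures import TimeoutError as FutureTimeoutError


def get_result_status(future, timeout=None):
    """Return (status, value) instead of raising on cancellation or timeout.

    Hypothetical helper for illustration only; not evalml or dask API.
    """
    try:
        return ("ok", future.result(timeout=timeout))
    except CancelledError:
        # Corresponds to the "<Future: cancelled, ...>" state in the trace.
        return ("cancelled", None)
    except FutureTimeoutError:
        # The future was still pending when the timeout elapsed.
        return ("timeout", None)


# Demonstration with standard-library futures (assumption: stand-ins for
# the futures returned by the parallel engine).
done = Future()
done.set_result("fitted")

cancelled = Future()
cancelled.cancel()  # a pending stdlib Future can be cancelled directly

results = [get_result_status(f, timeout=1) for f in (done, cancelled)]
```

With a bounded timeout, a job that never completes surfaces as a `"timeout"` status after a known delay rather than holding the test run open.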

freddyaboulton on 15 Jun 2021

👍2

All 11 comments

Just adding some data from this 3.8 core deps run of checks. Adding the logs from that run.

github_unittests.txt

One thing I've noticed is that they all seem to stall around the 91-93% completed mark. I doubt there's much value in working out which tests those are, but that might be a route to pursue.

chukarsten on 10 Jun 2021

Here's another one, from 3.9 non-core deps.

github_unittests_2.txt

chukarsten on 10 Jun 2021

Thanks for filing, chukarsten.

Thankfully, we can rule out conda as a cause, since this happens for our normal unit test builds and not just for build_conda_pkg.

Is there any other info we should collect that could help us figure this out? Some ideas below:

How reliably can we reproduce the timeout? Does it happen 50% of the time we run the unit test job, more, less?
Which test or tests are not completing properly? If we can get pytest to log the start and end of each test, we can look at the logs and deduce which test had not finished when the hang occurred. This seems likely to be useful.
Do we still see these timeouts if we run the tests without any pytest parallelization?
This is just a hunch, but what happens if we disable the dask engine tests? I know we've seen some flakes with those (#2341) recently.
What do CPU and memory utilization look like while the tests are running?
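The second idea in the list above (logging the start and end of each test) can be sketched with pytest's `pytest_runtest_logstart` and `pytest_runtest_logfinish` hooks placed in a `conftest.py`. This is only an illustration: the log file name `test_progress.log` and the helper `find_unfinished` are assumptions introduced here, not something the evalml repository is known to use, and behaviour under pytest-xdist may need checking.

```python
import time

# Hypothetical log location; writing to a file avoids pytest's output capture.
LOG_PATH = "test_progress.log"


def _log(marker, nodeid, path=LOG_PATH):
    """Append one 'MARKER HH:MM:SS nodeid' line to the progress log."""
    with open(path, "a") as handle:
        handle.write(f"{marker} {time.strftime('%H:%M:%S')} {nodeid}\n")


def pytest_runtest_logstart(nodeid, location):
    """pytest hook: called at the start of running a single test item."""
    _log("START", nodeid)


def pytest_runtest_logfinish(nodeid, location):
    """pytest hook: called at the end of running a single test item."""
    _log("END", nodeid)


def find_unfinished(log_lines):
    """Return node ids that logged START but never END, in start order."""
    started = {}
    for line in log_lines:
        parts = line.strip().split(maxsplit=2)
        if len(parts) != 3:
            continue  # not a progress line
        marker, _timestamp, nodeid = parts
        if marker == "START":
            started[nodeid] = True
        elif marker == "END":
            started.pop(nodeid, None)
    return list(started)
```

After a hung run, passing the lines of the log file to `find_unfinished` lists the tests that were still in flight when the job was killed.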

dsherry on 10 Jun 2021

(freddyaboulton I've

dsherry on 10 Jun 2021

Changing the Makefile to do verbose logging with pytest, we get the following log. This shows the last executed test to be "evalml/tuners/random_search_tuner.py::evalml.tuners.random_search_tuner.RandomSearchTuner".

chukarsten on 11 Jun 2021

After adding the timeout, I saw the same timeout on test_dask_sends_woodwork_schema at least three times:

freddyaboulton on 11 Jun 2021

I think freddyaboulton is definitely onto something here and the evidence points strongly at Dask. I made this PR to separate out the dask unit tests. I think we have the option of not blocking merge when they fail. This PR failed on test_automl_immediate_quit, which is still in the dask test suite.

Looking into the root cause of the dask unit test failures is puzzling. The logs generate a lot of this:

distributed.worker - WARNING - Could not find data: {'Series-32a3ef2ca4739b46a6acc2ac58638b32': ['tcp://127.0.0.1:45587']} on workers: [] (who_has: {'Series-32a3ef2ca4739b46a6acc2ac58638b32': ['tcp://127.0.0.1:45587']})
distributed.scheduler - WARNING - Communication failed during replication: {'status': 'missing-data', 'keys'

Why does this happen? Well, it seems that wherever the data is being acted upon, the reference to that data is being lost. Additionally, the 'workers: []' suggests that perhaps the nanny process is killing the workers. I suspect there's something going on with how the data is scattered, but I'm also suspicious of what's happening under the covers with these four jobs running together in pseudo parallel/series.
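To check how often the warning appears with an empty worker list, the CI log could be scanned with a small script. The sketch below is matched only against the single warning line quoted above, so the regular expression is an assumption about the message format. The function name `summarize_missing_data` is hypothetical.

```python
import re

# Captures the text inside "on workers: [...]" of the distributed.worker
# warning quoted above. Assumption: the message format matches that sample.
MISSING_DATA = re.compile(r"Could not find data: .* on workers: \[(.*?)\]")


def summarize_missing_data(log_lines):
    """Count 'Could not find data' warnings and how many list no workers."""
    total = 0
    with_no_workers = 0
    for line in log_lines:
        match = MISSING_DATA.search(line)
        if match is None:
            continue
        total += 1
        if match.group(1).strip() == "":
            with_no_workers += 1
    return {"missing_data_warnings": total, "with_no_workers": with_no_workers}
```

A high share of warnings with no workers listed would be consistent with the workers having gone away, although it would not by itself show what removed them.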

This dask distributed issue suggests disabling adaptive scaling for the cluster. Unfortunately, we don't use adaptive clusters, just regular static local clusters, so that's not the problem. This issue points to scattering of the data as a potential cause, with workers being abandoned, but we're not getting the same connection errors.

chukarsten on 14 Jun 2021

After trying #2376 to separate out the dask jobs and setting broadcast=False for the DaskEngine's client by default, I have a flaky test failure with test_automl_immediate_quit. Documented here.

chukarsten on 14 Jun 2021

I deleted my old post, but here's a red run: https://github.com/alteryx/evalml/actions/runs/939673304. It looks like the same stack trace freddyaboulton posted above.

angela97lin on 15 Jun 2021

I think this issue is no longer blocking, thanks to this PR to separate out the dask jobs (https://github.com/alteryx/evalml/pull/2376), this PR to refactor the dask jobs to cut down on flakes, this PR to make the separate dask jobs not block merging to main, and this PR to add a timeout that prevents pathological dask tests from taking 6 hours before ultimately being cancelled by GH Actions.

Going to move this to closed because the dask-related timeouts are no longer an issue now and shouldn't be for the foreseeable future. However, the underlying cause is still unknown.

chukarsten on 17 Jun 2021
