Libelektra: Build Jobs: Parts of Test Suite Fail Regularly

Created on 25 Feb 2019  ·  14 Comments  ·  Source: ElektraInitiative/libelektra

Description

I opened this issue to keep track of all the temporary test failures in the build jobs in one place. The main reasons for the build failures are:

In a recent PR I had to restart the Jenkins build job 5 times before everything worked. In the PR after that I restarted the Jenkins build job three times, as far as I can remember. Either way, the failure rate is much too high in my opinion.

Failures

| Location | Failed Tests | Build Job |
|----------|-------------|-----------|
| master | testmod_gpgme (1) | debian-stable-full |
| master | testmod_gpgme (1), testmod_zeromqsend (1) | debian-stable-full-ini |
| master | testmod_crypto_botan (1), testmod_fcrypt (1), testmod_gpgme (2), testmod_zeromqsend (1) | debian-stable-full-mmap |
| master | testmod_crypto_botan (1), testmod_fcrypt (2) | debian-unstable-full |
| master | testmod_crypto_botan (2), testmod_crypto_openssl (3), testmod_fcrypt (1) | debian-unstable-full-clang |
| PR #2442 | testmod_crypto_openssl (1), testmod_gpgme (1) | debian-stable-full-ini |
| PR #2442 | testmod_crypto_openssl (1), testmod_crypto_botan (1), testmod_fcrypt (1), testmod_gpgme (3) | debian-stable-full-mmap |
| PR #2442 | testmod_crypto_openssl (1), testmod_fcrypt (1) | debian-unstable-full |
| PR #2442 | testmod_crypto_openssl (1), testmod_crypto_botan (1), testmod_fcrypt (1) | debian-unstable-full-clang |
| PR #2442 | testmod_dbus (1), testmod_dbusrecv (1) | 🍎 MMap |
| PR #2443 | testmod_crypto_botan (1), testmod_fcrypt (1) | debian-unstable-full |
| PR #2443 | testmod_crypto_openssl (1), testmod_crypto_botan (1) | debian-unstable-full-clang |
| PR #2443 | testmod_dbus (1), testmod_dbusrecv (1) | 🍎 MMap |
| PR #2445 | testmod_crypto_openssl (1), testmod_crypto_botan (1), testmod_fcrypt (1) | debian-stable-full-ini |
| PR #2445 | testmod_crypto_openssl (2), testmod_crypto_botan (2), testmod_fcrypt (2), testmod_gpgme (1) | debian-stable-full-mmap |
| PR #2445 | testmod_crypto_openssl (2), testmod_fcrypt (2) | debian-unstable-full |
| PR #2445 | testmod_dbus (1), testmod_dbusrecv (1) | 🍏 GCC |

Labels: bug, build


All 14 comments

Thank you for your summary of these problems!

Would it be possible to disable these tests only in the build jobs where they are failing?

For the crypto and fcrypt plugins, @mpranj pointed out that gpg-agent may fail under high server load. Maybe we could create a separate build job for the crypto and fcrypt plugin tests, so that other development is not blocked?

Thank you for your input!

Separating the problematic jobs might make the rebuild cycles shorter. But I think it is clear that we do not want any manual rebuilds at all. So we have the following options:

  • making it more reliable
  • some automatic loops which retry on such errors
  • disabling the tests (when someone works on these parts, she needs to activate them again)

What do you think?

  • making it more reliable

hardly possible as long as we utilize gpg-agent (which is a pain in batch jobs)

  • some automatic loops which retry on such errors

This feels dirty to me.

  • disabling the tests (when someone works on these parts, she needs to activate them again)

Seems to be the option that causes the least discomfort, although having manual regression tests is not nice either.

As discussed in the meeting: we should disable the tests.

Alternative also discussed in the meeting: Using ctest --rerun-failed

Running ctest creates the file <cmake_build_dir>/Testing/Temporary/LastTestsFailed[_timestamp].log (the timestamp is only used in Dashboard mode). This file is also used by ctest --rerun-failed (see Kitware/CMake@eb2decc02d28f41a3e189d5387be24552c42060f). It simply contains the numbers and names of the tests that last failed.

My proposal would be to call ctest as before. If it exits unsuccessfully, use grep on LastTestsFailed.log to check whether one of the tests listed above failed, and only then use ctest --rerun-failed. This causes less duplicate/confusing output.

But if the problem really is high server load, that won't help much. Instead we could try ctest --test-load, which should keep the CPU load below a certain threshold.
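The grep-then-rerun step could be sketched as below. The flaky-test list and the simulated log contents are assumptions for illustration (testmod_foo is hypothetical); in the real job the grep would run against `<cmake_build_dir>/Testing/Temporary/LastTestsFailed.log` after an unsuccessful ctest run:

```shell
# ctest writes one "<number>:<name>" line per failed test to
# Testing/Temporary/LastTestsFailed.log. Simulate such a log here:
LOG=$(mktemp)
printf '12:testmod_gpgme\n47:testmod_foo\n' > "$LOG"

# Tests known to fail sporadically (taken from the table in this issue):
FLAKY='testmod_gpgme|testmod_fcrypt|testmod_crypto_botan|testmod_crypto_openssl|testmod_zeromqsend'

if grep -q -E "$FLAKY" "$LOG"; then
  # In the real build job this branch would run: ctest --rerun-failed
  echo "known flaky test among failures: rerun"
else
  echo "genuine failure: abort"
fi
rm -f "$LOG"
```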

IMO the best option would still be to disable the tests and create a small build job that only installs the dependencies needed by these plugins/libraries, only compiles what is necessary, and only runs the problematic tests. That way we could probably get the runtime down to a few minutes, in which case manual restarting would be acceptable, I think. For comparison, our FreeBSD jobs currently take about 10 min (7 min build, 2 min test, 1 min other) to run ~200 tests.

PS: Not sure about our setup, but restarting a Jenkins pipeline from a certain stage should be possible.

Alternative also discussed in the meeting: Using ctest --rerun-failed

Thank you for looking into it!

But if the problem really is high server load that won't help much. Instead we could try ctest --test-load.

@ingwinlu did a lot of work in this direction. Our servers have the highest throughput with high load. I.e. we would slow down our tests with such options.

IMO still the best option would be to disable the tests and create a small build job that only installs the

Modular test cases are very difficult to achieve and maintain. @ingwinlu put a lot of work into it. I think we cannot invest this effort again just for a few unreliable tests.

PS. Not sure about our setup, but restarting a jenkins pipeline from a certain stage should be possible

That would be great. But I do not see the restart button in our GUI. Do we need another plugin or a newer version? @ingwinlu tried to add "jenkins build * please" for all pipeline steps, unfortunately, it did not work.

It seems like we still have failures (dbus see #2532)

What about excluding the dbus test cases for the Mac builds?
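ctest can exclude tests whose names match a regular expression via `-E`, so one option for the Mac jobs could look like the sketch below (the simulation just shows which of the test names from this issue the regex would filter out):

```shell
# On the Mac build jobs one could invoke:
#   ctest --output-on-failure -E 'dbus'
# -E excludes every test whose name matches the regex. Simulate its effect
# on some of the test names mentioned in this issue:
printf '%s\n' testmod_dbus testmod_dbusrecv testmod_gpgme testmod_fcrypt |
  grep -v -E 'dbus'
```

This prints only `testmod_gpgme` and `testmod_fcrypt`, i.e. the dbus tests would be skipped.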

It seems like we still have failures (dbus see #2532)

Yes, we do.

```
$ gcc --version
Configured with: --prefix=/Applications/Xcode-10.2.1.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode-10.2.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/c++/4.2.1
Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.5.0
Thread model: posix
InstalledDir: /Applications/Xcode-10.2.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

(...)

DBUSRECV TESTS
==============
testing prerequisites
detecting available bus types - please ignore single error messages prefixed with "connect:"
connect: Failed to open connection to system message bus: Failed to connect to socket /usr/local/var/run/dbus/system_bus_socket: No such file or directory
test commit
test adding keys
../src/plugins/dbusrecv/testmod_dbusrecv.c:228: error in test_keyAdded: string "system/tests/testmod_dbusrecv/added" is not equal to "user/tests/foo/bar"
    compared: expectedKeyName and keyName (test_callbackKey)
test adding keys
testmod_dbusrecv Results: 34 Tests done — 1 error.
```

Were you able to reproduce it locally?

We still do not know why this problem sporadically occurs. If you have any input, it would be great.

Maybe we can simply exclude the tests from the problematic build jobs? Or do the dbus* testcases fail on every build job where it runs?

Were you able to reproduce it locally?

Unfortunately not. I'm on Ubuntu.

Maybe we can simply exclude the tests from the problematic build jobs? Or do the dbus* testcases fail on every build job where it runs?

I just restarted the build job to see if it happens again.

Please re-assign me if necessary.

I now implemented automatic retry of ctest in #3224. If you still experience temporary failures of the test suites please reopen the issue. (We can increase the number of tries.)

For other failures of Jenkins/Docker, we need to find other solutions but first we finally need to do the migration. So please continue to restart the job in these cases.
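A retry wrapper along these lines could work (a sketch only, not the actual implementation from #3224; newer CMake versions also offer `ctest --repeat until-pass:<n>` for the same purpose). Here `flaky` simulates a test command that succeeds on its third attempt:

```shell
# Retry a command up to $1 times, stopping at the first success.
retry() {
  tries=$1; shift
  n=1
  while ! "$@"; do
    [ "$n" -ge "$tries" ] && { echo "failed after $n attempts" >&2; return 1; }
    n=$((n + 1))
  done
  echo "succeeded on attempt $n"
}

# Simulated flaky test: fails twice, succeeds on the third run.
STATE=$(mktemp); echo 0 > "$STATE"
flaky() {
  c=$(($(cat "$STATE") + 1))
  echo "$c" > "$STATE"
  [ "$c" -ge 3 ]
}

retry 5 flaky   # prints: succeeded on attempt 3
rm -f "$STATE"
```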

