Libelektra: Build Jobs: Parts of Test Suite Fail Regularly

Created on 25 Feb 2019  ·  14 Comments  ·  Source: ElektraInitiative/libelektra

Description

I opened this issue to keep track of all the temporary test failures in the build jobs in one place. The main reasons for the build failures are:

In a recent PR I had to restart the Jenkins build job 5 times before everything worked. In the PR after that I restarted the Jenkins build job three times, as far as I can remember. Either way, the failure rate is much too high in my opinion.

Failures

| Location | Failed Tests | Build Job |
|----------|-------------|-----------|
| master | testmod_gpgme (1) | debian-stable-full |
| master | testmod_gpgme (1), testmod_zeromqsend (1) | debian-stable-full-ini |
| master | testmod_crypto_botan (1), testmod_fcrypt (1), testmod_gpgme (2), testmod_zeromqsend (1) | debian-stable-full-mmap |
| master | testmod_crypto_botan (1), testmod_fcrypt (2) | debian-unstable-full |
| master | testmod_crypto_botan (2), testmod_crypto_openssl (3), testmod_fcrypt (1) | debian-unstable-full-clang |
| PR #2442 | testmod_crypto_openssl (1), testmod_gpgme (1) | debian-stable-full-ini |
| PR #2442 | testmod_crypto_openssl (1), testmod_crypto_botan (1), testmod_fcrypt (1), testmod_gpgme (3) | debian-stable-full-mmap |
| PR #2442 | testmod_crypto_openssl (1), testmod_fcrypt (1) | debian-unstable-full |
| PR #2442 | testmod_crypto_openssl (1), testmod_crypto_botan (1), testmod_fcrypt (1) | debian-unstable-full-clang |
| PR #2442 | testmod_dbus (1), testmod_dbusrecv (1) | 🍎 MMap |
| PR #2443 | testmod_crypto_botan (1), testmod_fcrypt (1) | debian-unstable-full |
| PR #2443 | testmod_crypto_openssl (1), testmod_crypto_botan (1) | debian-unstable-full-clang |
| PR #2443 | testmod_dbus (1), testmod_dbusrecv (1) | 🍎 MMap |
| PR #2445 | testmod_crypto_openssl (1), testmod_crypto_botan (1), testmod_fcrypt (1) | debian-stable-full-ini |
| PR #2445 | testmod_crypto_openssl (2), testmod_crypto_botan (2), testmod_fcrypt (2), testmod_gpgme (1) | debian-stable-full-mmap |
| PR #2445 | testmod_crypto_openssl (2), testmod_fcrypt (2) | debian-unstable-full |
| PR #2445 | testmod_dbus (1), testmod_dbusrecv (1) | 🍏 GCC |

Labels: bug, build


All 14 comments

Thank you for your summary of these problems!

Would it be possible to disable these tests only in the build jobs where they are failing?

For the crypto and fcrypt plugins, @mpranj pointed out that gpg-agent may fail under high server load. Maybe we could create a separate build job for the crypto and fcrypt plugin tests, so that other development is not blocked?

Thank you for your input!

Separating the problematic jobs might make the rebuild cycles shorter. But I think it is clear that we do not want any manual rebuilds at all. So we have the following options:

  • making it more reliable
  • some automatic loops which retry on such errors
  • disabling the tests (when someone works on these parts, she needs to activate them again)

What do you think?

  • making it more reliable

hardly possible as long as we utilize gpg-agent (which is a pain in batch jobs)

  • some automatic loops which retry on such errors

This feels dirty to me.

  • disabling the tests (when someone works on these parts, she needs to activate them again)

Seems to be the option that causes the least discomfort, although having manual regression tests is not nice either.

As discussed in the meeting: we should disable the tests.

Alternative also discussed in the meeting: Using ctest --rerun-failed

Running ctest creates the file <cmake_build_dir>/Testing/Temporary/LastTestsFailed[_timestamp].log (the timestamp is only used in Dashboard mode). This file is also used by ctest --rerun-failed (see Kitware/CMake@eb2decc02d28f41a3e189d5387be24552c42060f). It simply contains the numbers and names of the tests that last failed.

My proposal would be to call ctest as before. If it exits unsuccessfully, use grep on LastTestsFailed.log to check whether one of the tests listed above failed, and only then use ctest --rerun-failed. This causes less duplicate/confusing output.

But if the problem really is high server load, that won't help much. Instead we could try ctest --test-load, which should keep the CPU load below a certain threshold.
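The grep-then-rerun step could be sketched as below. The flaky-test list and the simulated log contents are assumptions for illustration (testmod_foo is hypothetical); in the real job the grep would run against `<cmake_build_dir>/Testing/Temporary/LastTestsFailed.log` after an unsuccessful ctest run:

```shell
# ctest writes one "<number>:<name>" line per failed test to
# Testing/Temporary/LastTestsFailed.log. Simulate such a log here:
LOG=$(mktemp)
printf '12:testmod_gpgme\n47:testmod_foo\n' > "$LOG"

# Tests known to fail sporadically (taken from the table in this issue):
FLAKY='testmod_gpgme|testmod_fcrypt|testmod_crypto_botan|testmod_crypto_openssl|testmod_zeromqsend'

if grep -q -E "$FLAKY" "$LOG"; then
  # In the real build job this branch would run: ctest --rerun-failed
  echo "known flaky test among failures: rerun"
else
  echo "genuine failure: abort"
fi
rm -f "$LOG"
```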

IMO the best option would still be to disable the tests and create a small build job that only installs the dependencies needed by these plugins/libraries, only compiles what is necessary, and only runs the problematic tests. That way we could probably get the runtime down to a few minutes, in which case manual restarting would be acceptable, I think. For comparison, our FreeBSD jobs currently take about 10 min (7 min build, 2 min test, 1 min other) to run ~200 tests.

PS: Not sure about our setup, but restarting a Jenkins pipeline from a certain stage should be possible.

Alternative also discussed in the meeting: Using ctest --rerun-failed

Thank you for looking into it!

But if the problem really is high server load that won't help much. Instead we could try ctest --test-load.

@ingwinlu did a lot of work in this direction. Our servers have the highest throughput with high load. I.e. we would slow down our tests with such options.

IMO still the best option would be to disable the tests and create a small build job that only installs the

Modular test cases are very difficult to achieve and maintain. @ingwinlu put a lot of work into it. I think we cannot invest this effort again just for a few unreliable tests.

PS. Not sure about our setup, but restarting a jenkins pipeline from a certain stage should be possible

That would be great. But I do not see the restart button in our GUI. Do we need another plugin or a newer version? @ingwinlu tried to add "jenkins build * please" for all pipeline steps, unfortunately, it did not work.

It seems like we still have failures (dbus see #2532)

What about excluding the dbus test cases for the Mac builds?
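ctest can exclude tests whose names match a regular expression via `-E`, so one option for the Mac jobs could look like the sketch below (the simulation just shows which of the test names from this issue the regex would filter out):

```shell
# On the Mac build jobs one could invoke:
#   ctest --output-on-failure -E 'dbus'
# -E excludes every test whose name matches the regex. Simulate its effect
# on some of the test names mentioned in this issue:
printf '%s\n' testmod_dbus testmod_dbusrecv testmod_gpgme testmod_fcrypt |
  grep -v -E 'dbus'
```

This prints only `testmod_gpgme` and `testmod_fcrypt`, i.e. the dbus tests would be skipped.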

It seems like we still have failures (dbus see #2532)

Yes, we do.

```
$ gcc --version
Configured with: --prefix=/Applications/Xcode-10.2.1.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode-10.2.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/c++/4.2.1
Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.5.0
Thread model: posix
InstalledDir: /Applications/Xcode-10.2.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

(...)

DBUSRECV TESTS
==============
testing prerequisites
detecting available bus types - please ignore single error messages prefixed with "connect:"
connect: Failed to open connection to system message bus: Failed to connect to socket /usr/local/var/run/dbus/system_bus_socket: No such file or directory
test commit
test adding keys
../src/plugins/dbusrecv/testmod_dbusrecv.c:228: error in test_keyAdded: string "system/tests/testmod_dbusrecv/added" is not equal to "user/tests/foo/bar"
    compared: expectedKeyName and keyName (test_callbackKey)
test adding keys
testmod_dbusrecv Results: 34 Tests done — 1 error.
```

Were you able to reproduce it locally?

We still do not know why this problem sporadically occurs. If you have any input, it would be great.

Maybe we can simply exclude the tests from the problematic build jobs? Or do the dbus* testcases fail on every build job where it runs?

Were you able to reproduce it locally?

Unfortunately not. I'm on Ubuntu.

Maybe we can simply exclude the tests from the problematic build jobs? Or do the dbus* testcases fail on every build job where it runs?

I just restarted the build job to see if it happens again.

Please re-assign me if necessary.

I now implemented automatic retry of ctest in #3224. If you still experience temporary failures of the test suites please reopen the issue. (We can increase the number of tries.)

For other failures of Jenkins/Docker, we need to find other solutions but first we finally need to do the migration. So please continue to restart the job in these cases.
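A retry wrapper along these lines could work (a sketch only, not the actual implementation from #3224; newer CMake versions also offer `ctest --repeat until-pass:<n>` for the same purpose). Here `flaky` simulates a test command that succeeds on its third attempt:

```shell
# Retry a command up to $1 times, stopping at the first success.
retry() {
  tries=$1; shift
  n=1
  while ! "$@"; do
    [ "$n" -ge "$tries" ] && { echo "failed after $n attempts" >&2; return 1; }
    n=$((n + 1))
  done
  echo "succeeded on attempt $n"
}

# Simulated flaky test: fails twice, succeeds on the third run.
STATE=$(mktemp); echo 0 > "$STATE"
flaky() {
  c=$(($(cat "$STATE") + 1))
  echo "$c" > "$STATE"
  [ "$c" -ge 3 ]
}

retry 5 flaky   # prints: succeeded on attempt 3
rm -f "$STATE"
```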

