Libelektra: Jenkins: Retry Failed Builds

Created on 16 Sep 2019  ·  36 comments  ·  Source: ElektraInitiative/libelektra

Description

Currently, the Jenkins build fails quite often for various reasons. This issue lists some of the problems, which currently include:

  • [x] failing Maven builds,
  • [ ] failing Homepage builds,
  • [ ] [internal compiler errors](https://github.com/ElektraInitiative/libelektra/issues/2986),
  • [ ] CMake install failures,
  • [ ] workspace removal failures,
  • [x] Haskell build failures,
  • [ ] APT install failures,
  • [ ] [timeouts](https://github.com/ElektraInitiative/libelektra/issues/2984),
  • [x] failing tests,
  • [ ] [connection problems](https://github.com/ElektraInitiative/libelektra/issues/2999), and
  • [ ] Git commit failures.

Failures

| Branch | Failure Reason | Failed Build Job/Stage |
|----------|-------------|-----------|
| PR #2932 | Maven build | debian-unstable-clang-asan |
| master | Homepage build | Deploy Website |
| master | Homepage build | Deploy Website |
| PR #2945 | Internal compiler error | build-elektra-web-base |
| master | CMake install failure | debian-stretch-full |
| master | Workspace removal failure | Main builds |
| master | Workspace removal failure | Main builds |
| master | Workspace removal failure | Main builds |
| master | Workspace removal failure | Main builds |
| master | Workspace removal failure | Main builds |
| master | Workspace removal failure | Main builds |
| PR #2945 | Haskell build failure | debian-stretch-full-optimizations-off |
| PR #2945 | APT install failure | build-elektra-website |
| PR #2932 | Maven build | debian-unstable-clang-asan |
| master | Timeout | debian-stretch-full-mmap-asan |
| PR #2975 | Timeout | debian-buster-mingw-w64 |
| master | Homepage build | Deploy Website |
| master | Homepage build | Deploy Website |
| master | Timeout | debian-buster-full |
| master | Haskell build failure | debian-stretch-full-ini |
| master | Timeout | debian-unstable-full |
| master | Failing tests | debian-buster-full |
| master | Internal compiler error | build-elektra-web-base |
| master | Homepage build | Deploy Website |
| master | Homepage build | Deploy Website |
| master | Homepage build | Deploy Website |
| master | Homepage build | Deploy Website |
| PR #2998 | Timeout, Connection problems | build-elektra-web-base, debian-buster-full-i386 |
| master | Maven build | debian-unstable-clang-asan |
| PR #2998 | Timeout | build-elektra-website-backend |
| master | Connection problems | build-elektra-web-base |
| master | Homepage build | Deploy Website |
| master | Maven build | debian-unstable-full-clang |
| master | Git commit failure | buildPackage/debian/buster |
| master | Git commit failure | buildPackage/debian/buster |
| master | Git commit failure | buildPackage/debian/buster, buildPackage/debian/stretch |
| master | Git commit failure | buildPackage/debian/buster |
| master | Git commit failure | buildPackage/debian/buster |

Failing Tests

| Test | Location | Times Failed |
| -------------------------------------- | --------------------- | ------------ |
| check_external_example_codegen_econf | debian-buster-full | 1 |
| check_external_example_codegen_menu | debian-buster-full | 1 |
| check_external_example_codegen_tree | debian-buster-full | 1 |
| check_external_example_highlevel | debian-buster-full | 1 |
| check_spec | debian-buster-full | 1 |
| testkdb_ensure | debian-buster-full | 1 |

Labels: bug, build, continuous integration

Thank you for collecting the issues!

For the Maven builds we already have an issue: #2855

> For the Maven builds we already have an issue: #2855

I know 😊. I already added a link in the issue description.

Thank you for this elaborate research. We now need to fix one issue after the other.

For the Haskell problems, we can remove the Haskell bindings/plugins. They are not maintained anyway.

Haskell will be removed in #3017

Failures of docker pull in the website stage occur quite often now.

I just got connection problems for build-elektra-web-base, too.

3d070e3209ce: Retrying in 1 second

error creating overlay mount to /home/_docker/overlay2/e9563564b9365114c47d90b7e8d307565225097a525e6b1b866a2da2877b2aa8/merged: device or resource busy

script returned exit code 1

This is a full log.

> Failures of docker pull in the website stage occur quite often now.

Is this all the retrying and waiting after Pulling from build-elektra-web-base (log)?

Additionally, I think this error is new: test_service_convertengine fails during Starting build/hub.libelektra.org/build-elektra-website-backend (log 2)

Yes, I agree test_service_convertengine is not reported here yet. Actually we can disable the test as the service is not modified anyway.

@sanssecours is there a procedure for adding new tests to the above list?

> @sanssecours is there a procedure for adding new tests to the above list?

Nope. I already gave up on modifying the list, since the Jenkins build fails too often. I would recommend we just open an issue for each specific problem.

For issues related to source code I agree. For the issues related to Docker/Jenkins instability, it is enough to collect them here, since there is very little we can do besides the migration that is already under way, which unfortunately is taking longer than expected. It would be nice if @Mistreated could give more information about the status, maybe in #160.

> Additionally, I think this error is new: test_service_convertengine fails during Starting build/hub.libelektra.org/build-elektra-website-backend (log 2)

Can you please report that separately? The fix is to disable the tests.

> Can you please report that separately?

Done, see #3086

I think our best bet to make our lives much easier is to "fix" these problems using https://wiki.jenkins.io/display/JENKINS/Naginator+Plugin

Then Jenkins will restart failed jobs several times. I think we could try 5 restarts before giving up?
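The retry idea can be sketched in a few lines of Python. This is only an illustration of "restart up to 5 times, then give up", not how the Naginator plugin is implemented or configured; the names `run_with_retries`, `max_restarts`, and `delay_seconds` are hypothetical.

```python
import time


def run_with_retries(job, max_restarts=5, delay_seconds=0.0):
    """Run job(); if it raises, restart it up to max_restarts times.

    Returns a tuple (result, attempts). The last exception is re-raised
    once the initial run and all restarts have failed.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return job(), attempts
        except Exception:
            # attempts counts the initial run plus restarts so far.
            if attempts > max_restarts:
                raise
            time.sleep(delay_seconds)
```

Note that a blanket retry hides the difference between transient failures (timeouts, connection problems) and real ones (failing tests), and every restart adds load to the build queue.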

@Mistreated Can you implement this also on the old server? Or is this too risky?

Before we implement this, however, we need the new Jenkins Node as otherwise the queue will get too long.

After a bit of struggling I managed to add a new Jenkins Node.

> Before we implement this, however, we need the new Jenkins Node as otherwise the queue will get too long.

The old server is overloaded in my opinion, but we can try, I guess.

> After a bit of struggling I managed to add a new Jenkins Node.

Thank you for adding the new Jenkins node. I disabled the node for now, since it seems to break the build.

I updated the node. It should work now. If something goes wrong you can update me here again.

> If something goes wrong you can update me here again.

Looks like docker pull fails on hetzner-jenkins1, since the node does not have enough free space:

Cannot contact hetzner-jenkins1: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
failed to register layer: ApplyLayer exit status 1 stdout: stderr: write /usr/lib/git-core/git-credential-store: disk quota exceeded

> Looks like docker pull fails on hetzner-jenkins1, since the node does not have enough free space:

Node updated.

Build jobs on hetzner-jenkins1 seem to fail because of permission-related problems:

Resource: Could not create directory '/.config'. Reason: Permission denied. Identity: uid: 47000, euid: 47000, gid: 47000, egid: 47000

I updated the node again; there shouldn't be any permission issues anymore.

Why does Jenkins want to create a '/.config' and not just a '.config' directory?
There is a .config directory inside '/home/jenkins/', but it wants to create a .config folder in '/'.

I don't think user 'jenkins' should be able to do that.

@Mistreated please also make a PR to actually test if the builds work now.

> Why does Jenkins want to create a '/.config' and not just a '.config' directory?
> There is a .config directory inside '/home/jenkins/', but it wants to create a .config folder in '/'.

This might happen if the home directory of the user is /. Did you look into /etc/passwd, maybe something is wrong there?

> This might happen if the home directory of the user is /. Did you look into /etc/passwd, maybe something is wrong there?

'jenkins:x:47000:47000::/home/jenkins:/bin/sh'

All looks fine, even in the logs of the node:

'HOME='/home/jenkins' '
'NOTE: Relative remote path resolved to: /home/jenkins/.'
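To illustrate the hypothesis above, here is a minimal Python sketch of how a home directory of `/` would produce the path `/.config`. It assumes the failing program builds its configuration path as "home directory plus `.config`", which the thread does not confirm; `home_from_passwd` and `config_dir` are hypothetical helper names.

```python
import posixpath


def home_from_passwd(entry):
    """Return the home directory (sixth field) of a passwd(5) entry."""
    fields = entry.strip().split(":")
    if len(fields) != 7:
        raise ValueError("expected 7 colon-separated fields")
    return fields[5]


def config_dir(home):
    """Join a home directory with '.config' (assumed lookup rule)."""
    return posixpath.join(home, ".config")


entry = "jenkins:x:47000:47000::/home/jenkins:/bin/sh"
print(config_dir(home_from_passwd(entry)))  # /home/jenkins/.config
print(config_dir("/"))                      # /.config
```

The entry on the node yields `/home/jenkins/.config`, so if the build really tries to create `/.config`, the process in question presumably sees a different home directory than the one listed in this `/etc/passwd`.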

It would be easier to debug with a PR that shows the whole log.

Master node is down.

> It would be easier to debug with a PR that shows the whole log.

#3134

> Master node is down.

Thank you for the information. I deleted all log information for old pull requests and re-enabled the node. Unfortunately, the amount of free space on the Jenkins master is still very low (~3.9G).
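A free-space check of this kind could be scripted before a build starts. The following Python sketch uses the standard `shutil.disk_usage` call; the function names and the threshold are hypothetical, and it only reports free space on the filesystem, so it would not detect a per-user quota like the one behind the "disk quota exceeded" errors.

```python
import shutil


def free_gib(path):
    """Return the free space of the filesystem containing path, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def has_enough_space(path, required_gib):
    """True if at least required_gib GiB are free at path."""
    return free_gib(path) >= required_gib


if __name__ == "__main__":
    # Hypothetical threshold; pick one that matches the size of the images.
    print(has_enough_space(".", 10))
```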

@Mistreated I moved the discussion about the hetzner node to #3138. This issue is about temporary failures in the build server, not about wrong setup of the build server.

Looks like building Docker images does not work on hetzner-jenkins1:

stderr: error: could not lock config file .git/config: Disk quota exceeded

I disabled the node.

It is just "Disk quota exceeded"; I did not want to overkill it with memory. I cleaned it up now. It's up again.

Two more tests that sometimes fail (#3168):

 27/134 MemCheck  #23: testcpp_contextual_thread ........***Exception: Other  2.59 sec
Running main() from /opt/gtest/googletest/src/gtest_main.cc
[==========] Running 8 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 8 tests from test_contextual_thread
[ RUN      ] test_contextual_thread.instanciation

/home/jenkins/workspace/libelektra_PR-3168-L5JHIPUUQR3TWFGKHQIDK6HHW6QAMSQXWJC5ZUZMBLDMLTYA2ENA@2/src/bindings/cpp/tests/testcpp_contextual_thread.cpp:70: Failure

Expected equality of these values:
  ks.lookup ("user/hello").getString ()
    Which is: "8"
  "5"
terminate called without an active exception
60/254 Test  #57: testio_glib .................................***Failed    5.08 sec

BINDING TEST-SUITE

==================

test basics
test idle
test timer
testTimerShouldCallbackOnce (warning): measured 316ms, expected 250ms - deviation 66ms.
testTimerShouldCallbackAtIntervals (warning): measured 343ms, expected 250ms - deviation 93ms.
testTimerShouldCallbackAtIntervals (warning): measured 322ms, expected 250ms - deviation 72ms.
testTimerShouldCallbackAtIntervals (warning): measured 338ms, expected 250ms - deviation 88ms.
../src/bindings/io/test/test_timer.c:273: error in testTimerShouldChangeInterval: timer was not called the required amount of times
test file descriptor
test mix

Yet another error in https://build.libelektra.org/jenkins/blue/organizations/jenkins/libelektra/detail/master/12/pipeline/

Step 12/31 : RUN curl -o cppcms-${CPPCMS_VERSION}.tar.bz -L         "https://sourceforge.net/projects/cppcms/files/cppcms/${CPPCMS_VERSION}/cppcms-${CPPCMS_VERSION}.tar.bz2/download"     && tar -xjvf cppcms-${CPPCMS_VERSION}.tar.bz     && mkdir cppcms-${CPPCMS_VERSION}/build     && cd cppcms-${CPPCMS_VERSION}/build     && cmake ..     && make -j ${PARALLEL}     && make install     && cd /app/deps     && rm -Rf cppcms-${CPPCMS_VERSION}

 ---> Running in f5ed5e42a480

curl: (92) HTTP/2 stream 1 was not closed cleanly: PROTOCOL_ERROR (err 1)

The command '/bin/sh -c curl -o cppcms-${CPPCMS_VERSION}.tar.bz -L         "https://sourceforge.net/projects/cppcms/files/cppcms/${CPPCMS_VERSION}/cppcms-${CPPCMS_VERSION}.tar.bz2/download"     && tar -xjvf cppcms-${CPPCMS_VERSION}.tar.bz     && mkdir cppcms-${CPPCMS_VERSION}/build     && cd cppcms-${CPPCMS_VERSION}/build     && cmake ..     && make -j ${PARALLEL}     && make install     && cd /app/deps     && rm -Rf cppcms-${CPPCMS_VERSION}' returned a non-zero code: 92

script returned exit code 92

I am afraid https://wiki.jenkins.io/display/JENKINS/Naginator+Plugin is the only bigger step forwards.

Unfortunately, it will not fix the problems for Travis or Cirrus.

Do we update "Times Failed" in the start post? check_external_example_codegen_econf is failing quite often currently.

Trying to update the start post or trying to fix all these issues is hopeless. We need automatic retrying. I hope @Mistreated will implement this soon on our new server.

What do you think about #3224?

Problems are solved now. Please open new issues if builds still fail.
