Libelektra: Build Server stuff

Created on 13 Dec 2014  ·  585 Comments  ·  Source: ElektraInitiative/libelektra

This issue gives up-to-date information about the health of the build system.

Report here any permanent problems (that cannot be fixed by rerunning the build job). Temporary problems should be reported in #2967.

Current Issues (ordered by priority):

  • [ ] Continuous Releases (see #3519)
  • [ ] check if make uninstall leaves a clean system, see #1244
  • [ ] check if any temporary files are left over after running test cases
  • [ ] Check for problematic file names: find . | grep -v '^[-_/.a-zA-Z0-9]*$' (see #1615)
  • [ ] add -Werror to build jobs without warnings: #1812
  • [ ] check if the core builds with C99

Less important issues (need discussion first):

  • [ ] integrate link checker (see #1898) [done via cirrus]
  • [ ] add white space to the top-level directory names (above source & build) [done via travis]
  • [ ] simulate too little space (e.g. with limited tmpfs) [needs to be done manually first]
  • [ ] add ninja build (warnings as errors?) [now done via travis on Mac OS X]

Fixed issues:

  • [X] complexity checker: oclint (4 level)
  • [x] remove redundant jobs
  • [x] more build scripts in source?
  • [x] re-adding the -xdg build job (because we lost debian-unstable-mm)
  • [x] RelWithDebInfo in https://build.libelektra.org/jenkins/job/elektra-multiconfig-gcc-stable/203/ skipped?
  • [x] Rename elektra-gcc-configure-debian-optimizations to elektra-gcc-configure-debian-no-optimizations
  • [x] use higher -j on the mm agents (done for libelektra build job)
  • [x] jobs to update a global repo so that not every job needs to refetch the whole source again
  • [x] enable elektra-clang-asan again
  • [x] stretch build agent which builds Elektra debian packages needs webserver
  • [X] have docker variants with minimal dependencies
  • [x] run bashism checker
  • [X] build and install CppCms (build job for cppcms)
  • [X] minimal debian repos
  • [X] fix walking error on some jobs (e.g. doc, todo)
  • [x] gnupg2 on debian-wheezy-mr and debian-stretch-mr
  • [x] fast build in passwd broken?
  • [x] build+source directory should contain spaces, define the names globally -> elektra-gcc-configure-debian-intree

Obsolete/irrelevant Issues [reason]:

  • [ ] Install bash-completion on wheezy node? [wheezy too old]
  • [ ] do not run in PRs, only master is built: elektra-git-buildpackage-jessie/elektra-git-buildpackage-wheezy [wheezy too old]

Hi!

First, thank you for your build agents; they are really fast and greatly contribute to better build times.

There are some missing packages, though:

http://build.libelektra.org:8080/job/elektra-gcc-i386/lastFailedBuild/console

DL_INCLUDE_DIR=/usr/include
DL_LIBRARY=DL_LIBRARY-NOTFOUND
CMake Error at cmake/Modules/LibFindMacros.cmake:71 (message):
  Required library DL NOT FOUND.

  Install the library (dev version) and try again.  If the library is already
  installed, use ccmake to set the missing variables manually.
Call Stack (most recent call first):
  cmake/Modules/FindDL.cmake:18 (libfind_process)
  src/libloader/CMakeLists.txt:6 (find_package)

and in build:
http://build.libelektra.org:8080/job/elektra-gcc-configure-debian/lastFailedBuild/consoleFull

the error there is weird, and additionally:

 Could NOT find Boost
-- Exclude Plugin tcl because boost not found

Most helpful comment

@Mistreated thank you! Let us see if this is enough. I hope pushes to master still trigger the master builds?

master branch is now an exception of the following rule:

Suppress automatic SCM triggering

As for

Having the hetzner node would be very good, nevertheless. Are there any problems if the node is used by two build servers at the same time? Or if it is a problem: isn't it very easy to simply clone the CT?

I added a new CT (hetzner-jenkinsNode3).

All 585 comments

@markus2330

Just pushed a few build system related fixes. But you need to fix some packages on your debian-stable machine as well:

  • Please install qtdeclarative5-dev from wheezy-backports (you can remove /opt/Qt5.3.0 afterwards)
  • Please install Java 8 as a package:

    • Use this method: http://www.webupd8.org/2014/03/how-to-install-oracle-java-8-in-debian.html

    • Let cmake actually find jdk8: cd /usr/lib/jvm/ && ln -s java-8-oracle default-java

    • echo -e "/usr/lib/jvm/java-8-oracle/jre/lib/amd64\n/usr/lib/jvm/java-8-oracle/jre/lib/amd64/server" > /etc/ld.so.conf.d/java-8-oracle.conf && ldconfig

    • kill + restart the local jenkins java process. Otherwise all builds will fail

    • Optional: Remove jdk7

Looks good, thanks for fixing those issues.

I also did those steps on debian-stable agent.

For other machines installing qtdeclarative5-dev was not possible, because it conflicts with qdbus which is needed by kde4. So I restored the previous script configure-debian-wheezy as configure-debian-wheezy-local.

I also added the installation steps you mentioned as notes in the README.md because they might be of interest to others.

Thanks for upgrading the agents!

Stuff that is missing on stable

1.) latex (+ I think texlive-latex-recommended is needed, too)
see http://build.libelektra.org:8080/job/elektra-doc/495/console

-- Found Doxygen: /usr/bin/doxygen (found version "1.8.8") 
CMake Warning at doc/CMakeLists.txt:46 (message):
  Latex not found, PDF Manual can't be created.


-- Found Perl: /usr/bin/perl (found version "5.20.2") 
-- Configuring done
-- Generating done
CMake Warning:
  Manually-specified variables were not used by the project:

    BUILD_EXAMPLES

2.) Can you install clang (for elektra-clang; wheezy's clang won't work)?
3.) Can you install mingw+wine for elektra-gcc-configure-mingw?

apt install --no-install-recommends doxygen-latex + clang + mingw done

Why do you need wine?

Btw, you should change i586-mingw32msvc-X to i686-w64-mingw32-X in Toolchain-mingw32.cmake. Right now this won't work on unstable.

Thank you for docu!

wine is needed to execute the cross-compiled windows binaries (e.g. exporterrors.exe)

I think you installed the mingw that builds for w64. In the mingw32 package, there is still a /usr/bin/i586-mingw32msvc-c++.

A new toolchain file for w64 is nevertheless appreciated.

I installed gcc-mingw-w64-i686 which is the x64 build of mingw with i686 as target.
The package mingw32-binutils is deprecated and not available on unstable any more.
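
A minimal sketch of what such a w64 toolchain file could look like (file name, triplet and root path are assumptions based on the gcc-mingw-w64-i686 package, not a tested setup):

cat > cmake/Toolchain-mingw-w64.cmake <<'EOF'
# sketch: cross-compile for 32-bit Windows with the gcc-mingw-w64-i686 package
set (CMAKE_SYSTEM_NAME Windows)
set (CMAKE_C_COMPILER i686-w64-mingw32-gcc)
set (CMAKE_CXX_COMPILER i686-w64-mingw32-g++)
set (CMAKE_RC_COMPILER i686-w64-mingw32-windres)
set (CMAKE_FIND_ROOT_PATH /usr/i686-w64-mingw32)
# search headers and libraries only in the target environment, programs only on the host
set (CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set (CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set (CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
EOF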

Wine installed on both containers.

Actually, the mingw build is bound to stable, so that should not be an issue.

MinGW-w64 is a fork of mingw and is quite a different target. Nobody has tested it up to now.

thanks for installing wine

Mingw-w64 looks superior. Maybe it's time to move on :-)

Contributions welcome ;) I do not have any machine to test it.

I am afraid you got the wrong wine, it should be apt-get install wine32

see also http://build.libelektra.org:8080/job/elektra-gcc-configure-mingw/218/console

Nope.

root@debian-stable:~# apt-get install wine32
....
E: Package 'wine32' has no installation candidate

ok, dpkg --add-architecture i386 will solve this. But can't you just pin the mingw/wine job to your build machine? The mingw setup is rather special.

Edit: I'll see if I can get elektra build with mingw-w64 so I don't need to install tons of i686 libs.
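
For reference, the full multiarch sequence would be (standard Debian procedure; the thread only records the first command):

dpkg --add-architecture i386
apt-get update              # needed so apt learns about the i386 package lists
apt-get install wine32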

The problem is I do not have a spare jessie machine and wheezy's mingw does not know C++11.

I managed to get mingw-w64 working. However std::mutex is not available because there's no glibc on windows and std::mutex depends on pthreads. Any ideas?

Wow, thanks!

Does it lead to a compilation error? The std::mutex is not used for internal
functionality, but only in a header file to be included by a user. It is used
in test cases though.

One solution for compilation problems is to provide a std::mutex in the mingw case throwing system errors on every attempt to lock/unlock. Actually, I would expect the mingw people to at least provide something like that (e.g. when some macro is set, similar to -D_GLIBCXX_USE_NANOSLEEP).

https://github.com/meganz/mingw-std-threads might be another way. But that is
most likely only useful if all test cases except the ones involving std::mutex
already run.

Basically, this is only one instance of C++11 not properly available.

mingw status right now:

  • added dlfcn-win32 as external project to libloader. this way cmake checks out + compiles the library as an additional build step. I'm linking the archive to avoid additional dll deps.
  • added winsock2.h/ws2_32.dll dependency to cpp11_benchmark_thread. required by gethostname()-call

Right now I'm building with -static-libgcc + -static-libstdc++. Otherwise wine is unable to find the dlls. Additionally, mutex doesn't compile either. I tried mingw-std-threads. Just got more compile errors :-)

If I switch from x86_64-w64-mingw32-X to x86_64-w64-mingw32-X-posix, std::mutex compiles fine, because the pthread stuff is defined. However I get an additional dependency on libwinpthread-1.dll, which wine is unable to find.

I think our best bet is using x86_64-w64-mingw32-X-posix though.

Again, I am surprised that you even have this problem. Up to now we were happy when we got a libelektra.dll.

I cannot say anything about this x86_64-w64-mingw32-X-posix decision, because I do not use it and do not know the implications. I am surprised that such a posix lib even exists; I thought that the posix-layer approach is cygwin and not mingw.

Does this decision even have an effect on libelektra.dll? If it's only for the test cases, no one will care (as long as the build server is able to run it). If the test cases run, it will be a huge benefit. (See #270 where the unit tests unveiled some strange bugs on Mac OS X)

It seems like libwinpthread-1.dll can be downloaded; I do not know if it works with wine though. Can you also add it as an external project like done with dlfcn-win32 (so that all dlls are handled in the same way)? Otherwise, whether you need to download one or three dlls for the tests might not really matter (again, I am no user, and do not understand the deployment concept, if there is any, of windows dlls).

@beku What do you think? Do you have time to test our latest 0.8.13 mingw-w64 build on Windows together with oyranos?

Are tests usually enabled for the mingw build job? Yesterday all of them were disabled.

Yes, they were disabled. But afaik examples/benchmarks like cpp11_benchmark_thread were disabled, too. So I thought you changed it and compiled more than was done previously.

I compiled the whole repo with C++11 enabled. Nothing more.

But executables like bin/basename.exe built with -posix run fine as long as you copy the required dlls to the bin directory (thank you windows for not having RPATH). I haven't found a way to a) let cmake find the dll directory + b) point wine to the dll directory.
I thought static linking would work, but then the build fails with duplicate symbols while linking the elektra dll, because the dll already has the symbols included.

@markus2330 I managed to get elektra to compile with mingw + running with wine without copying any dlls. The trick is to always enable static linking for both executable AND shared objects (CMAKE_SHARED_LINKER_FLAGS/CMAKE_EXE_LINKER_FLAGS => "-static").

To work around duplicated symbols I've added version-scripts for libelektra and libelektratools. This way only our symbols get exported.
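
Spelled out as a configure call, the setup described above amounts to something like this (a sketch; the toolchain file name is a placeholder):

cmake -DCMAKE_TOOLCHAIN_FILE=../cmake/Toolchain-mingw-w64.cmake \
      -DCMAKE_EXE_LINKER_FLAGS=-static \
      -DCMAKE_SHARED_LINKER_FLAGS=-static \
      ..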

This works really fine. e.g.

$ wine64 ./bin/kdb-static.exe
Usage: Z:\home\manuel\build\bin\kdb-static.exe <command> [args]

Z:\home\manuel\build\bin\kdb-static.exe is a program to manage elektra's key database.
Run a command with -H or --help as args to show a help text for
a specific command.

Known commands are:
check   Do some basic checks on a plugin.
convert Convert configuration.
cp      Copy keys within the key database.
export  Export configuration from the key database.
file    Prints the file where a key is located.
fstab   Create a new fstab entry.
get     Get the value of an individual key.
[...]

$ wine64 bin/cpp_example_iter.exe
user/key3/1
user/key3/2
user/key3/3

Even bin/cpp11_benchmark_thread.exe works.

Other things just crash:

$ wine64 ./bin/kdb-static.exe get
wine: Unhandled page fault on read access to 0x00000000 at address 0x7fd0e8b62c8a (thread 0009), starting debugger...
Application tried to create a window, but no driver could be loaded.
Make sure that your X server is running and that $DISPLAY is set correctly.
Unhandled exception: page fault on read access to 0x00000000 in 64-bit code (0x00007fd0e8b62c8a).
Register dump:
 rip:00007fd0e8b62c8a rsp:000000000033f428 rbp:0000000000000000 eflags:00010293 (  R- --  I S -A- -C)
 rax:0000000000000000 rbx:000000000033f700 rcx:0000000000000000 rdx:000000000033f5b0
 rsi:0000000000000000 rdi:0000000000000000  r8:0000000000000000  r9:0000000000000072 r10:0000000000000000
 r11:000000000003f615 r12:000000000033f5b0 r13:00000000000373b0 r14:0000000000000000 r15:000000000033f930
Stack dump:
0x000000000033f428:  00007fd0e748ea93 0000000000000000
0x000000000033f438:  0000000000000000 0000000000000000
0x000000000033f448:  0000000000000028 0000000000010020
0x000000000033f458:  8d98315017c96400 6f46746547485300
0x000000000033f468:  687461507265646c 0000000000000000
0x000000000033f478:  0000000000000000 0000000000000000
0x000000000033f488:  000000000003fab0 0000000000030000
0x000000000033f498:  8d98315017c96400 6f46746547485300
0x000000000033f4a8:  687461507265646c 0000000000000000
0x000000000033f4b8:  0000000000000000 0000000000000000
0x000000000033f4c8:  0000000000000000 0000000000000000
0x000000000033f4d8:  0000000000000000 0000000000000000
Backtrace:
=>0 0x00007fd0e8b62c8a strlen+0x2a() in libc.so.6 (0x0000000000000000)
  1 0x00007fd0e748ea93 MSVCRT_stat64+0x92() in msvcrt (0x0000000000000000)
  2 0x00000000004744af in kdb-static (+0x744ae) (0x000000000003f9d0)
  3 0x000000000043bda5 in kdb-static (+0x3bda4) (0x000000000003f9d0)
  4 0x0000000000431d76 in kdb-static (+0x31d75) (0x00000000000360a0)
[...]

Right now I've simply added the version-script stuff without thinking about other compilers. Shall I continue my work or are you not interested?

crashes in src/plugins/wresolver/wresolver.c because pk->filename is NULL

pk is of type resolverHandles.user

I tried to take a look at the plugin but I fail to understand the for-loop in elektraWresolverOpen. The loop calls elektraWresolveFileName --> elektraResolve{Spec,Dir,User,System} which all malloc resolverHandle->filename and therefore leak memory.

Thanks for pointing that out! The code is obviously broken since its introduction in c87ae8e87a716b02b2c7ed790ad56a89d95547a9.
During the loop, only the system handle was ever initialized. This led to crashes when another namespace was used.

I fixed it in
edb4d50856bb5331749220de5a83fa2062624a9d

About continuing work: On the one hand, it would be nice if the compiled stuff also runs. On the other hand, the release should happen this weekend, so a pull request would be important soon (there should be at least a chance of a short feedback cycle, e.g. about what the version-scripts actually do)

But imho it's enough if only one variant (the static compilation) works. Great to see the kdb tool running!

Where can I find edb4d50856bb5331749220de5a83fa2062624a9d?

edb4d50856bb5331749220de5a83fa2062624a9d was pushed a bit later.

Which gcc versions are installed on debian-unstable-mm?

http://build.libelektra.org:8080/job/elektra-multiconfig-gcc-unstable/build_type=Release,gcc_version=5.2,plugins=ALL/56/console

says there is no gcc-5.2

Can you install as much compilers as possible, please?

In some issue or PR I've said that I've removed all compilers except the latest.
Edit: gcc 4.9 on stable, 4.9 + 5.x (default) on unstable

Please do these kinds of tests (I find them highly unnecessary anyway) on your own containers. Mine won't stay forever anyway.

I have not read that. They have maybe 50MB each. Could you please install them again and answer the first question?

Maybe I told you in our meeting. But I've definitely told you.

debian-unstable:~ # gcc -v 2>&1 | tail -n 1
gcc version 5.2.1 20150903 (Debian 5.2.1-16)

The version specific binary is called gcc-5. No separate package for minor versions anymore. So your multiconfig-gcc with this level of detail is kind of obsolete. I recommend removing gcc 4.7 and replacing gcc-5.2 with gcc-5 and be done.

The only additional compiler available I haven't installed is gcc-4.8. And gcc-4.8 has already been tagged for removal.

Thanks for the info! Seems like the glory days of many available compilers are over.

I fixed multiconfig-unstable.

I will close for now, thanks for the excellent agent setup.

Hello, jessie (stable) needs some more packages. Could you please install:

  • [ ] fakeroot
  • [ ] gpg (+ create key for Autobuilder [email protected])
  • [ ] reprepro (maybe already installed, the script did not get that far)
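
For the key, an unattended generation along these lines should do (a sketch; the parameters match the 2048-bit RSA key shown further below, and the mail address is elided in this rendering):

gpg --batch --gen-key <<'EOF'
# the address below is elided in this rendering of the thread
Key-Type: RSA
Key-Length: 2048
Name-Real: Autobuilder
Name-Email: [email protected]
Expire-Date: 0
%commit
EOF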

fakeroot installed; gpg + reprepro are already installed.
Can you mail me your existing gpg key? So both build machines have the same one.

It's ok to have different gpg keys. I am not sure if the current setup uses them at all, so first wait to see if http://build.libelektra.org:8080/job/elektra-git-buildpackage-jessie/2/ fails.

  • debhelper + libsystemd-journal-dev installed
  • python-dev is a wrong dependency. It should be python2.7-dev or python3-dev or both
  • why do we need python-support?

Thanks for installation!

python-dev is available for Jessie, and python-support, too. Please install them.

I tested it locally, when these packages are installed, it builds for jessie.

Sure, it's available but it's a wrong dependency. python-dev depends on python2.7-dev which is _not_ sufficient. Instead python2.7-dev + python3-dev is required.

python-support isn't required at all imho.

I do not know why the dependencies were chosen this way, most of the packaging was done by @iandonnelly during gsoc.

Yes, the packages should be updated to build the python3 bindings, too. Currently, it's simply not done. Nevertheless, you can install python3-dev, so that the build won't break (when python3 bindings+plugins are added to the debian package).

That doesn't mean they are correct :-) - I'm fairly sure about the python-dev deps.
Can you please replace them and remove the python-support dep?

python3-dev and python2.7-dev are already installed. Otherwise no binding would build.

Btw. the official debian package from @pinotree builds python3-only. It would be a waste of time to fix what's in our "debian" branch, the work of @pinotree is superior anyway.

When I find time, I will update our "debian" branch to what @pinotree has done in the official package. He already allowed us to do so. I will wait for the qt-gui update, currently there is no hurry to change. And having python2 support would be important for one installation (where cheetah is used, which won't work with python3).

I've never said I'll remove the python2 packages. All I'm saying is python-dev is an inaccurate dependency. We require explicit versions. So pythonX-dev is the correct dep to use.

Hopefully pinotree worked out the dependency correctly.

Btw, cheetah is dead. Don't use it.

Ok, replaced it. Please revert b7c266b36b0ab0fad9120e67a457b580c7c44690 and install python-support if it is needed after all.

I am sure pinotree did it correctly ;)

And it says: dpkg-checkbuilddeps: Unmet build dependencies: build-essential:native
http://community.markus-raab.org:8080/job/elektra-git-buildpackage-jessie/3/console

installed

python-dev is a wrong dependency. It should be python2.7-dev or python3-dev or both

  • python-dev installs the development package for the default Python 2 version; since Wheezy, this is Python 2.7
  • python3-dev installs the development package for the default Python 3 version; Python 3.2 in Wheezy, 3.4 in Jessie, and so far still 3.4 in Stretch (I guess soon will be 3.5)

So, if you want to build against the default Python 2/3 version, use respectively python-dev/python3-dev, not the pythonX.Y-dev versions (which you need to use when you explicitly want a precise Python version installed, even if not the only one installed on the system, and not the default one). Using either is what I recommend.

from python-dev description:
This package is a dependency package, which depends on Debian's default Python version (currently v2.7).

According to this text, python-dev may well depend on python3 sometime soon.

Furthermore: There never will be another python2 version. So python2.7-dev will be the last python2 dev package ever.

Depending on python3-dev is what I said.

Now only the key is missing:

gpg: new configuration file `/home/jenkins/.gnupg/gpg.conf' created
gpg: WARNING: options in `/home/jenkins/.gnupg/gpg.conf' are not yet active during this run
gpg: keyring `/home/jenkins/.gnupg/secring.gpg' created
gpg: keyring `/home/jenkins/.gnupg/pubring.gpg' created
gpg: skipped "Autobuilder <[email protected]>": secret key not available
gpg: /tmp/debsign.DlSdnFtB/elektra_0.8.13-1.41.dsc: clearsign failed: secret key not available
debsign: gpg error occurred!  Aborting....
gpg: checking the trustdb
gpg: 3 marginal(s) needed, 1 complete(s) needed, PGP trust model
gpg: depth: 0  valid:   1  signed:   0  trust: 0-, 0q, 0n, 0m, 0f, 1u
pub   2048R/08C91995 2015-09-30
      Key fingerprint = BA4C 688E 9071 FD3F 57ED  E9D6 D0A9 EDB9 08C9 1995
uid                  Autobuilder <[email protected]>
sub   2048R/E69F110A 2015-09-30

done

Thank you!

Please export /home/jenkins/repository via http.

cannot access /home/jenkins/repository: No such file or directory ?

@manuelm Could you please install ronn on the agents? (needed for generating man pages)

apt-get install ruby-ronn

done

Thanks, jessie packages build again, and man pages are now included!

Please install musl, i.e.

apt-get install musl musl-dev musl-tools

Thank you!

musl installed and agents upgraded

Two important things about the build server:

  1. Do not create new empty jobs, but rather duplicate existing ones; they have correct settings (except what's mentioned in number 2.).
  2. We should use reference clones (in /home/jenkins/libelektra) or prefer shallow clones for every build-job (currently done only for some, e.g. elektra-clang). Currently the traffic is >300MB on commits because of the many unnecessary reclones.

@mpranj It would be great if you can fix 2.

@markus2330 just to make sure: I should just apply the same clone behaviour to all build jobs like it is in elektra-clang?

Shallow clones applied to all build-jobs except:

  • [ ] elektra-git-buildpackage-jessie
  • [ ] elektra-git-buildpackage-wheezy
  • [ ] elektra-multiconfig-gcc-stable
  • [ ] elektra-multiconfig-gcc-unstable
  • [ ] elektra-source-package-test

These jobs check out to some sub-directory. Wasn't sure what you want there so I'll leave them as they are for now.

Thank you! Yes, they need the complete history and branches; shallow clones make no sense there, but the reference clone repository would be useful.
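
On the git side this amounts to roughly the following (a sketch using the /home/jenkins/libelektra path mentioned above; the workspace name is arbitrary):

# one-time mirror on each agent, to be refreshed periodically (e.g. by a cron-like job)
git clone --mirror git://github.com/ElektraInitiative/libelektra.git /home/jenkins/libelektra
# per-job clones borrow objects from the mirror instead of refetching everything
git clone --reference /home/jenkins/libelektra \
    git://github.com/ElektraInitiative/libelektra.git workspace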

Jenkins was updated to 1.651.2. Also all plugins were updated.

I will keep the issue open for the reference clone repos. We should also have "cron jobs" which update the repos from time to time, ideally using jenkins itself.

Jenkins stopped building some jobs (since the update apparently). It fails with
ERROR: Couldn't find any revision to build. Verify the repository and branch configuration for this job.

Thanks for the info. I try to downgrade github request builder from 1.31 to 1.14.

Now it seems stuck when setting the build status for the Github commit. It does warn that this is deprecated in the config.

I also tried to downgrade every plugin with *git* in its name, but then there were still errors (strange error related to Mailer Plugin, downgrade of Mailer Plugin did not help). So I updated everything to recent versions again. The problem seems to be a known issue upstream:

https://github.com/janinko/ghprb/issues/347

I hope they will fix it soon.

Another question: Does someone know how to run multiple jobs for every PR? (I would like to run both elektra-mergerequests-stable and elektra-mergerequests-unstable)

The elektra-test-bindings job is working fine with parameterized builds (as also described in the upstream ticket). Couldn't we just switch it to parameterized builds? The bug has been reported upstream for a while, I don't see it fixed soon.

Good idea, we could change all PR jobs to parameterized builds, it actually has only advantages. It allows us to run the jobs manually by specifying a branch, too. And it also can be used for regular build jobs.

Ideally, every job could be executed by github PRs, too. (Except those specifically for non-PR tasks that update docu or coverage of the master branch)

A disadvantage of elektra-test-bindings' config is that it only does polling and takes quite long until it starts building (up to 5 min). I do not want to activate "Use github hooks for build triggering", however, to not break the build job.

Btw. are you sure that the "shallow clone" option is okay for the github pullrequest builder jobs?

I wonder how github picks the build job it uses for new PRs. Why are elektra-test-bindings and elektra-ini-mergerequests never selected for a new PR? Why is it sometimes elektra-mergerequests-unstable and sometimes elektra-mergerequests(-stable)?

@manuelm do you have any idea?

Btw. somehow the communication between finished build jobs and github is severely impaired (even for elektra-test-bindings). It now says on nearly every build "Some checks haven’t completed yet".

A disadvantage of elektra-test-bindings' config is that it only does polling and takes quite long until it starts building (up to 5 min).

And this is a problem because? Testing takes more than 5 minutes anyway.

Why is the elektra-test-bindings and elektra-ini-mergerequests never selected for a new PR?

Why should it? elektra-test-bindings gets triggered by the "Trigger phrase" only. No idea what elektra-ini-mergerequests is.

Why is it sometimes elektra-mergerequests-unstable and sometimes elektra-mergerequests-stable?

The -stable/-unstable are new? I'm not sure triggering multiple jobs per new PR is possible. I would do subjobs.

Btw I've said this a few times already, but I think the number of jobs is getting ridiculous and a sign of a messed-up config. But criticizing is always easier than solving something.

The 5 min is a problem when you want to debug the build server. And I still hope that we get a quick first test sometime, taking about 5 min.

Ahh, ok, I missed the option "Only use trigger phrase for build triggering". The config for the github request builder is really a mess.

Someone talked about github projects where they have multiple jobs running for every PR. (Displayed individually)

What is a subjob? Do you mean multijob?

Someone talked about github projects where they have multiple jobs running for every PR. (Displayed individually)

You'll have to add two services on github.

What is a subjob? Do you mean multijob?

Yeah multijob.

btw, what about https://docs.travis-ci.com/ ? Travis has support for OSX.

I know it won't replace jenkins but might replace the PR/on every commit builds. Jenkins can still do the multiple compiler/etc.. testing.
Edit: Travis even has gcc + clang.

Agreed, it would be interesting to use their CPU power/electricity for free as elektra is open source.

It is likely that the connection between github and jenkins actually is 1:1. In the github service I entered http://build.libelektra.org:8080/github-webhook/ and I did not find a way to create another URL in jenkins. (I only found a way to specify an override, but this did not create a new URL.)

In https://github.com/janinko/ghprb/issues/142 they discuss that it should "just work"? (Without adding multiple services)

The sha1 problem, however, should be solved now. It was broken because Jenkins introduced a new security measurement which prunes unknown environment variables. I fixed it as suggested (added -Dhudson.model.ParametersAction.safeParameters=ghprbActualCommit,ghprbActualCommitAuthor,ghprbActualCommitAuthorEmail,ghprbAuthorRepoGitUrl,ghprbCommentBody,ghprbCredentialsId,ghprbGhRepository,ghprbPullAuthorEmail,ghprbPullAuthorLogin,ghprbPullAuthorLoginMention,ghprbPullDescription,ghprbPullId,ghprbPullLink,ghprbPullLongDescription,ghprbPullTitle,ghprbSourceBranch,ghprbTargetBranch,ghprbTriggerAuthor,ghprbTriggerAuthorEmail,ghprbTriggerAuthorLogin,ghprbTriggerAuthorLoginMention,GIT_BRANCH,sha1 to /etc/default/jenkins).
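
With the Debian package this presumably ends up in the JAVA_ARGS line of /etc/default/jenkins, roughly like this (the parameter list is abbreviated here; see above for the full value):

# /etc/default/jenkins (sketch; safeParameters list shortened)
JAVA_ARGS="$JAVA_ARGS -Dhudson.model.ParametersAction.safeParameters=ghprbActualCommit,...,GIT_BRANCH,sha1"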

About usage of additional build servers: Yes, go ahead. It also solves the issue of multiple build jobs for a single PR ;) I never used travis-ci, so I cannot say anything about it. I gave travis permission to access ElektraInitiative.

First travis build: https://travis-ci.org/ElektraInitiative/libelektra/builds/130425147
I think we need some yaml file so that travis knows what to do.

And I figured out how to do multiple jenkins jobs per PR, a different context for every build-job was needed. In the next meeting we discuss what the "fast" and other build jobs should do.

I'm working on travis (or checking some things out)

Have fun. Travis also added a github service, so I guess that a PR will be built with travis, too.

I'm already swearing loudly

-- Could NOT find JNI (missing: JAVA_AWT_LIBRARY JAVA_JVM_LIBRARY JAVA_INCLUDE_PATH JAVA_INCLUDE_PATH2 JAVA_AWT_INCLUDE_PATH)
-- Exclude Plugin jni because jni not found

I'm unable to get the java plugin configured correctly. However the Java bindings work. On debian unstable. Any ideas? Looking at the cmake module doesn't help much.

Edit: /usr/lib/jvm/java-8-openjdk-amd64/include/linux/jni_md.h, /usr/lib/jvm/java-8-openjdk-amd64/include/jawt.h and /usr/lib/jvm/java-8-openjdk-amd64/include/jni.h is in place

Edit 2: Got it. JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64/ ....
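
Spelled out, the working configure call is presumably:

# point cmake's FindJNI at the JDK root so jni.h and jawt.h are picked up
JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64/ cmake ..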

https://travis-ci.org/manuelm/libelektra/builds/130638376

Debian unstable built within a docker container. But building takes ages.
Any good ideas?

clang is often faster regarding build time, but i think the installation of the dependencies is what takes a large amount of time

Isn't there a more minimal debian docker image than the one used? It seems a lot of packages get installed that should not be needed.

@manuelm probably the dist-upgrade. a lot of packages get updated that are desktop specific like wayland

No. dist upgrade is short. maybe a minute. about 50% of the time is taken by installing the build deps.

I'm pushing the build image to hub.docker.com right now. Hope that will speed things up. But the image is 1.9 GB.

Elapsed time 14 min 8 sec

Not sure if we can do much better

like i said, clang maybe gets us 2-3 mins. at least it does for the aseprite project
https://travis-ci.org/aseprite/aseprite

It would be useful to have both compilers anyway.

Just had an idea while preparing work stuff: What if we extract the paths of all commits in the push request and build bindings/plugins only if they are affected? e.g.

  • change in cmake/* triggers everything (plugins + bindings)
  • change in src/bindings/foo triggers binding foo
  • change in src/plugins/foo triggers plugin foo
  • a change anywhere else doesn't compile any plugins + bindings

We still have the daily/twice a day full build on jenkins.
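
A rough sketch of such a filter (TRAVIS_COMMIT_RANGE is the range travis provides for a push; build_everything stands in for whatever triggers the full build):

changed=$(git diff --name-only "$TRAVIS_COMMIT_RANGE")
if echo "$changed" | grep -q '^cmake/'; then
    build_everything    # hypothetical helper: plugins + bindings
else
    # list only the plugins/bindings whose directories were touched
    echo "$changed" | sed -n 's|^src/\(plugins\|bindings\)/\([^/]*\)/.*|\1/\2|p' | sort -u
fi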

@manuelm good idea, @tom-wa would write such a script, can you create a new issue for this?

@mpranj: Reminder: add Mac OS X builds to travis and add mingw builds to PR. (*BSD seems to be more effort)

@markus2330 now I understand @manuelm's docker approach: travis will not support ubuntu 16.04 till next year, so docker is needed to get all the dependencies that ubuntu 14.04 does not have (swig3.0, libsystemd-devel).

I'm sorry I couldn't attend the meeting today. At work we're still preparing a big software rollout today, so I can't leave the office. But within a short delay I can answer e-mails.

I've started to add OS X builds for travis 2 days ago: https://github.com/manuelm/libelektra/blob/e41ac43a18e5e9f9640a4042a313cc43f2704f65/.travis.yml
Build is here: https://travis-ci.org/manuelm/libelektra/builds/130898079
Open things here:

  • [ ] crypto_openssl fails to compile
  • [ ] bindings tests fail
  • [ ] no java

I'm happy if anyone wants to take over my work from here. I don't have OS X, and the waiting time for travis to inspect the OS X system adds up very fast.

re: docker: Yeah, travis default ubuntu version doesn't work well. Even cmake is too old.
Fetching the uploaded docker image takes only about 3 mins. And adding more images is a no-brainer. So I think that's a nice way to work around any pitfalls the default travis linux environment has (or might have after an update).

I haven't figured out a good way to integrate the different build and test phases of libelektra (cmake + make + make test) with docker (build + run) + travis (before_install, before_script, script). Docker containers exit after the command completes. Since docker containers are meant to be throw-away, you cannot resume afterwards. So your disk/compile state vanishes, unless you mount a local directory into the container. Will continue to work on docker next week.
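
One way around this is mounting the checkout into the container and running all phases in a single docker run, so the build tree lives outside the throw-away container (a sketch; the image name is hypothetical):

# build state survives in the mounted workspace, not in the container
docker run --rm -v "$PWD:/elektra" -w /elektra buildelektra/sid \
    sh -c 'mkdir -p build && cd build && cmake .. && make && make test'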

@manuelm Great, you got further than we thought. Mac OS X for PRs and per commit would be really great. A lot of people are using Mac OS X now and I do not want to break the build for them again and again. In the meeting today @mpranj said he would pick up your work. Do you want to create a PR with the travis file?

No, as the travis file still has to be modified afterwards. Otherwise it will build OS X only. I would rather prefer if @mpranj takes up my travis file and fixes the remaining OS X related issues. I'll then take his travis file, convert it into a matrix build and integrate the linux/docker builds + #730 (if available by then)

PS: please do travis testing in a user-namespaced repo. You'll do a lot of pushes :-)

mingw64 builds on PR added, should work. Sorry for the delay. I'll look into travis today!

Is there a downside of enabling build jobs to be triggered with a phrase in a PR (with the Github PR builder)?

I'd like to configure the jobs from #745 so I can test whether I fixed it, but I can apply it to most/(all) build jobs.

EDIT: I'd rather not automatically start all jobs, we have quite a few already.

I think its a good idea if we can configure every job to be triggered with a phrase. I think there is a small downside (at least for elektra-test-bindings): you have to enter for which branch you want to build and cannot simply press "build job now". Would be great if you find a solution for that.

And you are right that we should rather reduce the automatic jobs.

There's actually a very simple solution. We're using the (env)variable sha1 to build PRs. Parameterized builds prompt you for the value, whether a default value is set or not.

Solution: set the env-variable sha1 to master (in the jenkins config itself) and disable parameterized builds. If there's no objection to setting the variable, this would solve exactly what you mentioned above @markus2330.

I have already set it, so you can hit that build button on e.g. elektra-mergerequests and it will start building master.

Yes, this is a very good solution, I like it. It would also allow us to build a release branch with a single switch (if we need one in the future). Until then "master" is always the correct choice if not executed from within a PR.

I think it would also solve the problem of the filtered environment variable we had earlier.

Then we can also think about reducing the build jobs (no duplication for -mergerequest build jobs) and a new consistent naming schema. (Suggestions can be made here.) There might be one open problem: currently we build coverage, docu, etc. for both PRs and master and copy them to separate places. If we merge the build jobs, we need a way to distinguish master/PR within the job, to copy coverage and docu to different places.

I'm almost done applying this to all jobs (but the server _just_ got really slow).
Didn't apply to jobs which build wildcards ** (doc and some others, but very few)

You can always stop build jobs when you want to work on them if you restart them later (except around release time). Usually, jenkins itself is the reason for the slowdown of the machine. At the moment an rsync from a backup might be the problem, but it is urgent.

Yeah no problem at all, it should be done but I'll make some last checks.

The news @ElektraInitiative/elektradevelopers:

  • as mentioned, almost _everything_ can now be built from PRs and/or by just hitting the build button
  • the trigger phrases are always the job name without the elektra- prefix. (e.g. elektra-clang: jenkins build clang please) I did not change jenkins build please and other old phrases for legacy reasons
  • the github build status message is always exactly the build job name

Thank you, well done! Please update doc/GIT.md so that everyone knows which phrases are working now.

(I hope the @mention works for a single message only and not everyone reads every message we write here)

The Mac OS X build for xcode 6.1 seems to be broken:
https://travis-ci.org/ElektraInitiative/libelektra/jobs/138919488

I triggered a re-build for that one but it seems to me like a temporary travis failure.

Can you document how to retrigger a build for a PR? I did not know it is possible.

Directly on travis-ci.org, using your link above:
(screenshot showing the restart build option)

I doubt this is document-worthy but I can do it nevertheless.
The build is still not working solely because of the git checkout. I don't think this is our fault.

Ah. I think you merged it before the build was triggered/started in the first place.
When I rebuild other previously successful PRs, it's broken also.

This is more of a travis problem than anything else.

Ok, thank you for investigating.

@manuelm debian-stable-mm seems to be unreachable (for both jenkins and for me from TU network). Could you please investigate?

Jul 07 15:14:37 <hostname> systemd-nspawn[544]: [  OK  ] Removed slice User and Session Slice.
Jul 07 15:14:37 <hostname> systemd-nspawn[544]: [  OK  ] Stopped target Graphical Interface.
Jul 07 15:14:37 <hostname> systemd-nspawn[544]: [  OK  ] Stopped target Multi-User System.
etc..

Looks like someone stopped the container. I've started it again.

btw, starting tomorrow morning I'll be away from home until August 1. I'm still reachable by e-mail but expect a short delay.

Thank you for the quick fix! So I suppose you also won't be here for the next meetings.

Yep

Some jobs have the error:

Seen branch in repository origin/debian
Seen branch in repository origin/kdb_import_man
Seen branch in repository origin/master
Seen 3 remote branches
FATAL: Walk failure.

e.g. http://community.markus-raab.org:8080/job/elektra-icheck/lastFailedBuild/console http://community.markus-raab.org:8080/job/elektra-doc/lastFailedBuild/console

It might be caused by a jenkins update or @KurtMi creating the kdb_import_man branch?

Note to myself: cppcms needs to be installed.

Sorry for the branch, I made a PR directly on the github page.

Is it easier to create PRs this way? Doesn't github offer to delete the branch after it was merged?

The change was so minimal that I got lazy. For very small fixes yes, but apparently the branch will not get deleted afterwards. I have not seen any option to delete the branch after merge.

I think the unstable build is broken:

Cloning the remote Git repository
Cloning repository git://github.com/ElektraInitiative/libelektra.git
 > git init /home/jenkins/workspace/workspace/elektra-mergerequests-unstable # timeout=10
Fetching upstream changes from git://github.com/ElektraInitiative/libelektra.git
 > git --version # timeout=10
 > git -c core.askpass=true fetch --tags --progress git://github.com/ElektraInitiative/libelektra.git +refs/heads/*:refs/remotes/origin/*
ERROR: Error cloning remote repo 'origin'
hudson.plugins.git.GitException: Command "git -c core.askpass=true fetch --tags --progress git://github.com/ElektraInitiative/libelektra.git +refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout: 
stderr: fatal: The remote end hung up unexpectedly

Full log

@KurtMi unstable works again (as do most builds), but the Walk error persists on some of the simpler build jobs. It seems like the branch is still available somewhere, maybe in a cache on the build servers?

 > git -c core.askpass=true fetch --tags --progress git://github.com/ElektraInitiative/libelektra.git +refs/heads/*:refs/remotes/origin/* --depth=1
Seen branch in repository origin/debian
Seen branch in repository origin/kdb_import_man
Seen branch in repository origin/master
Seen 3 remote branches
FATAL: Walk failure.
org.eclipse.jgit.errors.RevWalkException: Walk failure.

@mpranj Maybe we should add more scripts in-source, this would allow easier updates for every build job. In #806 we found yet another bug with spaces in the build directory, so we should globally (for every build job) add spaces to the build directory. I would prefer if we can add a script jenkins-setup which exports some useful variables (such as export HOME="$WORKSPACE/user space") and does

mkdir "build space"
cd "build space"

Furthermore we should create build jobs that update one global repo. For individual tasks, see above.

The global repo could definitely help reduce bandwidth. Build scripts in-source could also be a good idea, at least they would be tracked by git.

I'm not a fan of spaces in paths, but sure.

fast build in passwd broken?

The fast build job is annoying, I'll try to remove kdberrors.h on every build and see if it works more smoothly then. In the long run @manuelm's proposal in #730 is the best solution: we should simply check how the source got updated, and based on this take appropriate measures.

I think #894 fixes the fast build, too; I will comment out the line removing kdberrors.h.

Some jobs are broken, for example the html doc job.

@mpranj Do you have time to look at it?

@markus2330 Done. The remaining build failures don't seem build system related.

@mpranj Thank you! What did you do to fix it? I think it would be useful if we also collect solutions to build server problems here.

I changed in "Source Code Management" > "Git"
"Branch Specifier" value "**" to "${sha1}"

This is what we use in the other jobs too. This allows to trigger the build by button (branch defaults to master) or by github PR builder (sha1 of commit).

I recall setting the ENV variable "sha1" to "master" once. It seems missing now, but the jobs work fine so let's ignore that.

I think we would be able to speed up the builds a lot more by using Object Libraries more frequently. A lot of object files are compiled multiple times. We would only have to make sure that the compile flags are the same for every place we use the object libraries in, but I guess this should be easily possible.

An example where it could make a big difference is KDB in my opinion.

@Namoshek Please only post build _server_ stuff here, not about build system in general. Object libraries are already used for plugins, but for different variants different object libraries are needed nevertheless (because of different compiler flags). But please report concrete suggestions as separate issue (Do you mean kdb tools?).

Jenkins was upgraded to 2.7, all plugins upgraded, recommended plugins were added:

  • Pipeline (installation seems to have failed?)
  • GitHub Organization Folder Plugin

and some plugins uninstalled:

  • Branch API
  • CVS/SVN (seems to be no longer essential)

Additionally, ruby-dev was installed on every agent.

I updated the "Current Issues" in the top post. It would be important that Elektra also compiles without any dependencies installed, so we should check this with build server agents that do not have any dependencies installed (except cmake and build-essential). FreeBSD and OpenBSD build agents, however, are important, too ;)

@mpranj Do you have any idea what is wrong with elektra-multiconfig-gcc47-cmake-options? They have "fatal: reference is not a tree:" errors all over. The job has "sha1" in its config?

I made the multiconfig independent of concrete compilers (there are enough other build jobs for specific compilers), so they should be able to run on any agent.

@markus2330 No idea. I didn't change anything and just:

  • manually triggered a build from master
  • triggered a build from github

Both builds were able to check out the tree and start building.
So: I can't reproduce it.

One idea: travis had problems when there was a PR and you merge it before travis could do the clone. Maybe something similar happened with elektra-multiconfig-gcc47-cmake-options since a build there takes ~3hrs.

Pushing artifacts to doc.libelektra.org works again, Jenkins and Plugins were upgraded.

I updated the new build server URL https://build.libelektra.org in the github Webhooks. So hopefully the next PRs will be built again.

Jenkins home is almost full. It also seems not to be building PRs.

Jenkins home is almost full.

Thanks I resized it.

It also seems not to be building PRs.

Do you have any idea what could be wrong here? Manual triggering seems to work?

Publishing docu to doc.libelektra.org:12025 failed for the builds. I restarted the ssh server (on the build-homepage agent) and it seems to work again.

The vserver for *.libelektra.org seems to be unreachable. I reported it at hetzner.

The reason for shutting down the network connection from the container was that libelektra.org was compromised. See #1505 for more information.

It would be great if we can add a git-build-package for stretch, there are more and more places popping up where we would need debian packages built for stretch.

@BernhardDenner did you look through the top post? Is there something you can easily do? Is there something @mpranj needs to consider when improving build server jobs in the future?

As requested by @sanssecours I (temporarily) disabled elektra-mergerequests so that PRs do not always get a wrong error. Furthermore I added for @KurtMi

jenkins build gcc-configure-debian-optimizations please

@KurtMi If you need to change what the build job does, simply modify scripts/configure-debian-optimizations

@sanssecours now also has access to the build server.

Btw. you can cancel jobs if they are superseded by other jobs anyway (currently there is a heavy load). Only take care to not abort jobs for active PRs, otherwise the PR won't get green. (Unless you restart them with the phrase "jenkins build ... please".)

@sanssecours Jenkins was restarted (for the second time). Could you please document it here if you install new plugins. (Updates do not need to be documented.)

Requests for restarts can also be done here.

I changed "Quiet period" from 2 to 5 to give more time to merge multiple PRs and/or push different commits without rebuilding repetitively.

Furthermore, I opened the issue #1689 describing timeouts in builds (I did not add it here due to long error messages).

I also moved some obsolete tasks above in the new section "Obsolete/irrelevant Issues [reason]:".

I updated the plugins on the build server. Hopefully the updates fix the problems we have in PR #1698 and PR #1692.

@markus2330 Can you please restart the build server?

I upgraded Jenkins from 2.73.2 to 2.73.3 and restarted Jenkins.

Hopefully the updates fix the problems we have in PR #1698 and PR #1692.

It might be a general problem not related to these two PRs? Hopefully it is fixed now.

Looks like JENKINS_HOME is almost full 😢.

@markus2330 👋 Could you please

  • clean the home directory or tell me how I can do that,
  • update Jenkins and all outdated plugins?

Thank you for pinging me!

Seems like a plugin had an "Arbitrary file read vulnerability", namely the "Script Security Plugin 1.35".

I upgraded all plugins and also upgraded jenkins from 2.73.3 to 2.89.1.

Furthermore, I resized the disk from 20GB to 50GB.

We should restart the server soon, there are some non-restarted processes that might be affected by library upgrades, which might be insecure at the moment (not related to jenkins, though). @BernhardDenner Can you do the restart (and do the fixes if something does not start up)?

Please do not hesitate to report anything I broke during these upgrades.

The server had load 20 and hardly responded. We need to be careful with "jenkins build all please", and in the longer term we should move the agents away from the main server.

I upgraded to Jenkins 2.89.2 and restarted the server. I'll report when everything is up and running again.

Seems like all agents are now disconnected with the error "The server hostkey was not accepted by the verifier callback".

@BernhardDenner I saw that puppet apply was running, are you currently working on the setup?

I tried to downgrade to 2.89.1 and 2.73.3 without any success: connecting to agents still does not work.

A huge thanks to @BernhardDenner who fixed the ssh problem.

We should stop upgrading Jenkins without reading the release notes; it seems like even the stable updates break too many things. (And they are not even revertible by downgrading!)

I have to report a major bottleneck at the build server. elektra-multiconfig-gcc47-cmake-options takes 14h and elektra-multiconfig-gcc-stable takes 4h. I am not sure if that is new behavior, and I am aware that these jobs are not a single build job, but this bottleneck should not go unnoticed.

Thank you for reporting. The idea was to distribute subjobs of these jobs to the ryzen hardware, unfortunately nobody had time for the setup. If someone is interested, please contact me.

a7.complang.tuwien.ac.at (ryzen) seems to have crashed. I reported the problem. Our admin will hopefully restart the computer on Monday.

I temporarily disabled the incremental (strange error, see #1784), the admin restarted the ryzen server, and then I restarted jenkins (because Jenkins could not connect to ryzen and there was a huge backlog of ryzen builds).

ryzen now works again and builds the backlog.

The idea was to distribute subjobs of these jobs to the ryzen hardware

@markus2330 I've noticed there is an option called Run each configuration sequentially in the configuration matrix settings of the multiconfig job. Maybe it gets distributed automatically if we simply untick this so it builds several config options at once, or have you tried this already?

No, I haven't tried it, please give it a try.

@markus2330 judging from the build server queue this seems to do the trick, i'll apply it to gcc-stable additionally after it worked for gcc-stable-multiconfig

I noticed however that the ryzen doesn't seem to handle those jobs. I think this is because it is configured to only handle jobs matching its tags, and the multiconfig builds don't seem to set those tags appropriately at first glance. So we should either make ryzen execute everything that's possible or set more tags on the build jobs. It looks like ryzen doesn't handle jobs which don't have any tag set at all.

i'll apply it to gcc-stable additionally after it worked for gcc-stable-multiconfig

Thank you!

I noticed however that the ryzen doesn't seem to handle those jobs.

No, it doesn't but it already executes a bunch of other jobs. But maybe we can make v2 to do so?

i'll apply it to gcc-stable additionally

done

But maybe we can make v2 to do so?

I've configured the v2 and now I only wait for the #1806 PR to be merged so I can allow more than one build on it. I thought that 8 jobs should be fine for it since it's an 8-core, with -j 2 in order to utilize the SMT as well?

To restart the build-v2 container in case v2 crashes or gets restarted, use the command below. Note this can only be done if the container has been built already (follow doc/docker/jenkinsnode/README.md for these instructions). Use this command to restart after the container was created but has stopped:

docker start build-v2

Also, to forward the ssh connection of the new build node from the v2 via the a7 to the outside world, i've set up the following ssh tunnel on the a7 (the docker container maps its ssh port to 22222 on the v2):

ssh -L 0.0.0.0:22222:localhost:22222 <username>@v2.complang.tuwien.ac.at

Adding to that, the public ssh key of the docker container changes on every image rebuild and thus has to be adjusted in the build server as well. This is not necessary if the container only gets restarted. To find it out, enter the following on the v2:

sudo docker exec -it build-v2 bash
# now you should be on the docker machine
cat /etc/ssh/ssh_host_ecdsa_key.pub
> ecdsa-sha2-nistp256 <blablablalb> <root@6b906cc01f23>

Only copy the first two fields, i.e. the key algorithm and the key itself; don't copy the user information at the end into ryzen-v2's ssh key configuration in jenkins!

v2 is down, I informed the admin. It's quite strange: a7 and v2 are both completely new hardware and the incidents are quite frequent.

v2 seems to be back and i've restarted the build container there. So hopefully we have faster builds now again. Additionally i've added elektra-haskell to "jenkins build all please" as I want to have stable haskell builds for my typechecker, so testing is a good addition.

Furthermore i want to leave a note here that we additionally want to create another build node that takes care of the mm builds, which now appear to be the new bottleneck on the v2 as well.

Last, @markus2330 i think the point run bashism checker is already done, as this is one of our usual tests /testscr_check_bashisms.sh.

All nodes for the label debian-jessie-homepage||homepage and the build agent debian-wheezy-mr are currently offline. Restarting the build agents does not work. It would be really nice if someone with SSH or physical access to these nodes could look into this problem.

I restarted the vservers but restarting the agents within jenkins did not work with the error "No valid crumb was included in the request". @BernhardDenner Do you have an idea?

Seems like v2 is down, too. So we have 3 non-functional agents 😢

v2 is back, but i wonder why it always goes down? could it be related to our build process?

Regarding the "no valid crumb" i saw that too when i tried to restart the agent on v2, but when i simply tried it again it worked.

v2 is back, but i wonder why it always goes down? could it be related to our build process?

It seems to be a kernel/hardware failure (not even sysreq works when the computer hangs). Our usage might trigger the error. The computer was running without any errors for several months and since we use it we already crashed it three times.

I upgraded the kernel and purged the X-server.

Regarding the "no valid crumb" i saw that too when i tried to restart the agent on v2, but when i simply tried it again it worked.

Thank you! I was now able to start the homepage agent again.

Furthermore, I disabled the agent debian-jessie-minimal and its build job. We should create docker containers for minimal jobs; I added this as a task.

To our surprise, the community server was down yesterday because it crashed and a wrong ARP cache redirected our IPs to other servers. After restarting, everything worked again, but the raid sync is still ongoing. (It might be so slow because of the high load.)

The community server has a near-constant load of 10. At the moment it is 13.20 11.29 9.35. We really should reduce the jobs directly running on the community server and move the load to v1. Any volunteers?

There is an upgrade of Jenkins from 2.89.2 to 2.89.4. Unfortunately I did not find an easy way to see the Changelog (apt-get changelog fails because it is an unofficial package). Any reasons to not do this update?

The upstream changelog is at https://jenkins.io/changelog-stable/
Apparently 2.89.4 contains security fixes.

Thank you for looking it up!

I upgraded to Jenkins 2.89.4 and everything is running again.

elektrahomepage was not started by default after reboots, I changed that (/etc/vservers/elektrahomepage/apps/init/mark=default).

I also activated the test-docker build job.

v2 is down, again 😢

But at least a7 seems to be stable now.

I installed clang-format-5.0 on a7 and on the stretch node (debian-stretch-mr).

For the next PRs please reformat according to clang-format-5.0.

https://build.libelektra.org/jenkins/job/elektra-clang-asan/ was temporarily disabled.

We are currently investigating v2. UEFI is from 6.6.17. It seems like the crashes always happened at weekends, maybe there is a higher load at that time? I'll try to replicate v1 setup on v2.

v1 and v2 are up and running with the same kernel.

@e1528532 seems like your ssh bridge did not start and the command in doc/docker/jenkinsnode/README.md fails with "unable to prepare context: unable to evaluate symlinks in Dockerfile path: lstat /root/Dockerfile: no such file or directory" and then "Unable to find image 'buildelektra-stretch:latest' locally". This means v2 is not reachable at the moment.

@markus2330 wrote: moved issue to #1829

I think one of the latest updates to the build server broke elektra-gcc-configure-debian-stretch, which is not able to connect to the repository anymore:

stderr: fatal: Unable to look up github.com (port 9418) (Name or service not known)

.

I think the problem with elektra-gcc-configure-debian-stretch is the build server ryzen, which is unable to connect to GitHub. I changed the label for the build job from debian to debian-stretch-mr accordingly. Now the build job seems to work again.

ryzen, which is unable to connect to GitHub

Seems like our admin's NetworkManager fix with "managed=true" did not work reliably. After a restart, "/etc/resolv.conf" was again a dangling symlink. I fixed it again; GitHub should be reachable from ryzen. ryzen v2 is unfortunately still not reachable (the ssh bridge is missing).

Elektra 0.8.22 is finally released. I'll add the link to #676 once the website has been built; building the website takes more than an hour. Maybe we can move the homepage build to a faster machine and only copy the resulting website to its location.

I think we have to do something about the server that hosts http://build.libelektra.org. It is just unbearably slow and unresponsive 😢. Personally I do not care if a full build of all tests takes a long time. However, as it currently stands it takes minutes to even connect to the server, if we are able to connect at all:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>502 Proxy Error</title>
</head><body>
<h1>Proxy Error</h1>
<p>The proxy server received an invalid
response from an upstream server.<br />
The proxy server could not handle the request <em><a href="/jenkins/">GET&nbsp;/jenkins/</a></em>.<p>
Reason: <strong>Error reading from remote server</strong></p></p>
<hr>
<address>Apache/2.4.10 (Debian) Server at build.libelektra.org Port 443</address>
</body></html>

.

Yes, not only Jenkins is affected but everything else running on this server. For me the situation is often unacceptable too. It seems like there is too little RAM (2 GiB of swap is used).

Jenkins might be the reason; there are dozens of Java processes leading the list in htop. For a long time we had enough RAM and swap was hardly used, and we did not change much (except upgrading Jenkins and increasing the number of build jobs).

I suggest we stop using the community server as an agent altogether. For this we would need v2 back, but @e1528532 seems to be busy. We could also rent a better server, but then we would need someone who has time for the migration.

@markus2330 Can you please restart the build server? Currently even simple jobs like elektra-todo fail.

I restarted v2 on Sunday but apparently it's already down again, so we should first get v2 stable before thinking about putting other things on it.

I restarted Jenkins and v2. Jenkins seems to be running smoothly again.

@e1528532 The ssh tunnel still seems not to work. Even after restarting a7 it is not possible to connect to v2.

so we should first get v2 stable before thinking about putting other things on it.

The main downtime was caused by the ssh bridge. If v2 had troubles, it was usually restarted within a day.
I now removed the rest of the X server, so I hope v2 is now stable, too. For a7 this seemed to be the trick (no restart was necessary for quite some time). Without load on v2 (which requires the ssh bridge), however, we won't know for sure if it is stable.

What about splitting up discussions about hardware (restarts) and software (Jenkins upgrades)?

There seems to be a network connectivity problem between a7 and v2. v2 is up and running, but I still get "No route to host". Seems like I cannot fix it today.

The network of v2 was down because uninstalling GNOME also removed network-manager. We have now fixed the network (using /etc/network/interfaces) and upgraded to the latest BIOS/UEFI. So hopefully everything is stable now.

Btw. there is one more piece of hardware we could use via an ssh bridge... (PCS)

The ssh tunnel still seems not to work. Even after restarting a7 it is not possible to connect to v2.

Yes, this was not automated. Now I have taken care of everything. The docker container should restart automatically now if the machine restarts; at least I've set the --restart flag to "always" according to https://stackoverflow.com/questions/29603504/how-to-restart-an-existing-docker-container-in-restart-always-mode#37618747
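For reference, the linked answer boils down to a single command (the container name here is just a placeholder):

# switch an existing container to the "always" restart policy (container name is a placeholder)
docker update --restart=always buildelektra-stretch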

Furthermore, I've created a new user called "ssh-tunnel-a7-v2" which has no password set on both a7 and v2 (so password authentication is disabled for that one). I created an ssh key for the user on a7 and added its public key to the authorized keys on v2. Then I added a systemd service at /etc/systemd/system/ssh-tunnel-a7-v2.service which opens the ssh tunnel automatically, following https://gist.github.com/guettli/31242c61f00e365bbf5ed08d09cdc006#file-ssh-tunnel-service. Therefore it should also work when the server gets restarted or the ssh connection crashes, and it no longer depends on me or my user. Due to the use of a key, no passwords have to be used for connections.
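A minimal sketch of such a unit file (the host alias, ports, and ssh options are assumptions, not the actual configuration):

# /etc/systemd/system/ssh-tunnel-a7-v2.service (sketch; host alias and ports are assumptions)
[Unit]
Description=ssh tunnel from a7 to v2
After=network-online.target

[Service]
User=ssh-tunnel-a7-v2
# -N: no remote command; expose v2's sshd on a7's port 22222
ExecStart=/usr/bin/ssh -N -o ServerAliveInterval=60 -o ExitOnForwardFailure=yes -L 0.0.0.0:22222:localhost:22 v2
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target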

On top of that, v2 has of course been restarted with this new automated configuration active. Hopefully it survives the next crash (if there is one); theoretically it should, but we will see.

The build job test-docker always fails, if Jenkins executes the job on ryzen v2:

docker inspect -f . elektra-builddep:stretch
/home/jenkins/workspace/test-docker@tmp/durable-7755b812/script.sh: 2: /home/jenkins/workspace/test-docker@tmp/durable-7755b812/script.sh: docker: not found
[Pipeline] sh
[test-docker] Running shell script
+ docker pull elektra-builddep:stretch
/home/jenkins/workspace/test-docker@tmp/durable-d1c2efc5/script.sh: 2: /home/jenkins/workspace/test-docker@tmp/durable-d1c2efc5/script.sh: docker: not found

. I wanted to restrict the job to nodes other than ryzen v2, but it seems the option for this step is missing in the configuration page of test-docker. Could someone please have a look and fix this problem?

Thank you for looking into it! Isn't it possible to assign multiple labels to the agents? Then you could assign a unique label to ryzen v2 and tie the job to it.

Luckily, we will get support for our build server soon :+1:

Isn't it possible to assign multiple labels to the agents?

As far as I know yes, it is possible to assign multiple labels to an agent.

Then you could assign a unique label to ryzen v2 and tie the job to it.

As I already stated before [1], the option “Restrict where this project can be run” seems to be missing:

I wanted to restrict the job to nodes other than ryzen v2, but it seems the option for this step is missing in the configuration page of test-docker.

.

Ahh, I misunderstood your statement as: "There is no way to write a boolean expression that allows me to say (stable && !ryzenv2)", not that there is no option for agent restriction at all.

Maybe this can be done by the DSL. I'll ask Lukas if he knows what to do.

Hi,

As @sanssecours noted, ryzen v2 does not have docker installed, but it has the docker tag.
test-docker runs require nodes to have the docker tag.

Possible solutions are to either install docker on the node or remove the tag from the node in Jenkins.

Possible solutions are to either install docker on the node or remove the tag from the node in Jenkins.

Thank you for providing a solution for the problem. I just removed the docker tag from ryzen v2. As far as I can tell everything seems to work now.

I updated the description of the 'ryzen v2' node to reflect that it is actually 'only' a docker container running on v2. That is why docker was not available even though it is installed on v2.

I also added a plugin to Jenkins which allows easier build data visualisation (no need to click into each build).

v2 is down again, I reported it.

I rebooted v2 but could not reconnect the agent.

At least we finally got error messages of what happened before the crash (of course there is no guarantee that the error messages have anything to do with the crash):

watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [docker-containe:789]
...
NMI watchdog: Watchdog detected hard LOCKUP on cpu 14
...

Strange, it looks like my restart machinery did work: both the ssh tunnel and the docker nodes were restarted, and I can connect to a7.complang.tuwien.ac.at -p 22222, which means everything should be open. However, Jenkins just shows me an infinite spinning wheel for some reason; no timeout, no nothing.

I tried my manual ssh bridge like we had before: the same. Restarted the docker container once more: the same. So honestly I'm not sure what exactly is wrong without an error message. The only thing I found is someone who apparently has a similar bug (spinning wheel but no message) but no solution other than restarting the whole Jenkins master node (which I haven't tried): https://issues.jenkins-ci.org/browse/JENKINS-19465

EDIT: I tried one of the suggested workarounds (reset the hostname configuration to something that doesn't exist, reconnect, let Jenkins realize the hostname is wrong, change back to the actual hostname); then it suddenly worked without any further hassle. So I guess this error occurred instead of a problem with my restart setup, but let's wait for the next crash to be sure ;)

@markus2330 I bet you already found this out yourself, but a quick search showed me that this might be related to the C-state configuration: https://bugzilla.kernel.org/show_bug.cgi?id=196683 ; there are some suggested workarounds for that.

Seems like the build server ryzen is unable to connect to our repository (https://build.libelektra.org/jenkins/job/test-docker/162/console):

Failed to fetch from https://github.com/ElektraInitiative/libelektra.git

The DNS config was dangling again. Since I don't fully understand why it is set up the way it currently is, I only restored the nameserver settings and restarted the docker build job.


Since I don't fully understand why it is set up the way it currently is

I am afraid nobody understands it. Maybe the DNS server is misconfigured and does not give proper nameserver information. For v2 we uninstalled the network manager and it seems like resolv.conf is now stable there. So one option is to uninstall network manager on a7, too. (and use /etc/network/interfaces) There is no reason that v2 and a7 have diverging setups, it is only because of sloppy administration.

Ideally (in the long run), we manage both using Puppet.

https://bugzilla.kernel.org/show_bug.cgi?id=196683 , there are some suggested workarounds for that

C6 should be disabled but we will continue the investigation.

The new build agent "ryzen v2 docker" does not seem to have a D-Bus daemon running, unlike "debian-stable-mm".

Can someone please either install/start it or tell me which script configures the multiconfig-gcc47-cmake-options builds so that I can add a snippet to make sure it is started?

@markus2330 I suspect that since both ifupdown and NetworkManager are enabled, they get into each other's hair. Disabling one of the two would certainly help.

Okay, I removed gnome and network-manager on a7 to be consistent with v2.

The new build agent "ryzen v2 docker" does not seem to have a D-Bus daemon running, unlike "debian-stable-mm".

The build agent lives within a docker container; hopefully @ingwinlu or @e1528532 can help you start up the dbus daemon there.

Thank you. It should be fairly easy. I start it in the docker container (Ubuntu 17.10 artful built from docs/docker/Dockerfile) with the following commands:

mkdir /var/run/dbus # create directory for pidfiles & co
dbus-daemon --system

The build server ryzen is unable to connect to our repository again 😭.

Sorry, my fault. Seems like stopping network-manager was enough to break the system.

It should be fixed now. Please do not hesitate to report any further problems.

@markus2330 Pretty sure it would have broken again on the next reboot.

I took the liberty of redoing the /etc/network/interfaces file (and moving the interface configuration into /etc/network/interfaces.d). That, combined with the removal of network-manager, should hopefully keep it stable.

Please review the configuration and maybe trigger a reboot to see if it worked.
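For illustration, such a file could look like this (the interface name and address are assumptions, not the real configuration):

# /etc/network/interfaces.d/internal (illustrative only; interface name and address are assumptions)
auto enp37s0
iface enp37s0 inet static
    address 192.168.173.1
    netmask 255.255.255.0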

@ingwinlu Thank you for the fix, the reboot worked nicely.

I found out that I removed too many packages, so I installed Java (openjdk 9 headless and default-jre) again :smile:

@ingwinlu Could you please make dbus run on the v2 agent? Ideally, please also relocate "/home/armin/buildelektra-stretch/Dockerfile" to some non-user-specific destination.

@markus2330 My proposal for how to proceed with the build environment actually foresees removing the current dockercontainer-on-v2 node and replacing it with a docker-capable one (i.e. no longer pointing to a docker container on v2 but directly to v2).

Afterwards we can setup the build pipeline to build the image itself from Dockerfiles checked into the repository to provide the different environments needed for tests.

I can prioritize the rollout of a dbus-capable docker image when I get to it, but I would not like to do work that will soon be irrelevant if I don't have to.

Yes, that makes sense!

The longer-term goal for v2 is that we will have to share it with some other docker containers (not for Elektra), so it would be good if all our parts were virtualized in a way that they cannot influence the other docker containers (maybe recursive docker or Xen?). We could/should do the same for a7 to have identical setups.

We will continue to have direct access to the machines, but we should reduce any risk of Jenkins killing Docker containers it has nothing to do with.

For some agents we already have a Puppet setup. It would be great if we do not bypass it or, even better, extend this setup to a7 and v2. I hope @BernhardDenner can give you more info about that soon.

The build server is down again 🙌.

By the way, we could move the discussion about the build server status to a GitHub Team discussion, since this topic might not be interesting to all people subscribed to this issue.

Yes, the whole server is down, including the build server :cry: And v2 is down, too (independently). I restarted v2 and changed the C-States option in UEFI. But it seems like there is a major problem with our server. Hopefully we get it replaced by better hardware with more memory :crossed_fingers:

GitHub Team discussion

Isn't everyone of ElektraDevelopers notified if we write something in the GitHub Team discussion? Here in this issue everyone can decide if he/she wants to subscribe.

For me a still open question is if we should split this issue up into two issues: hardware related and Jenkins related.

Isn't everyone of ElektraDevelopers notified if we write something in the GitHub Team discussion?

As far as I can tell, yes.

Here in this issue everyone can decide if he/she wants to subscribe.

That is also true for Team Discussions.

Build server is up again. Settings changed:

  • Adjusted Xms and Xmx settings to reduce the amount of garbage collection when lots of builds are queued
  • I noticed we use SCM polling. I throttled the poller to 4 concurrent polls globally to hopefully reduce a bit of the spikes the server is currently getting.

EDIT:

  • Set the number of builds to keep to 30 for all pipelines, as according to multiple sources those get read when accessing the web UI, and thus a large number of old builds slows down requests

Update Jenkins to ver. 2.107.1

  • Update jenkins war file
  • Disable the "Use browser for metadata download" plugin security setting, as it would not allow updating plugins
  • Update plugins to latest available versions

EDIT 2018-03-18:

  • Added a second executor to all nodes running on mm

  • Deprioritised all agents running on mr

Master should be way more responsive now under load. "Build all" should take closer to 2h compared to the 4h+ from before.


EDIT 2018-03-24:
sorry for the delays, busy week...

  • Added a new job to the jenkinsserver (elektra-jenkinsfile) that will run the Jenkinsfile found in the repo (once it exists)

EDIT 2018-03-28:

  • Redid the tunnel unit file, now it parses environment files and thus can be adjusted to point to multiple targets
  • Added v2 server as a slave capable of running docker

    • added jenkins user on v2

    • installed openjdk-9 on v2


EDIT 2018-03-29:

  • Fix ulimit settings on jenkins master

Although I'm pretty sure @ingwinlu or someone is already on this: It seems like ryzen v2 is misconfigured:

FATAL: Could not apply tag jenkins-BUILD_FULL=ON,BUILD_SHARED=ON,BUILD_STATIC=ON,ENABLE_DEBUG=ON,ENABLE_LOGGER=ON-185
hudson.plugins.git.GitException: Command "git tag -a -f -m Jenkins Build #185 jenkins-BUILD_FULL=ON,BUILD_SHARED=ON,BUILD_STATIC=ON,ENABLE_DEBUG=ON,ENABLE_LOGGER=ON-185" returned status code 128:
stdout: 
stderr: 
*** Please tell me who you are.

Run

  git config --global user.email "[email protected]"
  git config --global user.name "Your Name"

to set your account's default identity.
Omit --global to set the identity only in this repository.

fatal: empty ident name (for <[email protected]>) not allowed

from https://build.libelektra.org/jenkins/job/elektra-multiconfig-gcc47-cmake-options/185/BUILD_FULL=ON,BUILD_SHARED=ON,BUILD_STATIC=ON,ENABLE_DEBUG=ON,ENABLE_LOGGER=ON/console but happened on multiple occasions.

Oops, sorry. It has no deps installed and should only act as a docker host. I will remove the additional flags.
//EDIT: should be done, hopefully that was enough
//EDIT2: I also disabled test-docker for now as it obviously cannot find the build images required to run the tests.

But damn, the thing is fast if it actually gets something it can build.
//EDIT3: re-enabled test-docker to only run on nodes with the docker-prefab tag and gave that tag to ryzen

Sadly the problem seems to be bigger than originally expected.

Some jobs have their nodes hardcoded. Some tags are outdated (stable on jessie nodes). Some jobs did not require the right nodes and were only executed on the correct one because they had been executed there once before and built successfully.

The introduction of a new node (ryzen v2 native) scrambled the allocation around a bit even though it should not have.

Please expect some unexpected behaviour until everything is running where it was again.

Changelog:

  • renamed nodes to <os>-<hostname>-<buildenv>
  • disabled elektra-multiconfig-gcc47-cmake-options
    it actually has not been running on gcc47 slaves for quite some time now, with a mix of gcc49 or gcc63 building depending on where it was scheduled. If re-enabled, it should probably go onto the gcc63-enabled docker container on v2
  • retagged a ton of jobs (might have missed some of them)

    • elektra-todo was requiring stable, but not all stable nodes actually had sloccount installed

    • many more similar cases

a7 seems to be down.


Do you want to work on it? I could reboot it today. Alternatively, our admin or I could reboot it on Tuesday.

If it is not too much of a hassle. Otherwise I can't work on my PR over the weekend.


Okay, I restarted it and also disabled the C6 state in the BIOS/UEFI (it was enabled). I also removed gnome/xorg (I thought I had already done that?).

Btw. the screen was completely black, so we can only guess what the cause was.

These have been popping up on a7 every 10 minutes or so:

Apr 04 07:14:23 a7 kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 07:14:23 a7 kernel: [Hardware Error]: CPU:0 (17:1:1) MC15_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0xdc
Apr 04 07:14:23 a7 kernel: [Hardware Error]: Error Addr: 0x00000003705a2f00
Apr 04 07:14:23 a7 kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x0000015c0a400f03
Apr 04 07:14:23 a7 kernel: [Hardware Error]: Unified Memory Controller Extended Error Code: 0
Apr 04 07:14:23 a7 kernel: [Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
Apr 04 07:14:23 a7 kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x700b45 offset:0xf00 grain
Apr 04 07:14:23 a7 kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

Those might be the reason for the a7 downtime as well as some of the strange build behaviour on a7 lately.

Disabled valgrind section in elektra-ini-mergerequests as it was run via make run_memcheck

These have been popping up on a7 every 10 minutes or so

Yes, we already saw them. When the computer was bought, someone actually checked that ECC works, and no such errors occurred back then. Is the frequency of these errors somehow dependent on the load of the system?

Disabled valgrind section in elektra-ini-mergerequests as it was run via make run_memcheck

Thanks for cleaning this up.

I am having random outages (containers crashing, builds stopping in the middle, disconnects, ...) on a7 without any 'real' logs again; only the already mentioned memory corrections.

Thank you, seems like something is very wrong. And we now also have uncorrectable errors:

Apr  5 09:50:40 a7 kernel: [39549.503787] mce: Uncorrected hardware memory error in user-access at 73d6ce880
Apr  5 09:50:40 a7 kernel: [39549.503794] [Hardware Error]: Uncorrected, software restartable error.
Apr  5 09:50:40 a7 kernel: [39549.505882] [Hardware Error]: CPU:2 (17:1:1) MC0_STATUS[-|UE|MiscV|-|AddrV|-|Poison|-|-|UECC]: 0xbc002800000c0135
Apr  5 09:50:40 a7 kernel: [39549.506581] [Hardware Error]: Error Addr: 0x000000073d6ce880
Apr  5 09:50:40 a7 kernel: [39549.507287] [Hardware Error]: IPID: 0x000000b000000000
Apr  5 09:50:40 a7 kernel: [39549.507980] [Hardware Error]: Load Store Unit Extended Error Code: 12
Apr  5 09:50:40 a7 kernel: [39549.508677] [Hardware Error]: Load Store Unit Error: DC data error type 1 (poison consumption).
Apr  5 09:50:40 a7 kernel: [39549.509378] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
Apr  5 09:50:40 a7 kernel: [39549.510136] Memory failure: 0x73d6ce: Killing java:1470 due to hardware memory corruption
Apr  5 09:50:40 a7 kernel: [39549.510908] Memory failure: 0x73d6ce: recovery action for dirty LRU page: Recovered

There goes a7 again.

On a more productive front: I installed the Blue Ocean frontend for Jenkins. Preview

There goes a7 again.

It's really frustrating. I restarted the machine and reconnected the agents. I'll ask our admin to replace the RAM tomorrow, so expect downtimes tomorrow.

On a more productive front: I installed the Blue Ocean frontend for Jenkins. Preview

Looks great! Maybe you can show it to us in the next meeting?

Our admin is already in the weekend. I'll reboot a7 and reset BIOS/UEFI to factory defaults. If the errors continue over the weekend he will hopefully exchange the RAM.

EDIT: No build job was running, thus no build job was canceled.

EDIT: Everything is up again. No Memory errors so far.

Looking much better. Did somebody watch too much Linus Tech Tips and overclock the server?

Sorry, I had to stop Jenkins for some time. The server had load 20 and things that I needed to do were not possible anymore (>1h waiting time, then I gave up and stopped Jenkins).

Is it possible that even non-started build jobs need RAM? (the queue list was very long) Otherwise the local build jobs are the best candidate for these problems. (3 local build jobs were running)

@ingwinlu Ideally, we do not build anything on that server. Furthermore, can we make build jobs dependent on the load of a server? (Do not start jobs on a server with load > 5?)

I started everything again but I hope we find a quick solution for that.

I bumped the VERSION and CMPVERSION in the Jenkins system settings.

@ingwinlu It would be great if we had these settings also within Jenkinsfile.

@markus2330 see 8de9272051fe903a7df08f0abdf18879701f7ac9 for an example of how to achieve this in a Jenkinsfile

Removed make run_memcheck from the following targets because they had been failing for some time and finally showed up in the build system thanks to #1882:

  • gcc-configure-debian-stretch-minimal
  • gcc-configure-debian-wheezy
  • elektra-gcc-i386

restrict elektra-gcc-configure-debian-stretch to run on nodes: stretch && !mr

Update jenkins master to ver. 2.107.2 + update all plugins

I tried to add allowMembersOfWhitelistedOrgsAsAdmin to all build jobs today, but it seems like I still cannot properly trigger a "build all" (see #1863); only some jobs get executed.

@markus2330 https://github.com/janinko/ghprb/issues/416#issuecomment-266254688

Can someone please

  • fix
  • disable, or
  • (even better) delete

elektra-clang-asan 🙏. Currently the build job fails although all the failing tests:

  • testlib_notification
  • testshell_markdown_base64
  • testshell_markdown_ini
  • testshell_markdown_mini
  • testshell_markdown_tcl
  • testshell_markdown_xerces
  • testshell_markdown_tutorial_validation

work just fine on Travis.

They don't test the same thing, as (for example) they use different clang versions...

Since this thread is absolutely untrackable for bug reports or longer discussions, I will open up new issues for clang and clang-asan as soon as I get to it.

They don't test the same thing, as (for example) they use different clang versions...

I agree, while Travis uses the outdated clang 5.0.0, the Clang version on elektra-clang-asan is ancient (3.8.1). I do not see the value of an ASAN enabled build job for such an old compiler.

I created #1919 for the failing "testlib_notification" test on "elektra-clang-asan".

I tested a "build all" with all mr agents disabled and the master was perfectly responsive, while a "build all" with all mr agents enabled actually timed out some of the builds. Hence #1866 will definitely provide an improvement if we can get rid of all the mr agents (except wheezy, as it is not containerizable).

Further testing shows it is pretty much only the homepage build job. I removed it from "build all" for now, so it only gets run explicitly.

It will be replaced with a containerized solution.

v2 is online again with latest BIOS.

Please report any segfaults here (the CPU might be buggy).

a7 seems to be down?


a7 is up again with latest BIOS

a7 down again?

Yes, it had crashed. It showed the login prompt without any error message and did not react at all to any input (including SysRq). Only a hard reset helped.

If you have any idea what the problem could be, please tell.

Sadly, no persistent journal is set up on a7, so there are no logs.
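For reference, enabling a persistent journal is a small change (a sketch, assuming default journald settings):

# create the persistent journal directory; with the default Storage=auto journald will use it
mkdir -p /var/log/journal
# alternatively set Storage=persistent explicitly in /etc/systemd/journald.conf
systemctl restart systemd-journald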

When can we expect a7 and v2 to be available again?

Ohh, I did not know they were down. I'll ask our admin to reboot and if he is not able to I will do it at about 17:00.

Edit: He said he will reset them right now.

Edit: They are both up again.

I rebooted a7 and v2 manually today. It seems v2 is not reachable anymore. Can you please check whether it is actually running?

//EDIT: solved by fixing network configuration on both machines

Apparently a7 has gone down again.

Ok, I'll reboot it. Otherwise we will never get this release done.

Any indication of the cause? Just networking issues, or was the machine unresponsive again?

Everything should be up and running now.

I caught it at the right moment: there were some logs until it finally froze completely.

The logs were:

INFO: rcu_sched detected stalls on CPUs/tasks:
...
rcu_sched kthread starved for 7770 jiffies
watchdog: BUG soft lockup - CPU#2 stuck for 22s! [docker-gen]
... (many repetitions)
NMI watchdog: Watchdog detected hard LOCKUP on cpu 2

That could be anything, from the faulty Ryzen CPU to a bad PSU :(


About #1993

@ingwinlu Please disable jobs if they currently do not succeed (or at least do not trigger them by default nor with "jenkins build all please"). It should have high priority that we do not fail jobs in PRs where actually everything is okay. (asan failing for quite some time was not a good situation)

If they fail because of some Jenkins action you did, restarting the jobs would be nice :heart: Saying so here in #160 is also okay.

v2 is up again, with a new CPU!

Please submit many jobs for stress-testing :smile:

a7 seems to be down again.

Thank you, everything is up again.

I restarted a7 again.

Having a circuit which resets a7 every hour would most likely increase availability.

When can we expect to see a7 back online again?

a7 is back online

v2 is back online

The power supply of v2 will be exchanged in about 1h.

@ingwinlu can you disable the agents and enable them again afterwards? (If you are currently working on it.)

v2 agents are disabled for now

v2 is running again. It has a new power supply. Please submit many jobs to stress test the machine.

I updated the Jenkins plugins. The resulting reboot partly restored old Jenkins nodes (from before they were renamed), resulting in broken builds everywhere, as git repositories were broken.

I cleaned up affected repositories and cleaned out cached docker containers just to be sure.

a7 is down again and as such docker based builds are not available again.

I am also rolling back the update to xunit as it violates security restrictions.

I rebooted a7 and reconnected all agents.

docker builds are not available as a7 is down again.

I rebooted the server and reconnected the agents.

I think replacing a7 is the best way forward, see #2020

I think it's down again; my latest commit resulted in Cannot contact a7-debian-stretch: java.lang.InterruptedException

@e1528532 Thank you for writing it here! If you want, you can also vote in #2020

I restarted a7 and reconnected the agents.
Let us hope for the best that no problems occur during the weekend.

a7 is down again :cry: Surprising that it nearly worked the whole weekend. Might be the uptime record of the week. Nevertheless #2020 did not get many comments.

I restarted a7 (it reacted to SysRq) and someone else started Jenkins. Everything is up and running again.

a7 just went down again.

Thank you for the info! I restarted a7 and reconnected the agents.

Does it make sense that we have the agent "debian-jessie-minimal"? I think you can remove it safely once it is integrated in Docker. (and it seems like it already is)

EDIT:
In https://build.libelektra.org/jenkins/computer/a7-debian-stretch/log
and https://build.libelektra.org/jenkins/computer/v2-debian-stretch/log
there are warnings:

WARNING: LinkageError while performing UserRequest:hudson.Launcher$RemoteLauncher$KillTask@544b40e
java.lang.LinkageError
    at hudson.util.ProcessTree$UnixReflection.<clinit>(ProcessTree.java:710)
    ...
Caused by: java.lang.ClassNotFoundException: Classloading from system classloader disabled
    at hudson.remoting.RemoteClassLoader$ClassLoaderProxy.fetch4(RemoteClassLoader.java:854)

Bad news: v2 also went down.

EDIT: But I seem to be able to connect to it via ssh...
EDIT2: I issued a reboot on v2, but now I can no longer connect. Still pingable from a7 though...

EDIT3: Now a7 is down as well.

Thank you for reporting! I rebooted a7 and v2. We should reconsider #2020.

I think v2 is down again:

Cannot contact v2-debian-stretch: java.lang.InterruptedException

With v2 everything was fine but a7 was down again. All agents are now online again.

v2 seems to be semi-unresponsive again: pingable from a7, but no ssh access at all. Judging from the symptoms alone, it should be the btrfs issues. Can you reboot everything before you go home for the weekend?

@ingwinlu Thank you, @waht and I rebooted v2 successfully, but I cannot connect the agents ("Connection refused (Connection refused)"). Any idea what is wrong here? (Interactive ssh login works.)

ssh only worked when not connecting via the bridge.

After restarting the bridging service, the connection could be established.

Seems like the ssh tunnel ended up in a weird undefined state and did not kill itself. Not sure why it did not kill itself (ServerAliveInterval is on).

I also needed to manually clean out all workspaces on v2, as the fs was corrupted and all build jobs on it failed.

a7 seems to be down. I am not sure if I can reboot it before Monday.

I restarted a7 and v2. (v2 because there were error messages on a7 that it cannot create the ssh bridge to v2. But even after restarting v2, the same error messages occurred. Nevertheless, the ssh bridge seems to work. Maybe some dependency (network?) is missing in the bootup scripts of a7?)

Maybe some dependency (network?) is missing in the bootup scripts of a7

No it is there.

it cannot create the ssh bridge to v2

I believe this behaviour occurs when the v2 kernel starts to be unresponsive (and thus no incoming network connections can be established). In the past you mentioned logs indicating btrfs errors on v2.

I am preparing the build server for a shutdown to replace the CPU later.

a7 and v2 are back again (a7 with new CPU, v2 with its root filesystem checked)

While it kept up during the day (with consistent building), it seems like a7 just went down again.

Yes, I'll restart tomorrow morning.

We should again discuss #2020.

I restarted a7. It was again a CPU hang.

a7 is down again.

Thank you, I restarted it. All agents are again online.

Looks like some of the debian nodes are down, and thus a few PRs have already been waiting a long time for the test execution to start. Is that intended?

Looks like some of the debian nodes are down, and thus a few PRs have already been waiting a long time for the test execution to start. Is that intended?

During an upgrade on the mm-debian-unstable node the machine became unresponsive and we have not been able to connect to it since. Since the owner is not responding to our emails it will probably be gone forever.

While we have already ported a good number of build jobs over to the new system, the ones missing are those that are currently hanging in the queue.

I disabled the affected jobs and marked them as to be replaced + removed from the docs

@ingwinlu Thank you for taking care of this!

Re-adding the xdg tests is quite important if someone wants to work on the resolver. I added it to the top post here. Can you update what you have already achieved in the checklist above?

This time v2 is the lucky winner.

@markus2330 i cleaned up the top post.

Thank you for the cleanup! I'll reboot v2 (and maybe a7, let us see) tomorrow in the morning.

I rebooted it. There was no reaction and no message. See #2020

Please reboot a7 and v2.

I rebooted a7. I did not find any problem with v2, should I reboot it anyway?

i7 is now available at 192.168.173.96. Can you please create a bridge to it via a7?

You need to create a ssh-tunnel-a7-v2 user on the machine or create an account for me.

We added an additional build slave i7-debian-stretch that will help with libelektra build jobs.

v2-docker-buildelektra-stretch is now offline, as no more build jobs are scheduled on it. The ssh bridge on a7 that exposed the agent has also been disabled.

Hello @ingwinlu,
as discussed in the last meeting, I would need the access point.
Could you please send me the information via email? My email address is in AUTHORS.md.
Btw. I could not find your email anywhere; maybe you should add yourself to this file.

Our v2 build server will be offline till 31.07 09:00 as we are running benchmarks on it. Expect longer build times.

Sorry for the inconvenience.

//EDIT: extended downtime by 2 hours

Seems like the extended downtime has now also passed. It would be good to have the fast builds back after lunch (about 13:00). :smile:

mm-debian-unstable was upgraded and is back online. Are there jobs which we can reactivate and pin to the server?

It seems like i7's disk is full. My jobs fail with No space left on device (Job and Job)

I really like the new build interface (dockerization & jenkinsfile) by the way. Very helpful for reproducing build errors.

Thank you for reporting!

Unfortunately resizing would require a restart (the rootfs needs to be made smaller before the other partition could be extended) and would only bring 20G. I removed the Jenkins build folders, but they were tiny, so we are still at 99%.

So cleaning up _docker would be more effective:
@ingwinlu seems like _docker/overlay2 is huge. Any idea why all this stuff accumulated there?

I force-cleaned docker artifacts with docker system prune -fa. This cleaned up around 190GB of space used by docker images.


Thank you!

Can we put this in libelektra-daily or into a cronjob?
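A cron entry along these lines should do it (the schedule and log path are only suggestions):

# /etc/cron.d/docker-prune (sketch): clean unused docker data every night at 03:00
0 3 * * * root docker system prune -fa >> /var/log/docker-prune.log 2>&1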

Daily does something similar but less aggressive. Will have to take another
look when I am back in Vienna.


As you might have noticed, the build server was down in the morning (and maybe during the night). The power supply unit was damaged and has now been replaced.

Furthermore, a7 or v2 might go offline for benchmarks for short durations next week. You will see the offline message "benchmark" if Anton starts benchmarks. If a computer stays offline for too long (e.g. more than one day), please contact us. (Then someone might have forgotten to toggle it back online.)

Resolved an issue with our sid image.

testkdb_allplugins segfaulted in our sid image during debian-unstable-full-clang tests but only when executed on v2. I manually updated the image to use the latest available packages and pushed it onto our registry.

https://build.libelektra.org/jenkins/blue/organizations/jenkins/libelektra/detail/master/242/pipeline/411/ passed, but I will keep an eye on it.

The issue has been mentioned in #2216 and #2215 (@mpranj @sanssecours).

I implemented public access to our docker registry. See here for some documentation on how to access it.

Let me know if something does not work as expected.

//EDIT: it seems as if pushing does not work even though login succeeds.
//EDIT2: public access is disabled again (https://github.com/moby/moby/issues/18569). Restored functionality of the build system.
//EDIT3: the public repo is up again at hub-public.libelektra.org
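Pulling should then work with the standard docker CLI, e.g. (the registry path and image name below are assumptions):

# pull an image from the public registry (image name is hypothetical)
docker pull hub-public.libelektra.org/buildelektra-stretch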

Anton wants to replace the mainboard of the computer where hyper-threading is deactivated next Tuesday or Wednesday (we can choose). Does anyone need the build server on one of these two days?

I restarted the docker registry and ran a cleanup manually. Hopefully that resolved any build issues with the website image.

@ingwinlu thank you for the maintenance work!

Unfortunately the WebUI stage fails quite reliably, e.g. 321 or 320 (failed even earlier but also when pulling WebUI?).

I am more and more convinced that the canceling of jobs on master is a bad idea. We have nearly no successful builds on master because the builds are either canceled or fail due to network problems. In the commit history it is hard to tell what happened because either way they are simply shown as failure.

As a workaround I disabled the unreliable stages in c3b59ecef95287ffc33b094b37e03d0ec6b5710f, but I hope we can enable them again soon!

Should a7-debian-stretch still be offline for the benchmarks? (taken offline since Feb 21, 2019 10:47:56 AM)

Thank you for reporting! Seems like Anton forgot to reactivate it. I activated the node again, and I also removed the old nodes (except mm, as they are still running).

There was a downtime of our server in the morning. Everything runs again but we got an offer that they will exchange the hardware. So most likely we will have another downtime of about an hour today.

The server is up again. Unfortunately we got the same hardware. If somebody has time for the installation/setup, we can upgrade the hardware.

Looks like the Jenkins builds are quite slow (multiple hours for a full build). As far as I can tell, only a7-debian-stretch and i7-debian are executing tests, while all the other nodes are idle. Is this expected behavior?

Thank you for reporting this problem!

No, this is not expected behavior. It seems like v2 is down. I'll reboot it asap.

v2 should now be up again

v2 is down and I am afraid it will stay like this until Monday. Builds will be very slow till then.

v2 is up again with a new mainboard

I'll upgrade all 3 build agents (i7, v2, a7, in that order) to Buster to avoid #2852

I'll try to keep the downtime as minimal as possible. Build jobs might fail, please restart them (after the agents are up again).

i7-debian-buster, formerly i7-debian-stretch, is online again

v2 is next

v2-debian-buster and a7-debian-buster are also online again

I restarted the previous successful build job on master to see if everything works again:
https://build.libelektra.org/jenkins/blue/organizations/jenkins/libelektra/detail/master/853/pipeline

Furthermore, I added the PR https://github.com/ElektraInitiative/libelektra/pull/2876 to enable buster build jobs.

v2 seems to be down (the ssh connection also fails); unfortunately I am not in Vienna. I hope our sysadmin will fix it tomorrow.

a7 is now also down and with it i7 (which is connected via bridge over a7).

So currently no builds can be performed. I contacted the administrator.

All servers are up again. Please restart the builds by either pushing new commits or writing "jenkins build libelektra please" as comment to your PR.

Technical note: a7 went down because of "watchdog bug soft lockup". I tried to add "nomodeset nmi_watchdog=0". Let us hope the machines are not as unstable again as they have already been.

a7 (and also v2 and i7, because they are connected via a bridge over a7) is down. I contacted our admin. Please try to avoid starting builds now, as it will only make the queue longer.

a7 is back online (it already was yesterday); v2 was not affected.

According to the build server status page the servers:

  • a7-debian-buster,
  • i7-debian-buster, and
  • v2-debian-buster

are down. The Jenkins data directory seems to be pretty full too. And while we are at it: it would be great if we could upgrade Jenkins and its plugins. I am interested in fixing these problems. There are two problems though.

  1. I do not have much (or really any) experience administering servers.
  2. I would probably need physical access to the machines, since they seem to be quite unstable.

a7 is a bridge to i7 and v2, so with a7 being down we do not know about i7 and v2.

The access is no problem; I can give it to you. But you should be aware that upgrading Jenkins is a large operation, as it usually requires reconfiguring Jenkins (according to the release notes, of which there are many, since we have not upgraded in a while) and fixing the Jenkinsfile (according to API changes in the plugins). Drop me an email if you want access.

a7 (and all others) are up again.

a7 is down again, I contacted the admin.

Technical note: a7 went down because of "watchdog bug soft lockup".

A quick search suggests that this might be a BIOS problem. Did anybody check if there are BIOS updates available?

a7 is a bridge to i7 and v2, so with a7 being down we do not know about i7 and v2.

That seems like a bad design. Is there no way around that?

A quick search suggests that this might be a BIOS problem. Did anybody check if there are BIOS updates available?

We installed a new BIOS, replaced the CPU, and upgraded the kernel (see messages above). The system was stable after that. Now, after the upgrade to Debian buster, it is unstable again.

Nevertheless, I asked the admin if there is a new BIOS available.

That seems like a bad design. Is there no way around that?

i7 and v2 are in a private network, as there are not enough IPv4 addresses. I asked our admin if maybe an IPv6 setup would be possible.

I asked our admin if maybe an IPv6 setup would be possible.

We don't really need IPv6; it would be enough to use another, more stable server as the bridge.

Most likely, v2 is as unstable as a7 (there was only one crash, but this does not say much, because as soon as a7 dies it takes the load off v2). We could use i7 as the bridge. But if v2 and a7 go down, i7 alone is not of much use; it would take many hours to finish a build job. Furthermore, i7 does not have enough space for the docker registry.

So this change would be a lot of effort with little gain.

To fix the actual problem of a7 and v2 is much more promising.

Furthermore, i7 does not have enough space for the docker registry.

In that case the only solution is to fix a7.

Unfortunately a7 is down again :cry:

I tried to boot with the old kernel but this did not help.

For the BIOS there are some other versions available, but according to their release notes there is little hope that they will fix this problem, and there is no way to downgrade again if it gets worse...

The BIOS for a7 is currently being upgraded. Furthermore, we will try to use a newer kernel from backports.

a7 will hopefully be up again soon.

The new BIOS did not help; now a7 crashed within minutes.

a7 is down again, I contacted the admin. The newer kernel from backports will be tried in the next reboot.

a7 is up again with the 5.2 kernel

I think it crashed again...

Do we still get the same error messages or is there at least some change?

Yes, a7 is down again; I reported it to the admin. He will tell us about the messages when restarting.

Does anyone have another idea? (We upgraded the BIOS and kernel already.)

Some sources suggest problems with the nouveau graphics driver and that we should try nouveau.modeset=0 (somehow this is different from nomodeset). Disabling "C-states" in the BIOS was also suggested.

Yes, a7 is down again; I reported it to the admin. He will tell us about the messages when restarting.

Does anyone have another idea? (We upgraded the BIOS and kernel already.)

Maybe disable a7 as a Jenkins slave to determine if the crashes only occur when 'real' load is on the machine.

@ingwinlu Thank you, good idea. I have now reduced a7 to one build job (it was 2). For the weekend (once the admin has left the office) I will disable the agent completely.

@kodebach: Thank you, I will forward the information to the admin.

Is there any timeline for when a7 will be up again?

a7 is up again, with hyperthreading turned off and only one concurrent build job.

We could also take some of the load off a7 by moving the alpine and ubuntu-xenial builds to Cirrus. Both of them are simple "build and test" runs. They aren't doing anything special like reporting coverage.

Cirrus allows 8 concurrent Linux builds per user. Currently the linkchecker build is the only Linux build on Cirrus.

In fact ubuntu-xenial is a bit redundant, since our Travis builds run on Ubuntu Xenial.

Thank you for the tips, but we do not plan to offload any Linux builds away from Jenkins. On the contrary, @Mistreated will work on improving our Jenkins infrastructure to be even more up-to-date and useful (e.g. by building more packages). The advantages of our Jenkins are:

  1. we have it fully under our control
  2. we can easily scale it up (only Java+Docker is required on build agents)
  3. the Jenkinsfile is very neat and (for the most parts) quite easily extensible

But of course, everyone is welcome to also extend Cirrus (or any other additional build system which is offered for free, see #1540).

It was just meant as a temporary solution, to counteract disabling hyperthreading and limiting to 1 concurrent job on a7.

a7 only builds a small part (about 2/5), so the reduction by half should be barely noticeable. Or are there any specific problems now? (At the moment, of course, it takes time to catch up with the many jobs from the downtime.)

a7 only builds a small part (about 2/5)

2/5 is 40%. I wouldn't consider that a small part.

Or are there any specific problems now?

No, in fact it seems to be working better than before.

Sorry, I meant about 2/6 (1/6 is i7, 3/6 is v2). And this part is not removed but only reduced.

No, in fact it seems to be working better than before.

Perfect!

Seems like a7 is down again 😭.

Thank you! I reported it.

In the future it would be excellent if whoever detects a problem first reports it directly to

herbert at complang.tuwien.ac.at

It is enough to say that "a7 ist leider nicht erreichbar" ("a7 is unfortunately not reachable").

And then also report it here, so that Herbert does not get several emails.

Surely there is a way to make the Jenkins master send such an email automatically and maybe also post to this GitHub issue. It would be very weird if there were no Jenkins plugin for such a simple task...

Yes, there is https://wiki.jenkins.io/display/JENKINS/Mail+Watcher+Plugin, but I am not sure that it does exactly what we want. It might also send emails when someone takes an agent offline on purpose. And a personal email is much more likely to be handled quickly by the admin.

If we automate something, it should be directly rebooting the PCs when they are unreachable (maybe they even have some kind of watchdog built in?).
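If the machines do have a hardware watchdog, letting systemd feed it would be one option, e.g. (a sketch, assuming systemd and an available /dev/watchdog):

# /etc/systemd/system.conf (sketch): let systemd feed the hardware watchdog;
# the machine reboots automatically if the kernel stops responding
[Manager]
RuntimeWatchdogSec=30s
ShutdownWatchdogSec=10min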

a7 has been restarted and the "global C-State control" disabled.

It is, however, not online as a build agent.

Let us see if it also crashes without load. v2 and i7 work again.

The admin (Herbert) is not available tomorrow, so I'll leave a7 off as a build agent for now.

My plan (if there are no protests and a7 does not crash before then) is to turn a7 on as a build agent tomorrow. If a7 then crashes again, Herbert can restart it on Friday. Is this okay?

If the queue isn't too long, I think we should keep the build agent on a7 disabled for a bit longer. The last crash happened after 3 days. If we enable it tomorrow, we won't know whether the build agent caused the crash or not, unless it crashes before then.

Ok, then let us see how the queue size looks.

I hope that "global C-State control" finally fixes the problem and I think we need high load to test it.

The queue was very long and the master builds were all hanging as they need a7 for website deployment.

So I started the a7 agent again.

Some of the recent Jenkins build jobs were canceled because of missing disk space on the main build server. I freed some space by removing logs of old build jobs. Please note that I might also have removed some log files of new build jobs. In some cases the Jenkins build for your latest commit might have failed and you now only see some message about a 404 error. In that case, please either

  • use jenkins build libelektra please in a comment below the PR to restart the Jenkins build, or
  • rewrite the last commit without changes (e.g. using git commit --amend) and do a force push (see the sketch below)

. Sorry for the inconvenience.
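For the second option, the commands would be (the branch name is a placeholder):

# rewrite the last commit without changes, then force-push (branch name is a placeholder)
git commit --amend --no-edit
git push --force-with-lease origin my-branch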

Thank you for maintaining it!

I marked node v2-debian-buster as temporarily offline, since it does not seem to work correctly. For more information, please take a look at issue #2995 (and issue #2994).

Thank you for looking after the infrastructure!

v2 was out of disk space. I executed docker system prune (Total reclaimed space: 58.37GB).

Then I rebooted v2 and made the agent connect again.

I now executed du -h | sort -h to find other files to be removed.
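For the record, limiting the depth makes the biggest consumers easier to spot, e.g. (depth and path are arbitrary):

# show the 20 largest directory subtrees up to two levels deep
du -h --max-depth=2 / 2>/dev/null | sort -h | tail -n 20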

I started v2 again with a new Docker version. Please report broken builds immediately.

I reinstalled docker, purged all configs, and removed /var/lib/docker. Hopefully this fixes it.

v2 will be included again. Please report broken builds immediately.

As suggested here I now executed

ethtool -K enp3s0 sg off # on v2
ethtool -K enp0s25 sg off # on i7
ethtool -K enp37s0 sg off # on a7 (internal network interface)

and I also restarted i7 (there were many docker network interfaces, they are gone now)

docker-ce is now at 5:19.03.1~3-0~debian-buster everywhere

Please report broken builds immediately.

Looks like the master build failed again, because of connection problems to v2-debian-buster (see also issue #2995).

I asked our admin to look at the switch between a7/v2/i7. I deactivated v2 and i7 for now.

I restarted libelektra/master and libelektra-daily.

We changed the ports for all 3 PCs.

Then I removed the jenkins home directory on v2/i7 and restarted the v2/i7 agents.

Looks like there is no more space available on v2-debian-buster:

ApplyLayer exit status 1 stdout: stderr: write /app/kdb/build/src/tools/kdb/CMakeFiles/kdb-objects.dir/gen/template.cpp.o: no space left on device

.

Thank you for reporting, I made (much) more space on v2.

The removal job finished:

Filesystem Size Used Avail Use% Mounted on
/dev/sda3 417G 227G 164G 58% /

The build server is down due to the migration (so that we get a consistent state in the backup of the new build server).

Load on the build server was 200 due to kernel errors during a backup; the server did not react anymore and needed to be reset.

Log messages were (examples):

[87400.120008]  [<ffffffff810be6a8>] ? find_get_page+0x1a/0x5f

[87372.120005]  [<ffffffff81357f52>] ? system_call_fastpath+0x16/0x1b
[87372.120005] Code: f6 87 d1 04 00 00 01 0f 95 c0 c3 50 e8 d7 36 29 00 65 48 8b 3c 25 c0 b4 00 00 e8 d0 ff ff ff 83 f8 01 19 c0 f7 d0 83 e0 fc 5a c3 <48> 8d 4f 1c 8b 57 1c eb 02 89 c2 85 d2 74 16 8d 72 01 89 d0 f0
[87372.120005] Call Trace:
[87372.120005]  [<ffffffff810be6cc>] ? find_get_page+0x3e/0x5f
[87372.120005]  [<ffffffffa016962f>] ? lock_metapage+0xc2/0xc2 [jfs]

[87400.110012] BUG: soft lockup - CPU#0 stuck for 22s! [cp:15356]

Hopefully we can migrate soon, at the beginning of next week (@Mistreated ?)

The server is currently resyncing its RAID, so expect it to be very slow.

Jenkins builds are not performed anymore, see #3035

Jenkins has now started again. Please repeat the Jenkins build jobs.

Looks like v2-debian-buster is offline:

Opening SSH connection to a7.complang.tuwien.ac.at:22221.
Connection refused (Connection refused)

.

Thank you, I contacted our admin but I am afraid he will be out of office already.

Herbert already restarted v2 yesterday. He disabled "simultaneous multithreading".

If a server (v2, i7, a7) crashes again, please also directly contact our admin via "herbert at complang.tuwien.ac.at". Please also report here, to avoid multiple emails.

I think there is something wrong with the Git repository for the master branch on v2-debian-buster:

git fetch --tags --progress https://github.com/ElektraInitiative/libelektra.git +refs/heads/master:refs/remotes/origin/master +refs/heads/*:refs/remotes/origin/* --prune" returned status code 128:
stdout: 
stderr: error: object file .git/objects/9c/0bc3ca6fcbc610abd845aeff5f666938d24117 is empty
error: object file .git/objects/9c/0bc3ca6fcbc610abd845aeff5f666938d24117 is empty
fatal: loose object 9c0bc3ca6fcbc610abd845aeff5f666938d24117 (stored in .git/objects/9c/0bc3ca6fcbc610abd845aeff5f666938d24117) is corrupt
fatal: the remote end hung up unexpectedly

. I already restarted the build three times, but Jenkins always fails with the same error.

Unfortunately v2 uses btrfs, which sometimes seems to corrupt files. We already had a similar problem with a failing docker pull. In the current case, the file 0bc3ca6fcbc610abd845aeff5f666938d24117 seems to be corrupted. When running md5sum on the occurrences of this file, I get:

b9303a311bc8083deb57d9e5c70cde20  ./workspace/libelektra_PR-3038-NAC3HXDHQFTZWU7UCEHHPY5AOGDLHXYBZKKVUYJHDQR3VY4E7S4A@2/.git/objects/9c/0bc3ca6fcbc610abd845aeff5f666938d24117
b9303a311bc8083deb57d9e5c70cde20  ./workspace/libelektra_PR-3038-NAC3HXDHQFTZWU7UCEHHPY5AOGDLHXYBZKKVUYJHDQR3VY4E7S4A@2/libelektra/.git/objects/9c/0bc3ca6fcbc610abd845aeff5f666938d24117
d41d8cd98f00b204e9800998ecf8427e  ./workspace/libelektra_master-Q2SIBK3KE2NBEMJ4WVGJXAXCSCB77DUBUULVLZDKHQEV3WNDXBMA@2/.git/objects/9c/0bc3ca6fcbc610abd845aeff5f666938d24117

I now removed the whole directory for master and restarted the build. See also #3054
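For future occurrences, corrupt or empty loose objects can be detected before deleting a whole workspace (standard git; run it inside the suspicious clone, e.g. one of the Jenkins workspaces listed above):

# verify connectivity and validity of all objects in the clone
git fsck --full
# empty loose objects show up as "error: object file ... is empty"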

As I am now not available for a few days, please contact "herbert at complang.tuwien.ac.at" on issues regarding unreachable a7/i7/v2. @Mistreated will be responsible for everything that is not related to rebooting servers. (Hopefully we will get a new build agent soon.)

Please also always report here, to avoid multiple emails and so that everyone has a good overview of what is going on.

Does the build server currently have a malfunction?

It seems like the main Jenkins build server is unable to connect to i7. I marked the node as temporarily offline.

The build fails in arbitrary cases:
Here it was interrupted without any reason
Here a test case fails which is unrelated to my PR (I just added a design decision without touching any code)

Here it was interrupted without any reason

I got the same interruption code 143 for two PRs and cannot explain them yet. I restarted the build and hope that it works now.

Here a test case fails which is unrelated to my PR

This should be fixed thanks to @sanssecours with #3103. Please rebase to master.

The new Jenkins node hetzner-jenkins1 does not seem to work correctly. I marked the node as temporarily offline.

I upgraded docker on i7 and restarted the machine. I hope this fixes the problem. The agent is online again. Please report problems here (and/or deactivate the agent).

Currently a job of #3065 is running on i7.

@Mistreated can you debug hetzner-jenkins1 please?

Is there a possibility to easily turn off a link check for some time?

This has been happening for the whole day:

doc/tutorials/snippet-sharing-rest-service.md:63:0 http://cppcms.com/wikipp/en/page/apt
doc/tutorials/snippet-sharing-rest-service.md:158:0 http://cppcms.com/wikipp/en/page/cppcms_1x_config
doc/news/2016-12-17_website_release.md:94:0 http://cppcms.com
doc/tutorials/snippet-sharing-rest-service.md:62:0 http://cppcms.com/wikipp/en/page/cppcms_1x_build

Other PRs (most recently #3115, #3113) are affected as well. According to downforeveryoneorjustme the links really are not available.

UPDATE: The website is still offline. I made a PR for this #3117.

Is there a possibility to easily turn off a link check for some time?

You can turn off individual links by adding them to tests/linkchecker.whitelist (as you already found out).
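An excerpt of tests/linkchecker.whitelist could then look like this (assuming the format is one URL per line; the entries are the links that failed above):

http://cppcms.com
http://cppcms.com/wikipp/en/page/apt
http://cppcms.com/wikipp/en/page/cppcms_1x_build
http://cppcms.com/wikipp/en/page/cppcms_1x_config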

I cannot rerun failed builds from Cirrus. See https://github.com/ElektraInitiative/libelektra/pull/3113
https://cirrus-ci.com/build/6562476467945472

The button does nothing. Is there a magic trick to get it working?

edit: either someone changed something or the x-th try finally worked. The build is running again! :)

Looks like the build agent a7-debian-buster is not able to relaunch:

…
[10/28/19 06:02:59] [SSH] Starting slave process: cd "/home/jenkins" && java  -jar slave.jar
<===[JENKINS REMOTING CAPACITY]===>channel started
Remoting version: 3.25
This is a Unix agent
Evacuated stdout

.

edit: either someone changed something or the x-th try finally worked. The build is running again! :)

After I saw your comment I did press the button “Restart Failed Build Jobs” too. As far as I can tell pressing the button did indeed restart the failing build jobs.

After I saw your comment I did press the button “Restart Failed Build Jobs” too. As far as I can tell pressing the button did indeed restart the failing build jobs.

It didn't work for me though, I will provide some gif next time!

I'll restart the build server and its nodes. Build jobs of #3121 and #3099 need to be restarted as they had jobs on dead agents.

It didn't work for me though, I will provide some gif next time!

You do not need to provide a GIF, since I already believe you 😊.

Seems like Jenkins has trouble stopping; I'll wait a bit before I forcibly kill all Java processes.

I also upgraded docker on all agents (on i7 it was already upgraded).

Jenkins is up again with the heartbeat interval as suggested in #2984. All nodes are connected.

Please restart all jobs as needed and report any troubles here.

v2 is down, I asked our admin to restart.

I enabled v2 again, since it seems to be up and our build backlog is adequately sized.

Is there a problem with the build again (hetzner-jenkins1)?

Yes. I disabled the node.

It just hit Disk quota exceeded; I did not want to go overboard with memory. I cleaned it up now. It's up again.

Node updated.

I increased hetzner-jenkins1 to 4 parallel builds. This is only a temporary measure as long as nothing else is running there.

I temporarily decreased the number of executors on hetzner-jenkins1 from 4 to 2, as the testsuite is timing out. I think this happens when too many jobs are compiling while tests are running. We might need to limit the resources available to a single container so that it does not interfere too much with other jobs (see the sketch after this comment).

Feel free to correct it if you think this was the wrong approach.

EDIT: decreased to 1 since tests are still timing out and constantly re-building wastes even more resources.
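Regarding limiting the resources of a single container: docker run supports this directly, so a sketch could be (the image name and command are hypothetical, the flags are standard docker):

# cap one build container so parallel compiles do not starve the test suite
docker run --cpus=2 --memory=4g --memory-swap=4g \
  hypothetical-build-image ./run-tests.sh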

@mpranj Thank you for fixing it!

@Mistreated did you maybe only assign a single CPU or similar? Can you assign more and turn the number of executors higher? The hardware should be stronger than v2.

I disabled i7-debian-buster because there is no disk space left, which leads to all builds failing. If someone has access, please clean something up and re-enable it.

@mpranj thank you for disabling!

Sorry, where I currently am, ssh is blocked (some application firewall; ssh on other ports does not work either). So I cannot give access or do any cleanup now.

As i7 is the weakest of the agents, it might not be a big deal anyway.

@Mistreated did you maybe only assign a single CPU or similar? Can you assign more and turn the number of executors higher? The hardware should be stronger than v2.

I have no idea how strong v2 is.
Currently jenkins1 uses 4 CPUs with 8 GB memory and 16 GB swap. I can increase it easily, I just don't know to which point you want me to increase it.

A note for future hardware decisions: Phoronix seems to do compilation tests in their CPU articles (e.g. the Ryzen 7 3700X / Ryzen 9 3900X test, towards the bottom of the article).

Seems like Hetzner recently added the AMD Ryzen 7 3700X to their AMD-based servers.

I have no idea how strong v2 is.

@ingwinlu wrote about this in his thesis (to be found in the abgaben repo, lukas_winkler).

Currently jenkins1 uses 4 CPUs with 8 GB memory and 16 GB swap. I can increase it easily, I just don't know to which point you want me to increase it.

As long as we do not have anything else running on the server you can allocate all resources. Later we can still scale down (when we move Jenkins).

I updated hetzner-jenkins1.
The error where the frontend runs out of memory is corrected.
Now it runs 2 parallel builds.

Looks like there is no space left on v2-debian-buster:

validation.cpp:69:1: fatal error: error writing to /tmp/cccJFleY.s: No space left on device

.

Thank you, I marked it offline as well, until somebody has access to clean it up.

hetzner-jenkins1 just failed my 3 PRs because the disk quota is exceeded. Here is one output:

Starting pull/hub.libelektra.org/build-elektra-alpine:201911-78555f42df1da5d02d2b9bb9c131790fcd98511c3dea33c6d1ecee06b45fae55/ on hetzner-jenkins1
/home/jenkins/workspace/libelektra_PR-3106-LB35J55FSRLFKFEU2WP6AWVLM3IH4JWI6C5B57NWB6DDARN4JDUA@tmp/ff803792-a127-4b8f-8588-439af982c8a4: Disk quota exceeded

Marked hetzner-jenkins1 as offline because disk quota was exceeded.

I cleaned up i7 and v2 (by removing /home/jenkins/workspace/* and by running docker system prune). Now we have:

  • i7: /dev/mapper/i7--vg-home 199G 152G 37G 81% /home
  • v2: /dev/sda3 417G 255G 147G 64% /

Then I restarted the agents.

@Mistreated can you please fix #3160 so that this does not reoccur so fast. Please also fix hetzner-jenkins1. There are lots of resources on this machine; it is really not necessary that it hits a resource limit every day.

I don't know if there is a nice way to resize the disk down. That's why I'm not giving the node everything at once. hetzner is up again.

v2 was again out of space, I cleaned up: /dev/sda3 417G 315G 102G 76% /

@Mistreated https://build.libelektra.org/jenkins/computer/hetzner-jenkins1/log does not start up

I think the build system is currently completely stuck, PRs are not being built.

Thank you for reporting, I will restart Jenkins and cleanup some files as the disc is full.

@Mistreated https://build.libelektra.org/jenkins/computer/hetzner-jenkins1/log does not start up

Agent successfully connected and online

I assume everything is fine now with hetzner-jenkins1?

Thank you so much! Seems like Jenkins is still not reacting to builds. v2 and i7 now both fail with: java.io.IOException: Could not copy slave.jar into '/home/jenkins' on slave.

Jenkins is up again, please restart the jobs.

java.io.IOException: Could not copy slave.jar into '/home/jenkins' on slave.

fixed (was also out of space)

@Mistreated please fix the jenkins-daily job, as it does the cleanup tasks we now always need to do manually!

@Mistreated "jenkins build libelektra please" still is broken, is this related to the changes of the webhooks?

Maybe try today to change to the new Jenkins, but if you are not able to do it, please make the old instance work again!

I triggered a repository scan; now "jenkins build libelektra please" seems to work again.

Unfortunately. I marked v2 as offline until it is resolved.

Thank you for reporting!

@mpranj I gave you access, can you try to cleanup please?

Thank you! It seems to me that there is an abundance of resources wasted by old docker images lying around. Additionally it seems that btrfs + docker are buggy. Docker creates btrfs subvolumes for each container and does not clean them up properly afterwards. The docker system prune -f command does not free up the space either.
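For reference, a rough manual cleanup for those leftover subvolumes could look like this (a destructive sketch; it assumes docker's data root is /var/lib/docker with the btrfs storage driver, and that docker is stopped first):

systemctl stop docker
# every leftover container layer is a btrfs subvolume under this directory
for sv in /var/lib/docker/btrfs/subvolumes/*; do
  btrfs subvolume delete "$sv"
done
systemctl start docker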

I took v2 and a7 down for maintenance to free the resources and balance the btrfs.

docker login failed

Builds can't pull docker images. Something going on with docker hub?

Yes, sorry, with a7 I also took down the docker hub. I will post a message here when it's up again.

a7 including the docker hub is up again. I left v2 offline because it cannot log in to the hub to pull images?!? I don't know what's wrong there, I did not change any credentials or so, and the other nodes can log in. Any ideas?

Btrfs is still balancing in the background, a7 may be slower for another hour or so.

@mpranj thank you for fixing this! Which commands were needed for btrfs re-balancing?

Unfortunately, I do not know the credentials, I hope @ingwinlu can help us out.

The commands I used to re-balance were:

Fix a maybe-bug with btrfs:

btrfs balance start -dusage=0 -musage=0 /mountpoint

Really re-balance the fs; this takes a long, long time. The usage parameter can/should be tuned; this is what worked today:

btrfs balance start -dusage=80 /

The credentials can be changed easily, but we'd have to log in again with the new credentials on all jenkins agents which connect to the docker hub.

The bigger problem was that some docker containers were still running and docker system prune didn't do much. Therefore I took the agents down and freed everything up while they were down. There were TONS of containers just lying around.
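With the agent disconnected, a minimal force-cleanup can be done with standard docker commands (a sketch; --volumes is deliberately aggressive, so only run it when nothing on the host must survive):

# force-remove every container, running or not
docker ps -aq | xargs -r docker rm -f
# then drop all unused images, networks, build caches and volumes
docker system prune -a -f --volumes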

Yes, unfortunately the containers are quickly recreated. I hope @Mistreated can fix the libelektra-daily job soon (it executes docker system prune).

I also did some quite involved digital forensics and stole the hub credentials. :laughing:

v2 is up and running again.

Thank you so much! :100: Please send us the credentials.

Sent. Btw I think a7 is probably slow only because of poor disk speeds, but it's good that it has enough space for the docker hub. Seems like much of the time the CPU is doing nothing there.

Another thought: maybe we can additionally run the critical cleanup jobs via cronjob to avoid situations like the one we had now.

Please send us the credentials.

@mpranj I think I was in this group of "us". I wasn't in some CC or something like that?

@Mistreated sorry, I sent it to markus and didn't have your email. On a7 you will find CREDENTIALS.txt in your homedir.

I need the hetzner-jenkins1 node to test the new jenkins-server. I'm gonna turn it off on the old server until tomorrow morning.

You can easily create a second hetzner-jenkins2 for tests. If it is only for this night, it should be okay though.

Another thought: maybe we can additionally run the critical cleanup jobs via cronjob to avoid situations like the one we had now.

libelektra-daily does these cleanup jobs but it fails now: #3160. If you have ideas to improve this job, please tell us.

I think I was in this group of "us". I wasn't in some CC or something like that?

Yes, sorry, I forgot to tell mpranj that "us" refers to you.

I hope that it's ok that I keep hetzner-jenkins1 for a little while; all builds are good now. I think I can make the server fully running tonight.

v2 is unreachable, I contacted the admin.

I hope that it's ok that I keep hetzner-jenkins1 for a little while; all builds are good now. I think I can make the server fully running tonight.

This would be great!

v2 is unreachable, I contacted the admin.

Thank you but I am afraid he will not respond before Monday.

v2 gets a new kernel (it just crashed).

i7 will also be restarted.

All 3 servers (v2, a7, i7) now have "Linux v2 5.2.0-0.bpo.2-amd64 #1 SMP Debian 5.2.9-2~bpo10+1 (2019-08-25) x86_64 GNU/Linux"

They are up and online, please restart jobs if needed.

Just a note:
I scanned the repo again with the new server. This could cause some errors on the old one.

Seems like master [1] was also built on the new server. It was not successful. When clicking on the status, a login page appears [2]. Please reconfigure Jenkins so that everything can be viewed without being logged in.

Hopefully we can switch to the new Jenkins soon. Seeing errors from two different Jenkins instances does not make the situation easier :wink:

[1] https://github.com/ElektraInitiative/libelektra/commits/master#
[2] http://95.217.75.163:8080/login?from=%2Fjob%2Flibelektra%2Fjob%2Fmaster%2F1%2Fdisplay%2Fredirect

@markus2330 can a7 / v2 be rebooted remotely after an upgrade or are there some pitfalls?

Usually it works, but if it is not urgent it is better to wait until the admin is there. I can reboot on Tuesday, if this is okay for you?

Thank you! Nothing urgent, just a general question. It came up because Debian 10.2 was just released. I'll wait a bit with the upgrades.

You can do the upgrade nevertheless (only without reboot). Then, in the case of a crash, we will already have the 10.2 kernel when the admin presses the reset button :wink:

@mpranj can you maybe add a cronjob that purges the old snapshots? Or is this not possible without stopping docker?

https://docs.docker.com/storage/storagedriver/btrfs-driver/ recommends also rebalancing btrfs in a cronjob.

can you maybe add a cronjob that purges the old snapshots? Or is this not possible without stopping docker?

I can add a cronjob without stopping all docker containers. This might not clean up everything, but we can try it. Like I said, sometimes containers keep running forever/until the machine crashes. The complete cleanup requires us to temporarily disable the build agent, though; then we can force-stop all containers.

I can also add the rebalance as a cronjob.
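A sketch of what such a crontab could look like (the schedules and the usage threshold are assumptions; the commands mirror the ones used earlier in this thread):

# /etc/cron.d/docker-cleanup (hypothetical)
# nightly: remove unused docker data (add -a to also drop unused images)
30 3 * * * root docker system prune -f
# weekly: re-balance the btrfs filesystem
15 4 * * 0 root btrfs balance start -dusage=80 /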

Thank you, let us try it.

Master is out of memory. I wanted to run Scan Repository on the old Jenkins because a7 and i7 get the following error when pulling docker images:

docker login failed

I got v2 and hetzner-jenkins1 running on the new server now.

Master is out of memory.

Thanks for reporting. I removed some old coverage data and enabled the master node again. For everyone with open pull requests: Please restart your Jenkins builds with jenkins build libelektra please. Sorry for the inconvenience.

In #3234 @raphi011 suggested:

imo this is really urgent, the flakiness and slowness of the tests make it hard if not impossible to do any changes if you have to wait this long to verify them.

I agree it is really urgent but @Mistreated already does what he can.

So maybe we can use the build server more sparingly and only build if we really think the PR is to be merged soon. Unnecessary builds should be canceled.

Or what about (temporarily) stopping the automatic building of PRs on pushing of changes (so the build starts only with jenkins build libelektra please)? @Mistreated do you know how to reconfigure Jenkins to do so (I did not find the option)?

I also have the feeling that jenkins build libelektra please does not work at the moment, at least it didn't work for this build: https://github.com/ElektraInitiative/libelektra/pull/3073 I had to push an empty commit to start the pipeline.

Cleanup cronjob implemented and backport kernel upgraded on a7 and v2. There's quite a changelog for the kernel, it will be active on next reboot.

Thank you very much! Is the "old" backport kernel still installed so that we have a fallback if it doesn't boot?

Yes, it can be removed after a successful boot into the new kernel.

Starting pull/hub.libelektra.org/build-elektra-alpine:201911-78555f42df1da5d02d2b9bb9c131790fcd98511c3dea33c6d1ecee06b45fae55/ on i7-debian-buster

docker login failed

https://build.libelektra.org/jenkins/blue/organizations/jenkins/libelektra/detail/PR-3244/1/pipeline

I disabled i7 for a manual cleanup, kernel and docker upgrade. Somebody enabled i7 while I was working on it. Everything is up and running again.

@Piankero I restarted your build now.

I also have the feeling that jenkins build libelektra please does not work at the moment, at least it didn't work for this build: #3073 I had to push an empty commit to start the pipeline.

works now

@Mistreated do you know how to reconfigure Jenkins to do so (I did not find the option)?

I added the following to Jenkins Configuration:

Suppress automatic SCM triggering

Note to everyone: Use of "jenkins build libelektra please" is now mandatory, build jobs do not start by simply pushing. We will inform here, when we revert this setting.

@Mistreated thank you! Let us see if this is enough. I hope pushes to master still trigger the master builds?

Having the hetzner node would be very good, nevertheless. Are there any problems if the node is used by two build servers at the same time? Or if it is a problem: isn't it very easy to simply clone the CT?

@Mistreated thank you! Let us see if this is enough. I hope pushes to master still trigger the master builds?

master branch is now an exception to the following rule:

Suppress automatic SCM triggering

As for

Having the hetzner node would be very good, nevertheless. Are there any problems if the node is used by two build servers at the same time? Or if it is a problem: isn't it very easy to simply clone the CT?

I added a new CT (hetzner-jenkinsNode3).

could not clone the repo: https://build.libelektra.org/jenkins/blue/organizations/jenkins/libelektra/detail/PR-3073/8/pipeline/634/

maybe this has something to do with the new node? (wild guess)

maybe this has something to do with the new node? (wild guess)

this error is on a7

this error is on a7

soooo .. retry?

soooo .. retry?

yeah, I don't think it will happen again.

I will rerun it for you.

@Mistreated I think we can start automatic builds again. But please look at #3160 first.

any idea why this would fail?

go: github.com/google/uuid@v1.1.1: Get https://proxy.golang.org/github.com/google/uuid/@v/v1.1.1.mod: net/http: TLS handshake timeout

https://build.libelektra.org/jenkins/blue/organizations/jenkins/libelektra/detail/PR-2827/8/pipeline/648

any idea why this would fail?

Sounds like it was a temporary problem, the URL is currently accessible from the build agents.

In the long run it would be great to set up those kinds of dependencies in the docker images, to avoid downloading them for each build repeatedly. That should also prevent build failures due to temporarily unavailable packages like the one you mentioned above.

Well then.. jenkins build libelektra please the third

Yes, we have all dependencies directly in the docker images exactly for this reason. I created #3251

I took v2 and a7 offline for rebooting.

@markus2330 if you get a chance, enable hyperthreading on a7.

v2 is up again, on a7 there is still a buildjob.

I took the nodes and added them to the new server. I am gonna let it run overnight. Tomorrow I will return the nodes if there are further errors on the new Jenkins server.

hetzner-jenkinsNode3 will still run on the old Jenkins.

Tomorrow I will return the nodes if there are further errors on the new Jenkins server.

Small build errors are not a reason to switch back. At some point we need to fix the errors; the going back and forth is very time consuming.

What might be a show-stopper, however, is that the new server is not reachable (neither http://95.217.75.163:8066 nor ssh). I pressed the power button, let us see if the machine restarts. We should investigate what the problem was, though.

http://95.217.75.163:8066

If there is time, please enable TLS using letsencrypt, so we don't leak credentials and expose ourselves to various other problems.

Thank you for the input! I would suggest we do this immediately when we switch build.libelektra.org. Otherwise we would duplicate efforts.

Is this error known? Caught the following exception: null

Looks like the formatting check failed and the other builds were terminated.

you're right. what a crappy error message though :P

@Mistreated can you please activate automatic building of PRs again? Due to the many agents the server is now sleeping most of the time.

@Mistreated can you please activate automatic building of PRs again? Due to the many agents the server is now sleeping most of the time.

Done.

I am gonna borrow hetzner-jenkins1 and v2 again for the new server.

You do not need to give them back; I hope we can do the switch today.

Tip: when doing these kinds of switches, it's good to decrease the TTL of the DNS entry to something unusually low (e.g. 60 instead of the current 21599 for build.libelektra.org). After the change has propagated, it lets us switch the DNS entry within a minute instead of hours. If it's too late, it can help to clear the DNS caches of Google and OpenDNS, but some people will inevitably see the old resource until the cached entries expire globally.

EDIT: after the change the TTL should obviously be reverted to some sane value to put less load on the DNS.
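In BIND zone-file syntax the switch could look like this sketch (the addresses are placeholders; only the TTL handling is the point):

; before the migration: lower the TTL so the change propagates quickly
build.libelektra.org.  60     IN  A  198.51.100.1  ; placeholder (old server)
; after the migration has settled: raise the TTL again
build.libelektra.org.  21600  IN  A  198.51.100.2  ; placeholder (new server)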

Even though it is now maybe too late, I switched to $TTL 3600 (in case we need several changes until everything works).

www-new and build-new already exist, pointing to the new server.

I now switched doc.libelektra.org. @Mistreated will fix the publishing. I will look into www-new.libelektra.org

https://build-new.libelektra.org/ and https://www-new.libelektra.org/home should work now.

I'll change all DNS entries now.

All DNS entries are changed.

Unfortunately, certbot fails as it seems to speak with the old server, but this seems to only affect download and community (lesser-used URLs).

So hopefully, during/after the weekend everyone sees the updated DNS names.

@Mistreated please update the publishing of all artifacts: also for the website. Please create a PR to make sure everything is working properly.

The old build server is now shut down.

I need to restart the new server (new kernel and network bridge added).

Server is up again with Linux pve 5.0.21-5-pve.

I scheduled a rescan of all PRs.

server offline due to misconfiguration/bug in pve (/etc/network/interfaces was deleted by GUI?).

The bug was that the renaming of network devices (caused by my action in the GUI) led to a kernel OOPS:

Nov 23 21:32:08 pve kernel: [ 1682.138250] veth4d0199f: renamed from eth0
Nov 23 21:32:19 pve kernel: [ 1693.378374]  __x64_sys_newlstat+0x16/0x20
Nov 23 21:32:19 pve kernel: [ 1693.378380] Code: Bad RIP value.
Nov 23 21:32:19 pve kernel: [ 1693.378382] RDX: 00007fa58b238e20 RSI: 00007fa58b238e20 RDI: 00007fa58ba50d24
Nov 23 21:32:19 pve kernel: [ 1693.378383] R13: 0000000000000294 R14: 00007fa58ba50cc8 R15: 00007ffe65c2b158
Nov 23 21:34:20 pve kernel: [ 1814.210370]  request_wait_answer+0x133/0x210
Nov 23 21:34:20 pve kernel: [ 1814.210374]  fuse_simple_request+0xdd/0x1a0
Nov 23 21:34:20 pve kernel: [ 1814.210378]  ? fuse_permission+0xcf/0x150
Nov 23 21:34:20 pve kernel: [ 1814.210381]  path_lookupat.isra.47+0x6d/0x220
Nov 23 21:34:20 pve kernel: [ 1814.210385]  ? strncpy_from_user+0x57/0x1c0
Nov 23 21:34:20 pve kernel: [ 1814.210388]  __do_sys_newlstat+0x3d/0x70
Nov 23 21:34:20 pve kernel: [ 1814.210392]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Server should be up and running again.

Previous problems remain, though (the build-trigger phrase is not working, see #3268)

@Mistreated master also does not seem to build automatically anymore; I am triggering it manually now.

I am collecting urgent errors in #3268. It would be good if you could also test whether everything works as described in doc/BUILDSERVER.md.

Thank you for reporting!

@Mistreated could you please install Naginator+Plugin #2967 as already discussed many times? (Please make a snapshot before changes to Jenkins.)

hetzner-jenkins1: disk quota exceeded

@Mistreated could you please install Naginator+Plugin #2967 as already discussed many times? (Please make a snapshot before changes to Jenkins.)

Gonna do it today

hetzner-jenkins1: disk quota exceeded

@mpranj I added the new VM as a build agent while hetzner-jenkins1 is down.

I cleaned up some space on hetzner-jenkins1 by running docker system prune -a and enabled it again.

Seems like there is again a problem that lots of stuff is not cleaned up by docker system prune -f. This time the storage driver was not btrfs but vfs. :confused:

I added the new VM as a build agent while hetzner-jenkins1 is down.

The idea now is that we do not use the container anymore but only the VM instead.

I cleaned up some space on hetzner-jenkins1 by running docker system prune -a and enabled it again.

Thank you very much! Can you also make a cronjob there? (on the VM, not on the container).

make a cronjob there?

Done.

@Mistreated could you please install Naginator+Plugin #2967 as already discussed many times? (Please make a snapshot before changes to Jenkins.)

We have to move from pipeline to freestyle jobs if we want the naginator plugin. I am gonna look for alternatives.

Builds on the VM jenkinsNode3VM currently fail. They are unable to docker pull:

unexpected EOF
script returned exit code 1

I disabled it for now until someone can fix the problem.

[cronjob] Done.

Thank you!

We have to move from pipeline to freestyle jobs if we want the naginator plugin. I am gonna look for alternatives.

Yes, good idea. Maybe it is best to simply code that in our Jenkinsfile, so that if a problematic build job/stage fails, it is retried, and so that these docker pulls are tried at least twice, as they are one of the most frequent problems.
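Until that lands in the Jenkinsfile, the retry idea in shell form might look like this (a hypothetical wrapper, not the actual pipeline code):

#!/bin/sh
# retry a flaky docker pull up to three times before giving up
image="$1"   # e.g. hub.libelektra.org/build-elektra-alpine:<tag>
for attempt in 1 2 3; do
  docker pull "$image" && exit 0
  echo "docker pull failed (attempt $attempt), retrying in 30s..." >&2
  sleep 30
done
exit 1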

Builds on the VM jenkinsNode3VM currently fail. They are unable to docker pull:

@Mistreated please fix this.

Builds on the VM jenkinsNode3VM currently fail. They are unable to docker pull

I fixed the docker image that could not be pulled.

Thank you very much! It is always helpful if you write what was wrong and how you fixed it.

I have no idea what was wrong; I built the image manually on the agent, since the agent could not pull it.

As for the Dockerfile (scripts/docker/debian/stretch): my Visual Studio Code says it has 2 empty lines at the end, but vim says it's only one. I don't know whether it has something to do with the mistake above; maybe it's just my VS Code.

Seems like we have problems with our docker registry (#3316 docker pull fails with unexpected EOF).

Since the dust after the release has settled and there are no builds going on, I would suggest stopping everything and trying to clear the registry completely. After that, all images should be re-built, hopefully cleanly. I would back up the registry data before I start, just to make sure, but I hope that a clean start gets rid of some errors we were having.

I'll wait for comments on whether there's anything against that before I start.

I think it's a problem in the

(scripts/docker/debian/stretch)

image, since it is the only one failing.

I've built it manually again, but there certainly is something wrong with the image in the registry.

Jenkins reports: jenkinsNode3VM (offline)

@Mistreated it would be great if you could set up some way of monitoring.

a7 (and therefore v2 and i7) are down. I contacted the admin.

EDIT markus2330: up again

Because of a planned power outage on 08.07.2020 at TU Wien, our admin plans to shut down all build servers on the day before (Tuesday 7.7). Builds will be very slow during that time, so please only push on that day in urgent cases.

Servers are online again, except for i7. I notified the admin.

There was another power outage (unplanned) yesterday in the evening, thus all the build servers are down now. The admin is working on it.

EDIT 30 min later: everything, including i7, is up again :rocket:

@markus2330 it seems that v2 and i7 have lost internet access recently (possibly during the power outage?). Are you aware of any config changes we should have made, since the interfaces are configured statically?

I do not know about any change, only that the computers were turned on again after these two power outages (one planned, the other unplanned).

But you are right, I also see that these two (but not a7) do not have an Internet connection anymore, although they are reachable. I asked our admin about that.

Maybe we disconnect them from Jenkins until this problem is fixed?

Thanks for contacting the admin! I disconnected both i7 and v2 until the problem can be resolved. (Builds don't work anyway because they cannot pull docker images.)

Someone changed something with the router about one week ago. The person was notified and hopefully will fix it soon.

Let's keep them disconnected for now.

The Internet problem is resolved now and I also installed the security updates on these machines.

@mpranj can you please switch them on again?

Thank you @markus2330.

All nodes are now back online.

I rebooted the build server for the latest PVE kernel. Jenkins should be up shortly.

I moved

  • [ ] make sending emails if build fail more reliable
  • [ ] Docker image without Jenkins user
  • [ ] centOS/fedora/arch docker images
  • [ ] centOS packages
  • [ ] freebsd/openbsd/solaris build agents

to #3519 and linked to #3519 above.

@robaerd now also has access to a7/v2/i7 and can contact the admin in the case of troubles.

Just a quick report of build times (main pipeline libelektra):

  • with a7 enabled: 2h 29m 24s
  • with a7 disabled: 1h 35m 45s

Why does Jenkins display that it is going to be shut down? It would be good to always read about it here in advance :wink:

Jenkins is going to be shut down because of a full system backup and a reformat of the filesystems from btrfs to ext4.

Jenkins is up again

Jenkins CI will be offline for maintenance from around 11:15 CET today.

We will perform some backup and cleanup tasks and try to improve performance of a7.

You will be notified again when the maintenance is over.

Jenkins CI and all build servers are up again. a7 should perform a lot better now, but has less storage capacity.

Please report if you experience any errors.

Jenkins CI and the build agents will be offline for a short maintenance/updates. Should be done in a matter of minutes.

EDIT: updates are done.

The server is down, I investigate.

The server is up again.

Official statement about the cause: "There was an issue with the PSU in the adjacent server which have caused the server to shut down; it has now been corrected."

The ssd in a7 is full, causing all builds on it to fail.

I will try to free up some space. Is it safe to disconnect the a7 build agent from Jenkins for now?

Thank you for looking into it!

I will try to free up some space. Is is safe to disconnect the a7 build agent from jenkins for now?

Of course. On the contrary, it would be unsafe to keep it connected if it makes all builds fail.

Running docker system prune -a cleaned up roughly 50% of the space again. Maybe we need to adapt the existing cronjob to add the -a flag?

The jenkins home is using lots of space as well.

The master build (full pipeline with deb packages and website building) is somehow still failing, even though everything looks green. Any ideas?

The upload of the focal deb packages to community fails on the file elektra_0.9.3.orig.tar.gz. It's probably a permission issue on the file. I will remove it from the directory for now and let it be recreated in the next run.

Somehow the sshPublisher does not set the stage to red when it fails.

Maybe we need to adapt the existing cronjob to add the -a flag?

Was there any reason why you didn't do it like this? If not, it sounds like a good idea.

Jenkins CI and the registry on a7 will be offline for the migration of the watchtower image to a new version. v2 will also be cleaned since the disk is full. Should only take a couple of minutes.

EDIT: Updates are done and Jenkins CI is up again

Jenkins and the agents will be down briefly for an update. Everything should be up again in a couple of minutes and this post will be updated/edited.

EDIT: everything was updated and is up&running again. I needed to correct a problem where a7 was using Debian stretch docker packages instead of buster. I also cleaned up some space.

Builds are failing as there is no space on a7.

Build infrastructure will be unavailable for a few minutes for maintenance. Everything should be up&running again within 10 minutes.

Build infrastructure is available again.
