Libelektra: Build Server stuff

Created on 13 Dec 2014  ·  585 Comments  ·  Source: ElektraInitiative/libelektra

This issue gives up-to-date information about the health of the build system.

Report here any permanent problems (that cannot be fixed by rerunning the build job). Temporary problems should be reported in #2967.

Current Issues (ordered by priority):

  • [ ] Continuous Releases (see #3519)
  • [ ] check if make uninstall leaves a clean system, see #1244
  • [ ] check if any temporary files are left over after running test cases
  • [ ] Check for problematic file names: find . | grep -v '^[-_/.a-zA-Z0-9]*$' (see #1615)
  • [ ] add -Werror to build jobs without warnings: #1812
  • [ ] check if the core builds with C99

Less important issues (need discussion first):

  • [ ] integrate link checker (see #1898) [done via cirrus]
  • [ ] add white space to the top-level directory names (above source & build) [done via travis]
  • [ ] simulate too little space (e.g. with limited tmpfs) [needs to be done manually first]
  • [ ] add ninja build (warnings as errors?) [now done via travis on Mac OS X]

Fixed issues:

  • [X] complexity checker: oclint (4 level)
  • [x] remove redundant jobs
  • [x] more build scripts in source?
  • [x] re-adding the -xdg build job (because we lost debian-unstable-mm)
  • [x] RelWithDebInfo in https://build.libelektra.org/jenkins/job/elektra-multiconfig-gcc-stable/203/ skipped?
  • [x] Rename elektra-gcc-configure-debian-optimizations to elektra-gcc-configure-debian-no-optimizations
  • [x] use higher -j on the mm agents (done for libelektra build job)
  • [x] jobs to update a global repo so that not every job needs to refetch the whole source again
  • [x] enable elektra-clang-asan again
  • [x] stretch build agent which builds Elektra debian packages needs webserver
  • [X] have docker variants with minimal dependencies
  • [x] run bashism checker
  • [X] build and install CppCms (build job for cppcms)
  • [X] minimal debian repos
  • [X] fix walking error on some jobs (e.g. doc, todo)
  • [x] gnupg2 on debian-wheezy-mr and debian-stretch-mr
  • [x] fast build in passwd broken?
  • [x] build+source directory should contain spaces, define the names globally -> elektra-gcc-configure-debian-intree

Obsolete/irrelevant Issues [reason]:

  • [ ] Install bash-completion on wheezy node? [wheezy too old]
  • [ ] do not run in PRs, only master is built: elektra-git-buildpackage-jessie/elektra-git-buildpackage-wheezy [wheezy too old]

Hi!

First, thank you for your build agents; they are really fast and greatly contribute to better build times.

There are some missing packages, though:

http://build.libelektra.org:8080/job/elektra-gcc-i386/lastFailedBuild/console

DL_INCLUDE_DIR=/usr/include
DL_LIBRARY=DL_LIBRARY-NOTFOUND
CMake Error at cmake/Modules/LibFindMacros.cmake:71 (message):
  Required library DL NOT FOUND.

  Install the library (dev version) and try again.  If the library is already
  installed, use ccmake to set the missing variables manually.
Call Stack (most recent call first):
  cmake/Modules/FindDL.cmake:18 (libfind_process)
  src/libloader/CMakeLists.txt:6 (find_package)

and in build:
http://build.libelektra.org:8080/job/elektra-gcc-configure-debian/lastFailedBuild/consoleFull

the error there is weird, and additionally:

 Could NOT find Boost
-- Exclude Plugin tcl because boost not found

Most helpful comment

@Mistreated thank you! Let us see if this is enough. I hope pushes to master still trigger the master builds?

master branch is now an exception of the following rule:

Suppress automatic SCM triggering

As for

Having the hetzner node would be very good, nevertheless. Are there any problems if the node is used by two build servers at the same time? Or if it is a problem: isn't it very easy to simply clone the CT?

I added a new CT (hetzner-jenkinsNode3).

All 585 comments

@markus2330

Just pushed a few build system related fixes. But you need to fix some packages on your debian-stable machine as well:

  • Please install qtdeclarative5-dev from wheezy-backports (you can remove /opt/Qt5.3.0 afterwards)
  • Please install Java 8 as a package:

    • Use this method: http://www.webupd8.org/2014/03/how-to-install-oracle-java-8-in-debian.html

    • Let cmake actually find jdk8: cd /usr/lib/jvm/ && ln -s java-8-oracle default-java

    • echo -e "/usr/lib/jvm/java-8-oracle/jre/lib/amd64\n/usr/lib/jvm/java-8-oracle/jre/lib/amd64/server" > /etc/ld.so.conf.d/java-8-oracle.conf && ldconfig

    • kill + restart the local jenkins java process. Otherwise all builds will fail

    • Optional: Remove jdk7

Looks good, thanks for fixing those issues.

I also did those steps on debian-stable agent.

For other machines installing qtdeclarative5-dev was not possible, because it conflicts with qdbus which is needed by kde4. So I restored the previous script configure-debian-wheezy as configure-debian-wheezy-local.

I also added the installation steps you mentioned as notes in the README.md because they might be of interest to others.

Thanks for upgrading the agents!

Stuff that is missing on stable

1.) latex (+ I think texlive-latex-recommended is needed, too)
see http://build.libelektra.org:8080/job/elektra-doc/495/console

-- Found Doxygen: /usr/bin/doxygen (found version "1.8.8") 
CMake Warning at doc/CMakeLists.txt:46 (message):
  Latex not found, PDF Manual can't be created.


-- Found Perl: /usr/bin/perl (found version "5.20.2") 
-- Configuring done
-- Generating done
CMake Warning:
  Manually-specified variables were not used by the project:

    BUILD_EXAMPLES

2.) Can you install clang (for elektra-clang; wheezy's clang won't work)?
3.) Can you install mingw+wine for elektra-gcc-configure-mingw?

apt install --no-install-recommends doxygen-latex + clang + mingw done

Why do you need wine?

Btw, you should change i586-mingw32msvc-X to i686-w64-mingw32-X in Toolchain-mingw32.cmake. Right now this won't work on unstable.

Thank you for docu!

wine is needed to execute the cross-compiled windows binaries (e.g. exporterrors.exe)

I think you installed the mingw that builds for w64. In the mingw32 package, there is still a /usr/bin/i586-mingw32msvc-c++.

A new toolchain file for w64 is nevertheless appreciated.

I installed gcc-mingw-w64-i686 which is the x64 build of mingw with i686 as target.
The package mingw32-binutils is deprecated and not available on unstable any more.
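
A minimal sketch of what such a w64 toolchain file could look like (file name, triplet and root path are assumptions based on the gcc-mingw-w64-i686 package, not a tested setup):

cat > cmake/Toolchain-mingw-w64.cmake <<'EOF'
# sketch: cross-compile for 32-bit Windows with the gcc-mingw-w64-i686 package
set (CMAKE_SYSTEM_NAME Windows)
set (CMAKE_C_COMPILER i686-w64-mingw32-gcc)
set (CMAKE_CXX_COMPILER i686-w64-mingw32-g++)
set (CMAKE_RC_COMPILER i686-w64-mingw32-windres)
set (CMAKE_FIND_ROOT_PATH /usr/i686-w64-mingw32)
# search headers and libraries only in the target environment, programs only on the host
set (CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set (CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set (CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
EOF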

Wine installed on both containers.

Actually, the mingw build is bound to stable, so that should not be an issue.

MinGW-w64 is a fork of mingw and is quite a different target. Nobody has tested it up to now.

thanks for installing wine

Mingw-w64 looks superior. Maybe it's time to move on :-)

Contributions welcome ;) I do not have any machine to test it.

I am afraid you got the wrong wine, it should be apt-get install wine32

see also http://build.libelektra.org:8080/job/elektra-gcc-configure-mingw/218/console

Nope.

root@debian-stable:~# apt-get install wine32
....
E: Package 'wine32' has no installation candidate

ok, dpkg --add-architecture i386 will solve this. But can't you just pin the mingw/wine job to your build machine? The mingw setup is rather special.

Edit: I'll see if I can get elektra build with mingw-w64 so I don't need to install tons of i686 libs.
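
For reference, the full multiarch sequence would be (standard Debian procedure; the thread only records the first command):

dpkg --add-architecture i386
apt-get update              # needed so apt learns about the i386 package lists
apt-get install wine32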

The problem is I do not have a spare jessie machine and wheezy's mingw does not know C++11.

I managed to get mingw-w64 working. However std::mutex is not available because there's no glibc on windows and std::mutex depends on pthreads. Any ideas?

Wow, thanks!

Does it lead to a compilation error? The std::mutex is not used for internal
functionality, but only in a header file to be included by a user. It is used
in test cases though.

One solution for compilation problems is to provide a std::mutex in the mingw case throwing system errors on every attempt to lock/unlock. Actually, I would expect the mingw people to at least provide something like that (e.g. when some macro is set, similar to -D_GLIBCXX_USE_NANOSLEEP).

https://github.com/meganz/mingw-std-threads might be another way. But that is
most likely only useful if all test cases except the ones involving std::mutex
already run.

Basically, this is only one instance of C++11 not properly available.

mingw status right now:

  • added dlfcn-win32 as external project to libloader. this way cmake checks out + compiles the library as an additional build step. I'm linking the archive to avoid additional dll deps.
  • added winsock2.h/ws2_32.dll dependency to cpp11_benchmark_thread. required by gethostname()-call

Right now I'm building with -static-libgcc + -static-libstdc++. Otherwise wine is unable to find the dlls. Additionally, mutex doesn't compile either. I tried mingw-std-threads. Just got more compile errors :-)

If I switch from x86_64-w64-mingw32-X to x86_64-w64-mingw32-X-posix, std::mutex compiles fine, because the pthread stuff is defined. However I get an additional dependency on libwinpthread-1.dll, which wine is unable to find.

I think our best bet is using x86_64-w64-mingw32-X-posix though.

Again, I am surprised that you even have this problem. Up to now we were happy when we got a libelektra.dll.

I cannot say anything about this x86_64-w64-mingw32-X-posix decision, because I do not use it and do not know the implications. I am surprised that such a posix lib even exists; I thought that the posix-layer approach is cygwin and not mingw.

Does this decision even have an effect on libelektra.dll? If it's only for the test cases, no one will care (as long as the build server is able to run it). If the test cases run, it will be a huge benefit. (See #270 where the unit tests unveiled some strange bugs on Mac OS X)

It seems like libwinpthread-1.dll can be downloaded; I do not know if it works with wine though. Can you also add it as an external project like done with dlfcn-win32 (so that all dlls are handled in the same way)? Otherwise, whether you need to download one or three dlls for the tests might not really matter (again, I am no user, and do not understand the deployment concept, if there is any, of windows dlls).

@beku What do you think? Do you have time to test our latest 0.8.13 mingw-w64 build on Windows together with oyranos?

Are tests usually enabled for the mingw build job? Yesterday all of them were disabled.

Yes, they were disabled. But afaik examples/benchmarks like cpp11_benchmark_thread were disabled, too. So I thought you changed it and compiled more than was done previously.

I compiled the whole repo with C++11 enabled. Nothing more.

But executables like bin/basename.exe built with -posix run fine as long as you copy the required dlls to the bin directory (thank you windows for not having RPATH). I haven't found a way to a) let cmake find the dll directory + b) point wine to the dll directory.
I thought static linking would work, but then the build fails with duplicate symbols while linking the elektra dll, because the dll already has the symbols included.

@markus2330 I managed to get elektra to compile with mingw + running with wine without copying any dlls. The trick is to always enable static linking for both executable AND shared objects (CMAKE_SHARED_LINKER_FLAGS/CMAKE_EXE_LINKER_FLAGS => "-static").

To work around duplicated symbols I've added version-scripts for libelektra and libelektratools. This way only our symbols get exported.
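
Spelled out as a configure call, the setup described above amounts to something like this (a sketch; the toolchain file name is a placeholder):

cmake -DCMAKE_TOOLCHAIN_FILE=../cmake/Toolchain-mingw-w64.cmake \
      -DCMAKE_EXE_LINKER_FLAGS=-static \
      -DCMAKE_SHARED_LINKER_FLAGS=-static \
      ..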

This works really fine. e.g.

$ wine64 ./bin/kdb-static.exe
Usage: Z:\home\manuel\build\bin\kdb-static.exe <command> [args]

Z:\home\manuel\build\bin\kdb-static.exe is a program to manage elektra's key database.
Run a command with -H or --help as args to show a help text for
a specific command.

Known commands are:
check   Do some basic checks on a plugin.
convert Convert configuration.
cp      Copy keys within the key database.
export  Export configuration from the key database.
file    Prints the file where a key is located.
fstab   Create a new fstab entry.
get     Get the value of an individual key.
[...]

$ wine64 bin/cpp_example_iter.exe
user/key3/1
user/key3/2
user/key3/3

Even bin/cpp11_benchmark_thread.exe works.

Other things just crash:

$ wine64 ./bin/kdb-static.exe get
wine: Unhandled page fault on read access to 0x00000000 at address 0x7fd0e8b62c8a (thread 0009), starting debugger...
Application tried to create a window, but no driver could be loaded.
Make sure that your X server is running and that $DISPLAY is set correctly.
Unhandled exception: page fault on read access to 0x00000000 in 64-bit code (0x00007fd0e8b62c8a).
Register dump:
 rip:00007fd0e8b62c8a rsp:000000000033f428 rbp:0000000000000000 eflags:00010293 (  R- --  I S -A- -C)
 rax:0000000000000000 rbx:000000000033f700 rcx:0000000000000000 rdx:000000000033f5b0
 rsi:0000000000000000 rdi:0000000000000000  r8:0000000000000000  r9:0000000000000072 r10:0000000000000000
 r11:000000000003f615 r12:000000000033f5b0 r13:00000000000373b0 r14:0000000000000000 r15:000000000033f930
Stack dump:
0x000000000033f428:  00007fd0e748ea93 0000000000000000
0x000000000033f438:  0000000000000000 0000000000000000
0x000000000033f448:  0000000000000028 0000000000010020
0x000000000033f458:  8d98315017c96400 6f46746547485300
0x000000000033f468:  687461507265646c 0000000000000000
0x000000000033f478:  0000000000000000 0000000000000000
0x000000000033f488:  000000000003fab0 0000000000030000
0x000000000033f498:  8d98315017c96400 6f46746547485300
0x000000000033f4a8:  687461507265646c 0000000000000000
0x000000000033f4b8:  0000000000000000 0000000000000000
0x000000000033f4c8:  0000000000000000 0000000000000000
0x000000000033f4d8:  0000000000000000 0000000000000000
Backtrace:
=>0 0x00007fd0e8b62c8a strlen+0x2a() in libc.so.6 (0x0000000000000000)
  1 0x00007fd0e748ea93 MSVCRT_stat64+0x92() in msvcrt (0x0000000000000000)
  2 0x00000000004744af in kdb-static (+0x744ae) (0x000000000003f9d0)
  3 0x000000000043bda5 in kdb-static (+0x3bda4) (0x000000000003f9d0)
  4 0x0000000000431d76 in kdb-static (+0x31d75) (0x00000000000360a0)
[...]

Right now I've simply added the version-script stuff without thinking about other compilers. Shall I continue my work or are you not interested?

crashes in src/plugins/wresolver/wresolver.c because pk->filename is NULL

pk is of type resolverHandles.user

I tried to take a look at the plugin but I fail to understand the for-loop in elektraWresolverOpen. The loop calls elektraWresolveFileName --> elektraResolve{Spec,Dir,User,System} which all malloc resolverHandle->filename and therefore leak memory.

Thanks for pointing that out! The code is obviously broken since its introduction in c87ae8e87a716b02b2c7ed790ad56a89d95547a9.
During the loop, only the system handle was ever initialized. This led to crashes when another namespace was used.

I fixed it in
edb4d50856bb5331749220de5a83fa2062624a9d

About continuing work: On the one hand, it would be nice if the compiled stuff also runs. On the other hand, the release should happen this weekend, so a pull request would be important soon (there should be at least a chance of a short feedback cycle, e.g. about what the version-scripts actually do)

But imho it's enough if only one variant (the static compilation) works. Great to see the kdb tool running!

Where can I find edb4d50856bb5331749220de5a83fa2062624a9d?

edb4d50856bb5331749220de5a83fa2062624a9d was pushed a bit later.

Which gcc versions are installed on debian-unstable-mm?

http://build.libelektra.org:8080/job/elektra-multiconfig-gcc-unstable/build_type=Release,gcc_version=5.2,plugins=ALL/56/console

says there is no gcc-5.2

Can you install as much compilers as possible, please?

In some issue or PR I've said that I've removed all compilers except the latest.
Edit: gcc 4.9 on stable, 4.9 + 5.x (default) on unstable

Please do these kinds of tests (I find them highly unnecessary anyway) on your own containers. Mine won't stay forever anyway.

I have not read that. They have maybe 50MB each. Could you please install them again and answer the first question?

Maybe I told you in our meeting. But I've definitely told you.

debian-unstable:~ # gcc -v 2>&1 | tail -n 1
gcc version 5.2.1 20150903 (Debian 5.2.1-16)

The version specific binary is called gcc-5. No separate package for minor versions anymore. So your multiconfig-gcc with this level of detail is kind of obsolete. I recommend removing gcc 4.7 and replacing gcc-5.2 with gcc-5 and be done.

The only additional compiler available I haven't installed is gcc-4.8. And gcc-4.8 has already been tagged for removal.

Thanks for the info! Seems like the glory days of many available compilers are over.

I fixed multiconfig-unstable.

I will close for now, thanks for the excellent agent setup.

Hello, jessie (stable) needs some more packages. Could you please install:

  • [ ] fakeroot
  • [ ] gpg (+ create key for Autobuilder [email protected])
  • [ ] reprepro (maybe already installed, the script did not get that far)
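
For the key, an unattended generation along these lines should do (a sketch; the parameters match the 2048-bit RSA key shown further below, and the mail address is elided in this rendering):

gpg --batch --gen-key <<'EOF'
# the address below is elided in this rendering of the thread
Key-Type: RSA
Key-Length: 2048
Name-Real: Autobuilder
Name-Email: [email protected]
Expire-Date: 0
%commit
EOF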

fakeroot installed; gpg + reprepro are already installed.
Can you mail me your existing gpg key? So both build machines have the same one.

It's ok to have different gpg keys. I am not sure if the current setup uses them at all, so first wait to see if http://build.libelektra.org:8080/job/elektra-git-buildpackage-jessie/2/ fails.

  • debhelper + libsystemd-journal-dev installed
  • python-dev is a wrong dependency. It should be python2.7-dev or python3-dev or both
  • why do we need python-support?

Thanks for installation!

python-dev is available for Jessie, and python-support, too. Please install them.

I tested it locally, when these packages are installed, it builds for jessie.

Sure, it's available but it's a wrong dependency. python-dev depends on python2.7-dev which is _not_ sufficient. Instead python2.7-dev + python3-dev is required.

python-support isn't required at all imho.

I do not know why the dependencies were chosen this way, most of the packaging was done by @iandonnelly during gsoc.

Yes, the packages should be updated to build the python3 bindings, too. Currently, it's simply not done. Nevertheless, you can install python3-dev, so that the build won't break (when python3 bindings+plugins are added to the debian package).

That doesn't mean they are correct :-) - I'm fairly sure about the python-dev deps.
Can you please replace them and remove the python-support dep?

python3-dev and python2.7-dev are already installed. Otherwise no binding would build.

Btw. the official debian package from @pinotree builds python3-only. It would be a waste of time to fix what's in our "debian" branch, the work of @pinotree is superior anyway.

When I find time, I will update our "debian" branch to what @pinotree has done in the official package. He already allowed us to do so. I will wait for the qt-gui update, currently there is no hurry to change. And having python2 support would be important for one installation (where cheetah is used, which won't work with python3).

I've never said I'll remove the python2 packages. All I'm saying is python-dev is an inaccurate dependency. We require explicit versions. So pythonX-dev is the correct dep to use.

Hopefully pinotree worked out the dependency correctly.

Btw, cheetah is dead. Don't use it.

Ok, replaced it. Please revert b7c266b36b0ab0fad9120e67a457b580c7c44690 and install python-support if it is needed after all.

I am sure pinotree did it correctly ;)

And it says: dpkg-checkbuilddeps: Unmet build dependencies: build-essential:native
http://community.markus-raab.org:8080/job/elektra-git-buildpackage-jessie/3/console

installed

python-dev is a wrong dependency. It should be python2.7-dev or python3-dev or both

  • python-dev installs the development package for the default Python 2 version; since Wheezy, this is Python 2.7
  • python3-dev installs the development package for the default Python 3 version; Python 3.2 in Wheezy, 3.4 in Jessie, and so far still 3.4 in Stretch (I guess soon will be 3.5)

So, if you want to build against the default Python 2/3 version, use respectively python-dev/python3-dev, not the pythonX.Y-dev versions (which you need to use when you explicitly want a precise Python version installed, even if not the only one installed on the system, and not the default one). Using either is what I recommend.

from python-dev description:
This package is a dependency package, which depends on Debian's default Python version (currently v2.7).

According to this text, python-dev may well depend on python3 sometime soon.

Furthermore: There never will be another python2 version. So python2.7-dev will be the last python2 dev package ever.

Depending on python3-dev is what I said.

Now only the key is missing:

gpg: new configuration file `/home/jenkins/.gnupg/gpg.conf' created
gpg: WARNING: options in `/home/jenkins/.gnupg/gpg.conf' are not yet active during this run
gpg: keyring `/home/jenkins/.gnupg/secring.gpg' created
gpg: keyring `/home/jenkins/.gnupg/pubring.gpg' created
gpg: skipped "Autobuilder <[email protected]>": secret key not available
gpg: /tmp/debsign.DlSdnFtB/elektra_0.8.13-1.41.dsc: clearsign failed: secret key not available
debsign: gpg error occurred!  Aborting....
gpg: checking the trustdb
gpg: 3 marginal(s) needed, 1 complete(s) needed, PGP trust model
gpg: depth: 0  valid:   1  signed:   0  trust: 0-, 0q, 0n, 0m, 0f, 1u
pub   2048R/08C91995 2015-09-30
      Key fingerprint = BA4C 688E 9071 FD3F 57ED  E9D6 D0A9 EDB9 08C9 1995
uid                  Autobuilder <[email protected]>
sub   2048R/E69F110A 2015-09-30

done

Thank you!

Please export /home/jenkins/repository via http.

cannot access /home/jenkins/repository: No such file or directory ?

@manuelm Could you please install ronn on the agents? (needed for generating man pages)

apt-get install ruby-ronn

done

Thanks, jessie packages build again, and man pages are now included!

Please install musl, i.e.

apt-get install musl musl-dev musl-tools

Thank you!

musl installed and agents upgraded

Two important things about the build server:

  1. Do not create new empty jobs, but rather duplicate existing ones; they have correct settings (except what's mentioned in number 2.).
  2. We should use reference clones (in /home/jenkins/libelektra) or prefer shallow clones for every build-job (currently done only for some, e.g. elektra-clang). Currently the traffic is >300MB on commits because of the many unnecessary reclones.

@mpranj It would be great if you can fix 2.

@markus2330 just to make sure: I should just apply the same clone behaviour to all build jobs like it is in elektra-clang?

Shallow clones applied to all build-jobs except:

  • [ ] elektra-git-buildpackage-jessie
  • [ ] elektra-git-buildpackage-wheezy
  • [ ] elektra-multiconfig-gcc-stable
  • [ ] elektra-multiconfig-gcc-unstable
  • [ ] elektra-source-package-test

These jobs check out to some sub-directory. Wasn't sure what you want there so I'll leave them as they are for now.

Thank you! Yes, they need the complete history and branches; shallow clones make no sense there, but the reference clone repository would be useful.
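
On the git side this amounts to roughly the following (a sketch using the /home/jenkins/libelektra path mentioned above; the workspace name is arbitrary):

# one-time mirror on each agent, to be refreshed periodically (e.g. by a cron-like job)
git clone --mirror git://github.com/ElektraInitiative/libelektra.git /home/jenkins/libelektra
# per-job clones borrow objects from the mirror instead of refetching everything
git clone --reference /home/jenkins/libelektra \
    git://github.com/ElektraInitiative/libelektra.git workspace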

Jenkins was updated to 1.651.2. Also all plugins were updated.

I will keep the issue open for the reference clone repos. We should also have "cron jobs" which update the repos from time to time, ideally using jenkins itself.

Jenkins stopped building some jobs (since the update apparently). It fails with
ERROR: Couldn't find any revision to build. Verify the repository and branch configuration for this job.

Thanks for the info. I try to downgrade github request builder from 1.31 to 1.14.

Now it seems stuck when setting the build status for the Github commit. It does warn that this is deprecated in the config.

I also tried to downgrade every plugin with *git* in its name, but then there were still errors (strange error related to Mailer Plugin, downgrade of Mailer Plugin did not help). So I updated everything to recent versions again. The problem seems to be a known issue upstream:

https://github.com/janinko/ghprb/issues/347

I hope they will fix it soon.

Another question: Does someone know how to run multiple jobs for every PR? (I would like to run both elektra-mergerequests-stable and elektra-mergerequests-unstable)

The elektra-test-bindings job is working fine with parameterized builds (as also described in the upstream ticket). Couldn't we just switch it to parameterized builds? The bug has been reported upstream for a while, I don't see it fixed soon.

Good idea, we could change all PR jobs to parameterized builds, it actually has only advantages. It allows us to run the jobs manually by specifying a branch, too. And it also can be used for regular build jobs.

Ideally, every job could be executed by github PRs, too. (Except those specifically for non-PR tasks that update docu or coverage of the master branch)

A disadvantage of elektra-test-bindings' config is that it only does polling and takes quite long until it starts building (up to 5 min). I do not want to activate "Use github hooks for build triggering", however, to not break the build job.

Btw. are you sure that the "shallow clone" option is okay for the github pullrequest builder jobs?

I wonder how github picks the build job it uses for new PRs. Why are elektra-test-bindings and elektra-ini-mergerequests never selected for a new PR? Why is it sometimes elektra-mergerequests-unstable and sometimes elektra-mergerequests(-stable)?

@manuelm do you have any idea?

Btw. somehow the communication between finished build jobs and github is severely impaired (even for elektra-test-bindings). It now says on nearly every build "Some checks haven’t completed yet".

A disadvantage of elektra-test-bindings' config is that it only does polling and takes quite long until it starts building (up to 5 min).

And this is a problem because? Testing takes more than 5 minutes anyway.

Why is the elektra-test-bindings and elektra-ini-mergerequests never selected for a new PR?

Why should it? elektra-test-bindings gets triggered by the "Trigger phrase" only. No idea what elektra-ini-mergerequests is.

Why is it sometimes elektra-mergerequests-unstable and sometimes elektra-mergerequests-stable?

The -stable/-unstable are new? I'm not sure triggering multiple jobs per new PR is possible. I would do subjobs.

Btw I've said this a few times already, but I think the number of jobs is getting ridiculous and a sign of a messed-up config. But criticizing is always easier than solving something.

The 5 min is a problem when you want to debug the build server. And I still hope that we get a quick first test sometime, taking about 5 min.

Ahh, ok, I missed the option "Only use trigger phrase for build triggering". The config for the github request builder is really a mess.

Someone talked about github projects where they have multiple jobs running for every PR. (Displayed individually)

What is a subjob? Do you mean multijob?

Someone talked about github projects where they have multiple jobs running for every PR. (Displayed individually)

You'll have to add two services on github.

What is a subjob? Do you mean multijob?

Yeah multijob.

btw, what about https://docs.travis-ci.com/ ? Travis has support for OSX.

I know it won't replace jenkins but might replace the PR/on every commit builds. Jenkins can still do the multiple compiler/etc.. testing.
Edit: Travis even has gcc + clang.

Agreed, it would be interesting to use their CPU power/electricity for free as elektra is open source.

It is likely that the connection between github and jenkins actually is 1:1. In the github service I entered http://build.libelektra.org:8080/github-webhook/ and I did not find a way to create another URL in jenkins. (I only found a way to specify an override, but this did not create a new URL.)

In https://github.com/janinko/ghprb/issues/142 they discuss that it should "just work"? (Without adding multiple services)

The sha1 problem, however, should be solved now. It was broken because Jenkins introduced a new security measurement which prunes unknown environment variables. I fixed it as suggested (added -Dhudson.model.ParametersAction.safeParameters=ghprbActualCommit,ghprbActualCommitAuthor,ghprbActualCommitAuthorEmail,ghprbAuthorRepoGitUrl,ghprbCommentBody,ghprbCredentialsId,ghprbGhRepository,ghprbPullAuthorEmail,ghprbPullAuthorLogin,ghprbPullAuthorLoginMention,ghprbPullDescription,ghprbPullId,ghprbPullLink,ghprbPullLongDescription,ghprbPullTitle,ghprbSourceBranch,ghprbTargetBranch,ghprbTriggerAuthor,ghprbTriggerAuthorEmail,ghprbTriggerAuthorLogin,ghprbTriggerAuthorLoginMention,GIT_BRANCH,sha1 to /etc/default/jenkins).
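
With the Debian package this presumably ends up in the JAVA_ARGS line of /etc/default/jenkins, roughly like this (the parameter list is abbreviated here; see above for the full value):

# /etc/default/jenkins (sketch; safeParameters list shortened)
JAVA_ARGS="$JAVA_ARGS -Dhudson.model.ParametersAction.safeParameters=ghprbActualCommit,...,GIT_BRANCH,sha1"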

About usage of additional build servers: Yes, go ahead. It also solves the issue of multiple build jobs for a single PR ;) I never used travis-ci, so I cannot say anything about it. I gave travis permission to access ElektraInitiative.

First travis build: https://travis-ci.org/ElektraInitiative/libelektra/builds/130425147
I think we need some yaml file so that travis knows what to do.

And I figured out how to do multiple jenkins jobs per PR, a different context for every build-job was needed. In the next meeting we discuss what the "fast" and other build jobs should do.

I'm working on travis (or checking some things out)

Have fun. Travis also added a github service, so I guess that a PR will be built with travis, too.

I'm already swearing loudly

-- Could NOT find JNI (missing: JAVA_AWT_LIBRARY JAVA_JVM_LIBRARY JAVA_INCLUDE_PATH JAVA_INCLUDE_PATH2 JAVA_AWT_INCLUDE_PATH)
-- Exclude Plugin jni because jni not found

I'm unable to get the java plugin configured correctly. However the Java bindings work. On debian unstable. Any ideas? Looking at the cmake module doesn't help much.

Edit: /usr/lib/jvm/java-8-openjdk-amd64/include/linux/jni_md.h, /usr/lib/jvm/java-8-openjdk-amd64/include/jawt.h and /usr/lib/jvm/java-8-openjdk-amd64/include/jni.h is in place

Edit 2: Got it. JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64/ ....
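
Spelled out, the working configure call is presumably:

# point cmake's FindJNI at the JDK root so jni.h and jawt.h are picked up
JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64/ cmake ..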

https://travis-ci.org/manuelm/libelektra/builds/130638376

Debian unstable built within a docker container. But building takes ages.
Any good ideas?

clang is often faster regarding build time, but i think the installation of the dependencies is what takes a large amount of time

Isn't there a more minimal debian docker image than the one used? It seems a lot of packages get installed that should not be needed.

@manuelm probably the dist-upgrade. a lot of packages get updated that are desktop specific like wayland

No. dist upgrade is short. maybe a minute. about 50% of the time is taken by installing the build deps.

I'm pushing the build image to hub.docker.com right now. Hope that will speed things up. But the image is 1.9 GB.

Elapsed time 14 min 8 sec

Not sure if we can do much better

like i said, clang maybe gets us 2-3 mins. at least it does for the aseprite project
https://travis-ci.org/aseprite/aseprite

It would be useful to have both compilers anyway.

Just had an idea while preparing work stuff: What if we extract the paths of all commits in the push request and build bindings/plugins only if they are affected? e.g.

  • change in cmake/* triggers everything (plugins + bindings)
  • change in src/bindings/foo triggers binding foo
  • change in src/plugins/foo triggers plugin foo
  • a change anywhere else doesn't compile any plugins + bindings

We still have the daily/twice a day full build on jenkins.
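
A rough sketch of such a filter (TRAVIS_COMMIT_RANGE is the range travis provides for a push; build_everything stands in for whatever triggers the full build):

changed=$(git diff --name-only "$TRAVIS_COMMIT_RANGE")
if echo "$changed" | grep -q '^cmake/'; then
    build_everything    # hypothetical helper: plugins + bindings
else
    # list only the plugins/bindings whose directories were touched
    echo "$changed" | sed -n 's|^src/\(plugins\|bindings\)/\([^/]*\)/.*|\1/\2|p' | sort -u
fi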

@manuelm good idea, @tom-wa would write such a script, can you create a new issue for this?

@mpranj: Reminder: add Mac OS X builds to travis and add mingw builds to PR. (*BSD seems to be more effort)

@markus2330 now I understand @manuelm's docker approach: travis will not support ubuntu 16.04 till next year, so docker is needed to get all the dependencies that ubuntu 14.04 does not have (swig3.0, libsystemd-devel).

I'm sorry I couldn't attend the meeting today. At work we're still preparing a big software rollout today, so I can't leave the office. But within a short delay I can answer e-mails.

I've started to add OS X builds for travis 2 days ago: https://github.com/manuelm/libelektra/blob/e41ac43a18e5e9f9640a4042a313cc43f2704f65/.travis.yml
Build is here: https://travis-ci.org/manuelm/libelektra/builds/130898079
Open things here:

  • [ ] crypto_openssl fails to compile
  • [ ] bindings tests fail
  • [ ] no java

I'm happy if anyone wants to take over my work from here. I don't have OS X, and the waiting time for travis to inspect the OS X system adds up very fast.

re: docker: Yeah, travis default ubuntu version doesn't work well. Even cmake is too old.
Fetching the uploaded docker image takes only about 3 mins. And adding more images is a no-brainer. So I think that's a nice way to work around any pitfalls the default travis linux environment has (or might have after an update).

I haven't figured out a good way to integrate the different build and test phases of libelektra (cmake + make + make test) with docker (build + run) + travis (before_install, before_script, script). Docker containers exit after the command completes. Since docker containers are meant to be throw-away, you cannot resume afterwards. So your disk/compile state vanishes, unless you mount a local directory into the container. Will continue to work on docker next week.
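
One way around this is mounting the checkout into the container and running all phases in a single docker run, so the build tree lives outside the throw-away container (a sketch; the image name is hypothetical):

# build state survives in the mounted workspace, not in the container
docker run --rm -v "$PWD:/elektra" -w /elektra buildelektra/sid \
    sh -c 'mkdir -p build && cd build && cmake .. && make && make test'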

@manuelm Great, you got further than we thought. Mac OS X for PRs and per commit would be really great. A lot of people are using Mac OS X now and I do not want to break the build for them again and again. In the meeting today @mpranj said he would pick up your work. Do you want to create a PR with the travis file?

No, as the travis file still has to be modified afterwards. Otherwise it will build OS X only. I would rather prefer if @mpranj takes up my travis file and fixes the remaining OS X related issues. I'll then take his travis file, convert it into a matrix build and integrate the linux/docker builds + #730 (if available by then)

PS: please do travis testing in a user-namespaced repo. You'll do a lot of pushes :-)

mingw64 builds on PR added, should work. Sorry for the delay. I'll look into travis today!

Is there a downside of enabling build jobs to be triggered with a phrase in a PR (with the Github PR builder)?

I'd like to configure the jobs from #745 so I can test whether I fixed it, but I can apply it to most/(all) build jobs.

EDIT: I'd rather not automatically start all jobs, we have quite a few already.

I think its a good idea if we can configure every job to be triggered with a phrase. I think there is a small downside (at least for elektra-test-bindings): you have to enter for which branch you want to build and cannot simply press "build job now". Would be great if you find a solution for that.

And you are right that we should rather reduce the automatic jobs.

There's actually a very simple solution. We're using the (env)variable sha1 to build PRs. Parameterized builds prompt you for the value, whether a default value is set or not.

Solution: set the env-variable sha1 to master (in the jenkins config itself) and disable parameterized builds. If there's no objection to setting the variable, this would solve exactly what you mentioned above @markus2330.

I have already set it, so you can hit that build button on e.g. elektra-mergerequests and it will start building master.

Yes, this is a very good solution, I like it. It would also allow us to build a release branch with a single switch (if we need one in the future). Until then "master" is always the correct choice if not executed from within a PR.

I think it would also solve the problem of the filtered environment variable we had earlier.

Then we can also think about reducing the build jobs (no duplication for -mergerequest build jobs) and a new consistent naming schema. (Suggestions can be made here.) There might be one open problem: currently we build coverage, docu, etc. for both PRs and master and copy them to separate places. If we merge the build jobs, we need a way to distinguish master/PR within the job, to copy coverage and docu to different places.

I'm almost done applying this to all jobs (but the server _just_ got really slow).
Didn't apply to jobs which build wildcards ** (doc and some others, but very few)

You can always stop build jobs when you want to work on them if you restart them later (except around release time). Usually, jenkins itself is the reason for the slowdown of the machine. At the moment an rsync from a backup might be the problem, but it is urgent.

Yeah no problem at all, it should be done but I'll make some last checks.

The news @ElektraInitiative/elektradevelopers:

  • as mentioned, almost _everything_ can now be built from PRs and/or by just hitting the build button
  • the trigger phrases are always the job name without the elektra- prefix. (e.g. elektra-clang: jenkins build clang please) I did not change jenkins build please and other old phrases for legacy reasons
  • the github build status message is always exactly the build job name

Thank you, well done! Please update doc/GIT.md so that everyone knows which phrases are working now.

(I hope the @mention works for a single message only and not everyone reads every message we write here)

The Mac OS X build for xcode 6.1 seems to be broken:
https://travis-ci.org/ElektraInitiative/libelektra/jobs/138919488

I triggered a re-build for that one but it seems to me like a temporary travis failure.

Can you document how to retrigger a build for a PR? I did not know it is possible.

Directly on travis-ci.org, using your link above:
(screenshot showing the restart build option)

I doubt this is document-worthy but I can do it nevertheless.
The build is still not working solely because of the git checkout. I don't think this is our fault.

Ah. I think you merged it before the build was triggered/started in the first place.
When I rebuild other previously successful PRs, it's broken also.

This is more of a travis problem than anything else.

Ok, thank you for investigating.

@manuelm debian-stable-mm seems to be unreachable (for both jenkins and for me from TU network). Could you please investigate?

Jul 07 15:14:37 <hostname> systemd-nspawn[544]: [  OK  ] Removed slice User and Session Slice.
Jul 07 15:14:37 <hostname> systemd-nspawn[544]: [  OK  ] Stopped target Graphical Interface.
Jul 07 15:14:37 <hostname> systemd-nspawn[544]: [  OK  ] Stopped target Multi-User System.
etc..

Looks like someone stopped the container. I've started it again.

btw, starting tomorrow morning I'll be away from home until August 1. I'm still reachable by e-mail but expect a short delay.

Thank you for the quick fix! So I suppose you also won't be here for the next meetings.

Yep

Some jobs have the error:

Seen branch in repository origin/debian
Seen branch in repository origin/kdb_import_man
Seen branch in repository origin/master
Seen 3 remote branches
FATAL: Walk failure.

e.g. http://community.markus-raab.org:8080/job/elektra-icheck/lastFailedBuild/console http://community.markus-raab.org:8080/job/elektra-doc/lastFailedBuild/console

It might be caused by a jenkins update or @KurtMi creating the kdb_import_man branch?

Note to myself: cppcms needs to be installed.

Sorry for the branch, I made a PR directly on the github page.

Is it easier to create PRs this way? Doesn't github offer to delete the branch after it was merged?

The change was so minimal that I got lazy. For very small fixes yes, but apparently the branch will not get deleted afterwards. I have not seen any option to delete the branch after merge.

I think the unstable build is broken:

Cloning the remote Git repository
Cloning repository git://github.com/ElektraInitiative/libelektra.git
 > git init /home/jenkins/workspace/workspace/elektra-mergerequests-unstable # timeout=10
Fetching upstream changes from git://github.com/ElektraInitiative/libelektra.git
 > git --version # timeout=10
 > git -c core.askpass=true fetch --tags --progress git://github.com/ElektraInitiative/libelektra.git +refs/heads/*:refs/remotes/origin/*
ERROR: Error cloning remote repo 'origin'
hudson.plugins.git.GitException: Command "git -c core.askpass=true fetch --tags --progress git://github.com/ElektraInitiative/libelektra.git +refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout: 
stderr: fatal: The remote end hung up unexpectedly

Full log

@KurtMi unstable works again (as do most builds), but the Walk error persists on some of the simpler build jobs. It seems like the branch is still available somewhere, maybe in a cache on the build servers?

 > git -c core.askpass=true fetch --tags --progress git://github.com/ElektraInitiative/libelektra.git +refs/heads/*:refs/remotes/origin/* --depth=1
Seen branch in repository origin/debian
Seen branch in repository origin/kdb_import_man
Seen branch in repository origin/master
Seen 3 remote branches
FATAL: Walk failure.
org.eclipse.jgit.errors.RevWalkException: Walk failure.

@mpranj Maybe we should add more scripts in-source, this would allow easier updates for every build job. In #806 we found yet another bug with spaces in the build directory, so we should globally (for every build job) add spaces to the build directory. I would prefer if we can add a script jenkins-setup which exports some useful variables (such as export HOME="$WORKSPACE/user space") and does

mkdir "build space"
cd "build space"

Furthermore we should create build jobs that update one global repo. For individual tasks, see above.

The global repo could definitely help reduce bandwidth. Build scripts in-source could also be a good idea, at least they would be tracked by git.

I'm not a fan of spaces in paths, but sure.

fast build in passwd broken?

The fast build job is annoying, I'll try to remove kdberrors.h on every build and see if it works more smoothly then. In the long run @manuelm's proposal in #730 is the best solution: we should simply check how the source got updated, and based on this take appropriate measures.

I think #894 fixes the fast build, too; I will comment out the line removing kdberrors.h.

Some jobs are broken, for example the html doc job.

@mpranj Do you have time to look at it?

@markus2330 Done. The remaining build failures don't seem build system related.

@mpranj Thank you! What did you do to fix it? I think it would be useful if we also collect solutions to build server problems here.

I changed in "Source Code Management" > "Git"
"Branch Specifier" value "**" to "${sha1}"

This is what we use in the other jobs too. This allows to trigger the build by button (branch defaults to master) or by github PR builder (sha1 of commit).

I recall setting the ENV variable "sha1" to "master" once. It seems missing now, but the jobs work fine so let's ignore that.

I think we would be able to speed up the builds a lot more by using Object Libraries more frequently. A lot of object files are compiled multiple times. We would only have to make sure that the compile flags are the same for every place we use the object libraries in, but I guess this should be easily possible.

An example where it could make a big difference is KDB in my opinion.

@Namoshek Please only post build _server_ stuff here, not about build system in general. Object libraries are already used for plugins, but for different variants different object libraries are needed nevertheless (because of different compiler flags). But please report concrete suggestions as separate issue (Do you mean kdb tools?).

Jenkins was upgraded to 2.7, all plugins upgraded, recommended plugins were added:

  • Pipeline (installation seems to have failed?)
  • GitHub Organization Folder Plugin

and some plugins uninstalled:

  • Branch API
  • CVS/SVN (seems to be no longer essential)

Additionally, ruby-dev was installed on every agent.

I updated the "Current Issues" in the top post. It would be important that Elektra also compiles without any dependencies installed, so we should check this with build server agents that do not have any dependencies installed (except cmake and build-essential). FreeBSD and OpenBSD build agents, however, are important, too ;)

@mpranj Do you have any idea what is wrong with elektra-multiconfig-gcc47-cmake-options? They have "fatal: reference is not a tree:" errors all over. The job has "sha1" in its config?

I made the multiconfig independent of concrete compilers (there are enough other build jobs for specific compilers), so they should be able to run on any agent.

@markus2330 No idea. I didn't change anything and just:

  • manually triggered a build from master
  • triggered a build from github

Both builds were able to check out the tree and start building.
So: I can't reproduce it.

One idea: travis had problems when there was a PR and you merge it before travis could do the clone. Maybe something similar happened with elektra-multiconfig-gcc47-cmake-options since a build there takes ~3hrs.

Pushing artifacts to doc.libelektra.org works again, Jenkins and Plugins were upgraded.

I updated the new build server URL https://build.libelektra.org in the github Webhooks. So hopefully the next PRs will be built again.

Jenkins home is almost full. It also seems not to be building PRs.

Jenkins home is almost full.

Thanks I resized it.

It also seems not to be building PRs.

Do you have any idea what could be wrong here? Manual triggering seems to work?

Publishing docu to doc.libelektra.org:12025 failed for the builds. I restarted the ssh server (on the build-homepage agent) and it seems to work again.

The vserver for *.libelektra.org seems to be unreachable. I reported it at hetzner.

The reason for shutting down the network connection from the container was that libelektra.org was compromised. See #1505 for more information.

It would be great if we can add a git-build-package for stretch, there are more and more places popping up where we would need debian packages built for stretch.

@BernhardDenner did you look through the top post? Is there something you can easily do? Is there something @mpranj needs to consider when improving build server jobs in the future?

As requested by @sanssecours I (temporarily) disabled elektra-mergerequests so that PRs do not always get a wrong error. Furthermore I added for @KurtMi

jenkins build gcc-configure-debian-optimizations please

@KurtMi If you need to change what the build job does, simply modify scripts/configure-debian-optimizations

@sanssecours now also has access to the build server.

Btw. you can cancel jobs if they are superseded by other jobs anyway (currently there is a heavy load). Only take care to not abort jobs for active PRs, otherwise the PR won't get green. (Unless you restart them with the phrase "jenkins build ... please".)

@sanssecours Jenkins was restarted (for the second time). Could you please document it here if you install new plugins. (Updates do not need to be documented.)

Requests for restarts can also be done here.

I changed "Quiet period" from 2 to 5 to give more time to merge multiple PRs and/or push different commits without rebuilding repetitively.

Furthermore, I opened the issue #1689 describing timeouts in builds (I did not add it here due to long error messages).

I also moved some obsolete tasks above in the new section "Obsolete/irrelevant Issues [reason]:".

I updated the plugins on the build server. Hopefully the updates fix the problems we have in PR #1698 and PR #1692.

@markus2330 Can you please restart the build server?

I upgraded Jenkins from 2.73.2 to 2.73.3 and restarted Jenkins.

Hopefully the updates fix the problems we have in PR #1698 and PR #1692.

It might be a general problem not related to these two PRs? Hopefully it is fixed now.

Looks like JENKINS_HOME is almost full 😢.

@markus2330 👋 Could you please

  • clean the home directory or tell me how I can do that,
  • update Jenkins and all outdated plugins?

Thank you for pinging me!

Seems like a plugin had an "Arbitrary file read vulnerability", namely the "Script Security Plugin 1.35".

I upgraded all plugins and also upgraded jenkins from 2.73.3 to 2.89.1.

Furthermore, I resized the disk from 20GB to 50GB.

We should restart the server soon, there are some non-restarted processes that might be affected by library upgrades, which might be insecure at the moment (not related to jenkins, though). @BernhardDenner Can you do the restart (and do the fixes if something does not start up)?

Please do not hesitate to report anything I broke during these upgrades.

The server had load 20 and hardly responded. We need to be careful with "jenkins build all please", and in the longer term we should move the agents away from the main server.

I upgraded to Jenkins 2.89.2 and restarted the server. I'll report when everything is up and running again.

Seems like all agents are now disconnected with the error "The server hostkey was not accepted by the verifier callback".

@BernhardDenner I saw that puppet apply was running, are you currently working on the setup?

I tried to downgrade to 2.89.1 and 2.73.3 without any success: connecting to agents still does not work.

A huge thanks to @BernhardDenner who fixed the ssh problem.

We should stop upgrading Jenkins without reading the release notes; it seems like even the stable updates break too many things. (And they are not even revertible by downgrading!)

I have to report a major bottleneck at the build server. elektra-multiconfig-gcc47-cmake-options takes 14h and elektra-multiconfig-gcc-stable takes 4h. I am not sure if that is new behavior, and I am aware that these jobs are not a single build job, but this bottleneck should not go unnoticed.

Thank you for reporting. The idea was to distribute subjobs of these jobs to the ryzen hardware, unfortunately nobody had time for the setup. If someone is interested, please contact me.

a7.complang.tuwien.ac.at (ryzen) seems to have crashed. I reported the problem. Our admin will hopefully restart the computer on Monday.

I temporarily disabled the incremental (strange error, see #1784), the admin restarted the ryzen server, and then I restarted jenkins (because Jenkins could not connect to ryzen and there was a huge backlog of ryzen builds).

ryzen now works again and builds the backlog.

The idea was to distribute subjobs of these jobs to the ryzen hardware

@markus2330 I've noticed there is an option called Run each configuration sequentially in the configuration matrix settings of the multiconfig job. Maybe it gets distributed automatically if we simply untick this so it builds several config options at once, or have you tried this already?

No, I haven't tried it, please give it a try.

@markus2330 judging from the build server queue this seems to do the trick, i'll apply it to gcc-stable additionally after it worked for gcc-stable-multiconfig

I noticed however that the ryzen doesn't seem to handle those jobs. I think this is because it is configured to only handle jobs matching its tags, and the multiconfig builds don't seem to set those tags appropriately at first glance. So we should either make ryzen execute everything that's possible or set more tags on the build jobs. It looks like ryzen doesn't handle jobs which don't have any tag set at all.

i'll apply it to gcc-stable additionally after it worked for gcc-stable-multiconfig

Thank you!

I noticed however that the ryzen doesn't seem to handle those jobs.

No, it doesn't but it already executes a bunch of other jobs. But maybe we can make v2 to do so?

i'll apply it to gcc-stable additionally

done

But maybe we can make v2 to do so?

I've configured the v2 and now I only wait for the #1806 PR to be merged so I can allow more than one build on it. I thought that 8 jobs should be fine for it since it's an 8-core, with -j 2 in order to utilize the SMT as well?

To restart the build-v2 container in case v2 crashes or gets restarted, use the command below. Note this can only be done if the container has been built already (follow doc/docker/jenkinsnode/README.md for these instructions). Use this command to restart after the container was created but has stopped:

docker start build-v2

Also, to forward the ssh connection of the new build node from the v2 via the a7 to the outside world, i've set up the following ssh tunnel on the a7 (the docker container maps its ssh port to 22222 on the v2):

ssh -L 0.0.0.0:22222:localhost:22222 <username>@v2.complang.tuwien.ac.at

Adding to that, the public ssh key of the docker container changes on every image rebuild and thus has to be adjusted in the build server as well. This is not necessary if the container only gets restarted. To find it out, enter the following on the v2:

sudo docker exec -it build-v2 bash
# now you should be on the docker machine
cat /etc/ssh/ssh_host_ecdsa_key.pub
> ecdsa-sha2-nistp256 <blablablalb> <root@6b906cc01f23>

Only copy the first two fields, i.e. the key algorithm and the key itself; don't copy the user information at the end into ryzen-v2's ssh key configuration in jenkins!

v2 is down, I informed the admin. It's quite strange: a7 and v2 are both completely new hardware and the incidents are quite frequent.

v2 seems to be back and i've restarted the build container there. So hopefully we have faster builds now again. Additionally i've added elektra-haskell to "jenkins build all please" as I want to have stable haskell builds for my typechecker, so testing is a good addition.

Furthermore i want to leave a note here that we additionally want to create another build node that takes care of the mm builds, which now appear to be the new bottleneck on the v2 as well.

Last, @markus2330 i think the point run bashism checker is already done, as this is one of our usual tests /testscr_check_bashisms.sh.

All nodes for the label debian-jessie-homepage||homepage and the build agent debian-wheezy-mr are currently offline. Restarting the build agents does not work. It would be really nice if someone with SSH or physical access to these nodes could look into this problem.

I restarted the vservers but restarting the agents within jenkins did not work with the error "No valid crumb was included in the request". @BernhardDenner Do you have an idea?

Seems like v2 is down, too. So we have 3 non-functional agents 😢

v2 is back, but i wonder why it always goes down? could it be related to our build process?

Regarding the "no valid crumb" i saw that too when i tried to restart the agent on v2, but when i simply tried it again it worked.

v2 is back, but i wonder why it always goes down? could it be related to our build process?

It seems to be a kernel/hardware failure (not even sysreq works when the computer hangs). Our usage might trigger the error. The computer was running without any errors for several months and since we use it we already crashed it three times.

I upgraded the kernel and purged the X-server.

Regarding the "no valid crumb" i saw that too when i tried to restart the agent on v2, but when i simply tried it again it worked.

Thank you! I was now able to start the homepage agent again.

Furthermore, I disabled the agent debian-jessie-minimal and its build job. We should create docker containers for minimal jobs; I added this as a task.

To our surprise, the community server was down yesterday because it crashed and a wrong ARP cache redirected our IPs to other servers. After restarting, everything worked again, but the raid sync is still ongoing. (It might be so slow because of the high load.)

The community server has a near-constant load of 10. At the moment it is 13.20 11.29 9.35. We really should reduce the jobs directly running on the community server and move the load to v1. Any volunteers?

There is an upgrade of Jenkins from 2.89.2 to 2.89.4. Unfortunately I did not find an easy way to see the Changelog (apt-get changelog fails because it is an unofficial package). Any reasons to not do this update?

The upstream changelog is at https://jenkins.io/changelog-stable/
Apparently 2.89.4 contains security fixes.

Thank you for looking it up!

I upgraded to Jenkins 2.89.4 and everything is running again.

elektrahomepage was not started by default after reboots, I changed that (/etc/vservers/elektrahomepage/apps/init/mark=default).

I also activated the test-docker build job.

v2 is down, again 😢

But at least a7 seems to be stable now.

I installed clang-format-5.0 on a7 and on the stretch node (debian-stretch-mr).

For the next PRs please reformat according to clang-format-5.0.

https://build.libelektra.org/jenkins/job/elektra-clang-asan/ was temporarily disabled.

We are currently investigating v2. UEFI is from 6.6.17. It seems like the crashes always happened at weekends, maybe there is a higher load at that time? I'll try to replicate v1 setup on v2.

v1 and v2 are up and running with the same kernel.

@e1528532 seems like your ssh bridge did not start and the command in doc/docker/jenkinsnode/README.md fails with "unable to prepare context: unable to evaluate symlinks in Dockerfile path: lstat /root/Dockerfile: no such file or directory" and then "Unable to find image 'buildelektra-stretch:latest' locally". This means v2 is not reachable at the moment.

@markus2330 wrote: moved issue to #1829

I think one of the latest updates to the build server broke elektra-gcc-configure-debian-stretch, which is not able to connect to the repository anymore:

stderr: fatal: Unable to look up github.com (port 9418) (Name or service not known)

.

I think the problem with elektra-gcc-configure-debian-stretch is the build server ryzen, which is unable to connect to GitHub. I changed the label for the build job from debian to debian-stretch-mr accordingly. Now the build job seems to work again.

ryzen, which is unable to connect to GitHub

Seems like our admin's NetworkManager fix with "managed=true" did not work reliably. After a restart, "/etc/resolv.conf" was again a dangling symlink. I fixed it again; GitHub should be reachable from ryzen. ryzen v2 is unfortunately still not reachable (the ssh bridge is missing).

Elektra 0.8.22 is finally released. I'll add the link to #676 once the website has been built; building the website takes more than an hour. Maybe we can move the homepage build to a faster machine and only copy the resulting website to its location.

I think we have to do something about the server that hosts http://build.libelektra.org. It is just unbearably slow and unresponsive 😢. Personally I do not care if a full build of all tests takes a long time. However, as it currently stands it takes minutes to even connect to the server, if we are able to connect at all:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>502 Proxy Error</title>
</head><body>
<h1>Proxy Error</h1>
<p>The proxy server received an invalid
response from an upstream server.<br />
The proxy server could not handle the request <em><a href="/jenkins/">GET&nbsp;/jenkins/</a></em>.<p>
Reason: <strong>Error reading from remote server</strong></p></p>
<hr>
<address>Apache/2.4.10 (Debian) Server at build.libelektra.org Port 443</address>
</body></html>

.

Yes, not only Jenkins is affected but everything else running on this server. For me the situation is often unacceptable too. It seems like there is too little RAM (2 GiB of swap is used).

Jenkins might be the reason; there are dozens of Java processes leading the list in htop. For a long time we had enough RAM and swap was hardly used, and we did not change much (except upgrading Jenkins and increasing the number of build jobs).

I suggest we stop using the community server as an agent altogether. For this we would need v2 back, but @e1528532 seems to be busy. We could also rent a better server, but then we would need someone who has time for the migration.

@markus2330 Can you please restart the build server? Currently even simple jobs like elektra-todo fail.

I restarted v2 on Sunday but apparently it's already down again, so we should first get v2 stable before thinking about putting other things on it.

I restarted Jenkins and v2. Jenkins seems to be running smoothly again.

@e1528532 The ssh tunnel still seems not to work. Even after restarting a7 it is not possible to connect to v2.

so we should first get v2 stable before thinking about putting other things on it.

The main downtime was caused by the ssh bridge. If v2 had troubles, it was usually restarted within a day.
I now removed the rest of the X server, so I hope v2 is now stable, too. For a7 this seemed to be the trick (no restart was necessary for quite some time). Without load on v2 (which requires the ssh bridge), however, we won't know for sure if it is stable.

What about splitting up discussions about hardware (restarts) and software (Jenkins upgrades)?

There seems to be a network connectivity problem between a7 and v2. v2 is up and running, but I still get "No route to host". Seems like I cannot fix it today.

The network of v2 was down because uninstalling GNOME also removed network-manager. We have now fixed the network (using /etc/network/interfaces) and upgraded to the latest BIOS/UEFI. So hopefully everything is stable now.

Btw. there is one more piece of hardware we could use via an ssh bridge... (PCS)

The ssh tunnel still seems not to work. Even after restarting a7 it is not possible to connect to v2.

Yes, this was not automated. Now I have taken care of everything. The docker container should restart automatically now if the machine restarts; at least I've set the --restart flag to "always" according to https://stackoverflow.com/questions/29603504/how-to-restart-an-existing-docker-container-in-restart-always-mode#37618747
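For reference, the linked answer boils down to a single command (the container name here is just a placeholder):

# switch an existing container to the "always" restart policy (container name is a placeholder)
docker update --restart=always buildelektra-stretch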

Furthermore, I've created a new user called "ssh-tunnel-a7-v2" which has no password set on both a7 and v2 (so password authentication is disabled for that one). I created an ssh key for the user on a7 and added its public key to the authorized keys on v2. Then I added a systemd service at /etc/systemd/system/ssh-tunnel-a7-v2.service which opens the ssh tunnel automatically, following https://gist.github.com/guettli/31242c61f00e365bbf5ed08d09cdc006#file-ssh-tunnel-service. Therefore it should also work when the server gets restarted or the ssh connection crashes, and it no longer depends on me or my user. Due to the use of a key, no passwords have to be used for connections.
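A minimal sketch of such a unit file (the host alias, ports, and ssh options are assumptions, not the actual configuration):

# /etc/systemd/system/ssh-tunnel-a7-v2.service (sketch; host alias and ports are assumptions)
[Unit]
Description=ssh tunnel from a7 to v2
After=network-online.target

[Service]
User=ssh-tunnel-a7-v2
# -N: no remote command; expose v2's sshd on a7's port 22222
ExecStart=/usr/bin/ssh -N -o ServerAliveInterval=60 -o ExitOnForwardFailure=yes -L 0.0.0.0:22222:localhost:22 v2
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target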

On top of that, v2 has of course been restarted with this new automated configuration active. Hopefully it survives the next crash (if there is one); theoretically it should, but we will see.

The build job test-docker always fails, if Jenkins executes the job on ryzen v2:

docker inspect -f . elektra-builddep:stretch
/home/jenkins/workspace/test-docker@tmp/durable-7755b812/script.sh: 2: /home/jenkins/workspace/test-docker@tmp/durable-7755b812/script.sh: docker: not found
[Pipeline] sh
[test-docker] Running shell script
+ docker pull elektra-builddep:stretch
/home/jenkins/workspace/test-docker@tmp/durable-d1c2efc5/script.sh: 2: /home/jenkins/workspace/test-docker@tmp/durable-d1c2efc5/script.sh: docker: not found

. I wanted to restrict the job to nodes other than ryzen v2, but it seems the option for this step is missing in the configuration page of test-docker. Could someone please have a look and fix this problem?

Thank you for looking into it! Isn't it possible to assign multiple labels to the agents? Then you could assign a unique label to ryzen v2 and tie the job to it.

Luckily, we will get support for our build server soon :+1:

Isn't it possible to assign multiple labels to the agents?

As far as I know yes, it is possible to assign multiple labels to an agent.

Then you could assign a unique label to ryzen v2 and tie the job to it.

As I already stated before [1], the option “Restrict where this project can be run” seems to be missing:

I wanted to restrict the job to nodes other than ryzen v2, but it seems the option for this step is missing in the configuration page of test-docker.

.

Ahh, I misunderstood your statement as: "There is no way to write a boolean expression that allows me to say (stable && !ryzenv2)", not that there is no option for agent restriction at all.

Maybe this can be done by the DSL. I'll ask Lukas if he knows what to do.

Hi,

As @sanssecours noted, ryzen v2 does not have docker installed, but it has the docker tag.
test-docker runs require nodes to have the docker tag.

Possible solutions are to either install docker on the node or remove the tag from the node in Jenkins.

Possible solutions are to either install docker on the node or remove the tag from the node in Jenkins.

Thank you for providing a solution for the problem. I just removed the docker tag from ryzen v2. As far as I can tell everything seems to work now.

I updated the description of the 'ryzen v2' node to reflect that it is actually 'only' a docker container running on v2. That is why docker was not available even though it is installed on v2.

I also added a plugin to Jenkins which allows easier build data visualisation (no need to click into each build).

v2 is down again, I reported it.

I rebooted v2 but could not reconnect the agent.

At least we finally got error messages of what happened before the crash (of course there is no guarantee that the error messages have anything to do with the crash):

watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [docker-containe:789]
...
NMI watchdog: Watchdog detected hard LOCKUP on cpu 14
...

Strange, it looks like my restart machinery did work: both the ssh tunnel and the docker nodes were restarted, and I can connect to a7.complang.tuwien.ac.at -p 22222, which means everything should be open. However, Jenkins just shows me an infinite spinning wheel for some reason; no timeout, no nothing.

I tried my manual ssh bridge like we had before: the same. Restarted the docker container once more: the same. So honestly I'm not sure what exactly is wrong without an error message. The only thing I found is someone who apparently has a similar bug (spinning wheel but no message) but no solution other than restarting the whole Jenkins master node (which I haven't tried): https://issues.jenkins-ci.org/browse/JENKINS-19465

EDIT: I tried one of the suggested workarounds (reset the hostname configuration to something that doesn't exist, reconnect, let Jenkins realize the hostname is wrong, change back to the actual hostname); then it suddenly worked without any further hassle. So I guess this error occurred instead of a problem with my restart setup, but let's wait for the next crash to be sure ;)

@markus2330 I bet you already found this out yourself, but a quick search showed me that this might be related to the C-state configuration: https://bugzilla.kernel.org/show_bug.cgi?id=196683 ; there are some suggested workarounds for that.

Seems like the build server ryzen is unable to connect to our repository (https://build.libelektra.org/jenkins/job/test-docker/162/console):

Failed to fetch from https://github.com/ElektraInitiative/libelektra.git

The DNS config was dangling again. Since I don't fully understand why it is set up the way it currently is, I only restored the nameserver settings and restarted the docker build job.


Since I don't fully understand why it is set up the way it currently is

I am afraid nobody understands it. Maybe the DNS server is misconfigured and does not give proper nameserver information. For v2 we uninstalled the network manager and it seems like resolv.conf is now stable there. So one option is to uninstall network manager on a7, too. (and use /etc/network/interfaces) There is no reason that v2 and a7 have diverging setups, it is only because of sloppy administration.

Ideally (in the long run), we manage both using Puppet.

https://bugzilla.kernel.org/show_bug.cgi?id=196683 , there are some suggested workarounds for that

C6 should be disabled but we will continue the investigation.

The new build agent "ryzen v2 docker" does not seem to have a D-Bus daemon running, unlike "debian-stable-mm".

Can someone please either install/start it or tell me which script configures the multiconfig-gcc47-cmake-options builds so that I can add a snippet to make sure it is started?

@markus2330 I suspect that since both ifupdown and NetworkManager are enabled, they get into each other's hair. Disabling one of the two would certainly help.

Okay, I removed gnome and network-manager on a7 to be consistent with v2.

The new build agent "ryzen v2 docker" does not seem to have a D-Bus daemon running, unlike "debian-stable-mm".

The build agent lives within a docker container; hopefully @ingwinlu or @e1528532 can help you start up the dbus daemon there.

Thank you. It should be fairly easy. I start it in the docker container (Ubuntu 17.10 artful built from docs/docker/Dockerfile) with the following commands:

mkdir /var/run/dbus # create directory for pidfiles & co
dbus-daemon --system

The build server ryzen is unable to connect to our repository again 😭.

Sorry, my fault. Seems like stopping network-manager was enough to break the system.

It should be fixed now. Please do not hesitate to report any further problems.

@markus2330 Pretty sure it would have broken again on the next reboot.

I took the liberty of redoing the /etc/network/interfaces file (and moving the interface configuration into /etc/network/interfaces.d). That, combined with the removal of network-manager, should hopefully keep it stable.

Please review the configuration and maybe trigger a reboot to see if it worked.
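For illustration, such a file could look like this (the interface name and address are assumptions, not the real configuration):

# /etc/network/interfaces.d/internal (illustrative only; interface name and address are assumptions)
auto enp37s0
iface enp37s0 inet static
    address 192.168.173.1
    netmask 255.255.255.0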

@ingwinlu Thank you for the fix, the reboot worked nicely.

I found out that I removed too many packages, so I installed Java (openjdk 9 headless and default-jre) again :smile:

@ingwinlu Could you please make dbus run on the v2 agent? Ideally, please also relocate "/home/armin/buildelektra-stretch/Dockerfile" to some non-user-specific destination.

@markus2330 My proposal for how to proceed with the build environment actually foresees removing the current dockercontainer-on-v2 node and replacing it with a docker-capable one (i.e. no longer pointing to a docker container on v2 but directly to v2).

Afterwards we can setup the build pipeline to build the image itself from Dockerfiles checked into the repository to provide the different environments needed for tests.

I can prioritize the rollout of a dbus-capable docker image when I get to it, but I would not like to do work that will soon be irrelevant if I don't have to.

Yes, that makes sense!

The longer-term goal for v2 is that we will have to share it with some other docker containers (not for Elektra), so it would be good if all our parts were virtualized in a way that they cannot influence the other docker containers (maybe recursive docker or Xen?). We could/should do the same for a7 to have identical setups.

We will continue to have direct access to the machines, but we should reduce any risk of Jenkins killing Docker containers it has nothing to do with.

For some agents we already have a Puppet setup. It would be great if we do not bypass it or, even better, extend this setup to a7 and v2. I hope @BernhardDenner can give you more info about that soon.

The build server is down again 🙌.

By the way, we could move the discussion about the build server status to a GitHub Team discussion, since this topic might not be interesting to all people subscribed to this issue.

Yes, the whole server is down, including the build server :cry: And v2 is down, too (independently). I restarted v2 and changed the C-States option in UEFI. But it seems like there is a major problem with our server. Hopefully we get it replaced by better hardware with more memory :crossed_fingers:

GitHub Team discussion

Isn't everyone of ElektraDevelopers notified if we write something in the GitHub Team discussion? Here in this issue everyone can decide if he/she wants to subscribe.

For me a still open question is if we should split this issue up into two issues: hardware related and Jenkins related.

Isn't everyone of ElektraDevelopers notified if we write something in the GitHub Team discussion?

As far as I can tell, yes.

Here in this issue everyone can decide if he/she wants to subscribe.

That is also true for Team Discussions.

Build server is up again. Settings changed:

  • Adjusted Xms and Xmx settings to reduce the amount of garbage collection when lots of builds are queued
  • I noticed we use SCM polling. I throttled the poller to 4 concurrent polls globally to hopefully reduce a bit of the spikes the server is currently getting.

EDIT:

  • Set the number of builds to keep to 30 for all pipelines, as according to multiple sources those get read when accessing the web UI, and thus a large number of old builds slows down requests

Update Jenkins to ver. 2.107.1

  • Update jenkins war file
  • Disable the "Use browser for metadata download" plugin security setting, as it would not allow updating plugins
  • Update plugins to latest available versions

EDIT 2018-03-18:

  • Added a second executor to all nodes running on mm

  • Deprioritised all agents running on mr

Master should be way more responsive now under load. "Build all" should take closer to 2h compared to the 4h+ from before.


EDIT 2018-03-24:
sorry for the delays, busy week...

  • Added a new job to the jenkinsserver (elektra-jenkinsfile) that will run the Jenkinsfile found in the repo (once it exists)

EDIT 2018-03-28:

  • Redid the tunnel unit file, now it parses environment files and thus can be adjusted to point to multiple targets
  • Added v2 server as a slave capable of running docker

    • added jenkins user on v2

    • installed openjdk-9 on v2


EDIT 2018-03-29:

  • Fix ulimit settings on jenkins master

Although I'm pretty sure @ingwinlu or someone is already on this: It seems like ryzen v2 is misconfigured:

FATAL: Could not apply tag jenkins-BUILD_FULL=ON,BUILD_SHARED=ON,BUILD_STATIC=ON,ENABLE_DEBUG=ON,ENABLE_LOGGER=ON-185
hudson.plugins.git.GitException: Command "git tag -a -f -m Jenkins Build #185 jenkins-BUILD_FULL=ON,BUILD_SHARED=ON,BUILD_STATIC=ON,ENABLE_DEBUG=ON,ENABLE_LOGGER=ON-185" returned status code 128:
stdout: 
stderr: 
*** Please tell me who you are.

Run

  git config --global user.email "[email protected]"
  git config --global user.name "Your Name"

to set your account's default identity.
Omit --global to set the identity only in this repository.

fatal: empty ident name (for <[email protected]>) not allowed

from https://build.libelektra.org/jenkins/job/elektra-multiconfig-gcc47-cmake-options/185/BUILD_FULL=ON,BUILD_SHARED=ON,BUILD_STATIC=ON,ENABLE_DEBUG=ON,ENABLE_LOGGER=ON/console but happened on multiple occasions.

Oops, sorry. It has no deps installed and should only act as a docker host. I will remove the additional flags.
//EDIT: should be done, hopefully that was enough
//EDIT2: I also disabled test-docker for now as it obviously cannot find the build images required to run the tests.

But damn, the thing is fast if it actually gets something it can build.
//EDIT3: re-enabled test-docker to only run on nodes with the docker-prefab tag and gave that tag to ryzen

Sadly the problem seems to be bigger than originally expected.

Some jobs have their nodes hardcoded. Some tags are outdated (stable on jessie nodes). Some jobs did not require the right nodes and were only executed on the correct one because they had been executed there once before and built successfully.

The introduction of a new node (ryzen v2 native) scrambled the allocation around a bit even though it should not have.

Please expect some unexpected behaviour until everything is running where it was again.

Changelog:

  • renamed nodes to <os>-<hostname>-<buildenv>
  • disabled elektra-multiconfig-gcc47-cmake-options
    it actually has not been running on gcc47 slaves for quite some time now, with a mix of gcc49 or gcc63 building depending on where it was scheduled. If re-enabled, it should probably go onto the gcc63-enabled docker container on v2
  • retagged a ton of jobs (might have missed some of them)

    • elektra-todo was requiring stable, but not all stable nodes actually had sloccount installed

    • many more similar cases

a7 seems to be down.


Do you want to work on it? I could reboot it today. Alternatively, our admin or I could reboot it on Tuesday.

If it is not too much of a hassle. Otherwise I can't work on my PR over the weekend.


Okay, I restarted it and also disabled the C6 state in the BIOS/UEFI (it was enabled). I also removed gnome/xorg (I thought I had already done that?).

Btw. the screen was completely black, so we can only guess what the cause was.

These have been popping up on a7 every 10 minutes or so:

Apr 04 07:14:23 a7 kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 07:14:23 a7 kernel: [Hardware Error]: CPU:0 (17:1:1) MC15_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0xdc
Apr 04 07:14:23 a7 kernel: [Hardware Error]: Error Addr: 0x00000003705a2f00
Apr 04 07:14:23 a7 kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x0000015c0a400f03
Apr 04 07:14:23 a7 kernel: [Hardware Error]: Unified Memory Controller Extended Error Code: 0
Apr 04 07:14:23 a7 kernel: [Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
Apr 04 07:14:23 a7 kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x700b45 offset:0xf00 grain
Apr 04 07:14:23 a7 kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

Those might be the reason for the a7 downtime as well as some of the strange build behaviour on a7 lately.

Disabled valgrind section in elektra-ini-mergerequests as it was run via make run_memcheck

These have been popping up on a7 every 10 minutes or so

Yes, we already saw them. When the computer was bought, someone actually checked that ECC works, and no such errors occurred back then. Is the frequency of these errors somehow dependent on the load of the system?

Disabled valgrind section in elektra-ini-mergerequests as it was run via make run_memcheck

Thanks for cleaning this up.

I am having random outages (containers crashing, builds stopping in the middle, disconnects, ...) on a7 without any 'real' logs again; only the already mentioned memory corrections.

Thank you, seems like something is very wrong. And we now also have uncorrectable errors:

Apr  5 09:50:40 a7 kernel: [39549.503787] mce: Uncorrected hardware memory error in user-access at 73d6ce880
Apr  5 09:50:40 a7 kernel: [39549.503794] [Hardware Error]: Uncorrected, software restartable error.
Apr  5 09:50:40 a7 kernel: [39549.505882] [Hardware Error]: CPU:2 (17:1:1) MC0_STATUS[-|UE|MiscV|-|AddrV|-|Poison|-|-|UECC]: 0xbc002800000c0135
Apr  5 09:50:40 a7 kernel: [39549.506581] [Hardware Error]: Error Addr: 0x000000073d6ce880
Apr  5 09:50:40 a7 kernel: [39549.507287] [Hardware Error]: IPID: 0x000000b000000000
Apr  5 09:50:40 a7 kernel: [39549.507980] [Hardware Error]: Load Store Unit Extended Error Code: 12
Apr  5 09:50:40 a7 kernel: [39549.508677] [Hardware Error]: Load Store Unit Error: DC data error type 1 (poison consumption).
Apr  5 09:50:40 a7 kernel: [39549.509378] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
Apr  5 09:50:40 a7 kernel: [39549.510136] Memory failure: 0x73d6ce: Killing java:1470 due to hardware memory corruption
Apr  5 09:50:40 a7 kernel: [39549.510908] Memory failure: 0x73d6ce: recovery action for dirty LRU page: Recovered

There goes a7 again.

On a more productive front: I installed the Blue Ocean frontend for Jenkins. Preview

There goes a7 again.

It's really frustrating. I restarted the machine and reconnected the agents. I'll ask our admin to replace the RAM tomorrow, so expect downtimes tomorrow.

On a more productive front: I installed the Blue Ocean frontend for Jenkins. Preview

Looks great! Maybe you can show it to us in the next meeting?

Our admin is already in the weekend. I'll reboot a7 and reset BIOS/UEFI to factory defaults. If the errors continue over the weekend he will hopefully exchange the RAM.

EDIT: No build job was running, thus no build job was canceled.

EDIT: Everything is up again. No Memory errors so far.

Looking much better. Did somebody watch too much Linus Tech Tips and overclock the server?

Sorry, I had to stop Jenkins for some time. The server had load 20 and things that I needed to do were not possible anymore (>1h waiting time, then I gave up and stopped Jenkins).

Is it possible that even non-started build jobs need RAM? (the queue list was very long) Otherwise the local build jobs are the best candidate for these problems. (3 local build jobs were running)

@ingwinlu Ideally, we do not build anything on that server. Furthermore, can we make build jobs dependent on the load of a server? (Do not start jobs on a server with load > 5?)

I started everything again but I hope we find a quick solution for that.

I bumped the VERSION and CMPVERSION in the Jenkins system settings.

@ingwinlu It would be great if we had these settings also within Jenkinsfile.

@markus2330 see 8de9272051fe903a7df08f0abdf18879701f7ac9 for an example of how to achieve this in a Jenkinsfile

Removed make run_memcheck from the following targets because they had been failing for some time and finally showed up in the build system thanks to #1882:

  • gcc-configure-debian-stretch-minimal
  • gcc-configure-debian-wheezy
  • elektra-gcc-i386

restrict elektra-gcc-configure-debian-stretch to run on nodes: stretch && !mr

Update jenkins master to ver. 2.107.2 + update all plugins

I tried to add allowMembersOfWhitelistedOrgsAsAdmin to all build jobs today, but it seems like I still cannot properly trigger a "build all" (see #1863); only some jobs get executed.

@markus2330 https://github.com/janinko/ghprb/issues/416#issuecomment-266254688

Can someone please

  • fix
  • disable, or
  • (even better) delete

elektra-clang-asan 🙏. Currently the build job fails although all the failing tests:

  • testlib_notification
  • testshell_markdown_base64
  • testshell_markdown_ini
  • testshell_markdown_mini
  • testshell_markdown_tcl
  • testshell_markdown_xerces
  • testshell_markdown_tutorial_validation

work just fine on Travis.

They don't test the same thing, as (for example) they use different clang versions...

Since this thread is absolutely untrackable for bug reports or longer discussions, I will open up new issues for clang and clang-asan as soon as I get to it.

They don't test the same thing, as (for example) they use different clang versions...

I agree, while Travis uses the outdated clang 5.0.0, the Clang version on elektra-clang-asan is ancient (3.8.1). I do not see the value of an ASAN enabled build job for such an old compiler.

I created #1919 for the failing "testlib_notification" test on "elektra-clang-asan".

I tested a "build all" with all mr agents disabled and the master was perfectly responsive, while a "build all" with all mr agents enabled actually timed out some of the builds. Hence #1866 will definitely provide an improvement if we can get rid of all the mr agents (except wheezy, as it is not containerizable).

Further testing shows it is pretty much only the homepage build job. I removed it from "build all" for now, so it only gets run explicitly.

It will be replaced with a containerized solution.

v2 is online again with latest BIOS.

Please report any segfaults here (the CPU might be buggy).

a7 seems to be down?


a7 is up again with latest BIOS

a7 down again?

Yes, it had crashed. It showed the login prompt without any error message and did not react at all to any input (including SysRq). Only a hard reset helped.

If you have any idea what the problem could be, please tell.

Sadly, no persistent journal is set up on a7, so there are no logs.
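For reference, enabling a persistent journal is a small change (a sketch, assuming default journald settings):

# create the persistent journal directory; with the default Storage=auto journald will use it
mkdir -p /var/log/journal
# alternatively set Storage=persistent explicitly in /etc/systemd/journald.conf
systemctl restart systemd-journald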

When can we expect a7 and v2 to be available again?

Ohh, I did not know they were down. I'll ask our admin to reboot and if he is not able to I will do it at about 17:00.

Edit: He said he will reset them right now.

Edit: They are both up again.

I rebooted a7 and v2 manually today. It seems v2 is not reachable anymore. Can you please check whether it is actually running?

//EDIT: solved by fixing network configuration on both machines

Apparently a7 has gone down again.

Ok, I'll reboot it. Otherwise we will never get this release done.

Any indication of the cause? Just networking issues, or was the machine unresponsive again?

Everything should be up and running now.

I caught it at the right moment: there were some logs until it finally froze completely.

The logs were:

INFO: rcu_sched detected stalls on CPUs/tasks:
...
rcu_sched kthread starved for 7770 jiffies
watchdog: BUG soft lockup - CPU#2 stuck for 22s! [docker-gen]
... (many repetitions)
NMI watchdog: Watchdog detected hard LOCKUP on cpu 2

That could be anything, from the faulty Ryzen CPU to a bad PSU :(


About #1993

@ingwinlu Please disable jobs if they currently do not succeed (or at least do not trigger them by default nor with "jenkins build all please"). It should have high priority that we do not fail jobs in PRs where actually everything is okay. (asan failing for quite some time was not a good situation)

If they fail because of some Jenkins action you did, restarting the jobs would be nice :heart: Saying so here in #160 is also okay.

v2 is up again, with a new CPU!

Please submit many jobs for stress-testing :smile:

a7 seems to be down again.

Thank you, everything is up again.

I restarted a7 again.

Having a circuit which resets a7 every hour would most likely increase availability.

When can we expect to see a7 back online again?

a7 is back online

v2 is back online

The power supply of v2 will be exchanged in about 1h.

@ingwinlu can you disable the agents and enable them again afterwards? (If you are currently working on it.)

v2 agents are disabled for now

v2 is running again. It has a new power supply. Please submit many jobs to stress test the machine.

I updated the Jenkins plugins. The resulting reboot partly restored old Jenkins nodes (from before they were renamed), resulting in broken builds everywhere, as git repositories were broken.

I cleaned up affected repositories and cleaned out cached docker containers just to be sure.

a7 is down again and as such docker based builds are not available again.

I am also rolling back the update to xunit as it violates security restrictions.

I rebooted a7 and reconnected all agents.

docker builds are not available as a7 is down again.

I rebooted the server and reconnected the agents.

I think replacing a7 is the best way forward, see #2020

I think it's down again; my latest commit resulted in Cannot contact a7-debian-stretch: java.lang.InterruptedException

@e1528532 Thank you for writing it here! If you want, you can also vote in #2020

I restarted a7 and reconnected the agents.
Let us hope for the best that no problems occur during the weekend.

a7 is down again :cry: Surprising that it nearly worked the whole weekend. Might be the uptime record of the week. Nevertheless #2020 did not get many comments.

I restarted a7 (it reacted to SysRq) and someone else started Jenkins. Everything is up and running again.

a7 just went down again.

Thank you for the info! I restarted a7 and reconnected the agents.

Does it make sense that we have the agent "debian-jessie-minimal"? I think you can remove it safely once it is integrated in Docker. (and it seems like it already is)

EDIT:
In https://build.libelektra.org/jenkins/computer/a7-debian-stretch/log
and https://build.libelektra.org/jenkins/computer/v2-debian-stretch/log
there are warnings:

WARNING: LinkageError while performing UserRequest:hudson.Launcher$RemoteLauncher$KillTask@544b40e
java.lang.LinkageError
    at hudson.util.ProcessTree$UnixReflection.<clinit>(ProcessTree.java:710)
    ...
Caused by: java.lang.ClassNotFoundException: Classloading from system classloader disabled
    at hudson.remoting.RemoteClassLoader$ClassLoaderProxy.fetch4(RemoteClassLoader.java:854)

Bad news: v2 also went down.

EDIT: But I seem to be able to connect to it via ssh...
EDIT2: I issued a reboot on v2, but now I can no longer connect. Still pingable from a7 though...

EDIT3: Now a7 is down as well.

Thank you for reporting! I rebooted a7 and v2. We should reconsider #2020.

I think v2 is down again:

Cannot contact v2-debian-stretch: java.lang.InterruptedException

With v2 everything was fine but a7 was down again. All agents are now online again.

v2 seems to be semi-unresponsive again: pingable from a7, but no ssh access at all. Judging from the symptoms alone, it should be the btrfs issues. Can you reboot everything before you go home for the weekend?

@ingwinlu Thank you, @waht and I rebooted v2 successfully, but I cannot connect the agents ("Connection refused (Connection refused)"). Any idea what is wrong here? (Interactive ssh login works.)

ssh only worked when not connecting via the bridge.

After restarting the bridging service, the connection could be established.

Seems like the ssh tunnel ended up in a weird undefined state and did not kill itself. Not sure why it did not kill itself (ServerAliveInterval is on).

I also needed to manually clean out all workspaces on v2, as the fs was corrupted and all build jobs on it failed.

a7 seems to be down. I am not sure if I can reboot it before Monday.

I restarted a7 and v2. (v2 because there were error messages on a7 that it cannot create the ssh bridge to v2. But even after restarting v2, the same error messages occurred. Nevertheless, the ssh bridge seems to work. Maybe some dependency (network?) is missing in the bootup scripts of a7?)

Maybe some dependency (network?) is missing in the bootup scripts of a7

No it is there.

it cannot create the ssh bridge to v2

I believe this behaviour occurs when the v2 kernel starts to be unresponsive (and thus no incoming network connections can be established). In the past you mentioned logs indicating btrfs errors on v2.

I am preparing the build server for a shutdown to replace the CPU later.

a7 and v2 are back again (a7 with new CPU, v2 with its root filesystem checked)

While it kept up during the day (with consistent building), it seems like a7 just went down again.

Yes, I'll restart tomorrow morning.

We should again discuss #2020.

I restarted a7. It was again a CPU hang.

a7 is down again.

Thank you, I restarted it. All agents are again online.

Looks like some of the debian nodes are down, and thus a few PRs have already been waiting a long time for the test execution to start. Is that intended?

Looks like some of the debian nodes are down, and thus a few PRs have already been waiting a long time for the test execution to start. Is that intended?

During an upgrade on the mm-debian-unstable node the machine became unresponsive and we have not been able to connect to it since. Since the owner is not responding to our emails it will probably be gone forever.

While we have already ported a good number of build jobs over to the new system, the ones missing are those that are currently hanging in the queue.

I disabled the affected jobs and marked them as to be replaced + removed from the docs

@ingwinlu Thank you for taking care of this!

Re-adding the xdg tests is quite important if someone wants to work on the resolver. I added it to the top post here. Can you update what you have already achieved in the checklist above?

This time v2 is the lucky winner.

@markus2330 i cleaned up the top post.

Thank you for the cleanup! I'll reboot v2 (and maybe a7, let us see) tomorrow in the morning.

I rebooted it. There was no reaction and no message. See #2020

Please reboot a7 and v2.

I rebooted a7. I did not find any problem with v2, should I reboot it anyway?

i7 is now available at 192.168.173.96. Can you please create a bridge to it via a7?

You need to create a ssh-tunnel-a7-v2 user on the machine or create an account for me.

We added an additional build slave i7-debian-stretch that will help with libelektra build jobs.

v2-docker-buildelektra-stretch is now offline, as no more build jobs are scheduled on it. The ssh bridge on a7 that exposed the agent has also been disabled.

Hello @ingwinlu,
as discussed in the last meeting, I would need the access point.
Could you please send me the information via email? My email address is in AUTHORS.md.
Btw. I could not find your email anywhere; maybe you should add yourself to this file.

Our v2 build server will be offline till 31.07 09:00 as we are running benchmarks on it. Expect longer build times.

Sorry for the inconvenience.

//EDIT: extended downtime by 2 hours

Seems like the extended downtime has now also passed. It would be good to have the fast builds back after lunch (about 13:00). :smile:

mm-debian-unstable was upgraded and is back online. Are there jobs which we can reactivate and pin to the server?

It seems like i7's disk is full. My jobs fail with No space left on device (Job and Job)

I really like the new build interface (dockerization & jenkinsfile) by the way. Very helpful for reproducing build errors.

Thank you for reporting!

Unfortunately resizing would require a restart (the rootfs needs to be made smaller before the other partition could be extended) and would only bring 20G. I removed the Jenkins build folders, but they were tiny, so we are still at 99%.

So cleaning up _docker would be more effective:
@ingwinlu seems like _docker/overlay2 is huge. Any idea why all this stuff accumulated there?

I force-cleaned docker artifacts with docker system prune -fa. This cleaned up around 190GB of space used by docker images.


Thank you!

Can we put this in libelektra-daily or into a cronjob?
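A cron entry along these lines should do it (the schedule and log path are only suggestions):

# /etc/cron.d/docker-prune (sketch): clean unused docker data every night at 03:00
0 3 * * * root docker system prune -fa >> /var/log/docker-prune.log 2>&1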

Daily does something similar but less aggressive. Will have to take another
look when I am back in Vienna.


As you might have noticed, the build server was down in the morning (and maybe during the night). The power supply unit was damaged and has now been replaced.

Furthermore, a7 or v2 might go offline for benchmarks for short durations next week. You will see the offline message "benchmark" if Anton starts benchmarks. If a computer stays offline for too long (e.g. more than one day), please contact us. (Then someone might have forgotten to toggle it back online.)

Resolved an issue with our sid image.

testkdb_allplugins segfaulted in our sid image during debian-unstable-full-clang tests but only when executed on v2. I manually updated the image to use the latest available packages and pushed it onto our registry.

https://build.libelektra.org/jenkins/blue/organizations/jenkins/libelektra/detail/master/242/pipeline/411/ passed, but I will keep an eye on it.

The issue has been mentioned in #2216 and #2215 (@mpranj @sanssecours).

I implemented public access to our docker registry. See here for some documentation on how to access it.

Let me know if something does not work as expected.

//EDIT: it seems as if pushing does not work even though login succeeds.
//EDIT2: public access is disabled again (https://github.com/moby/moby/issues/18569). Restored functionality of the build system.
//EDIT3: the public repo is up again at hub-public.libelektra.org
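Pulling should then work with the standard docker CLI, e.g. (the registry path and image name below are assumptions):

# pull an image from the public registry (image name is hypothetical)
docker pull hub-public.libelektra.org/buildelektra-stretch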

Anton wants to replace the mainboard of the computer where hyper-threading is deactivated next Tuesday or Wednesday (we can choose). Does anyone need the build server on one of these two days?

I restarted the docker registry and ran a cleanup manually. Hopefully that resolved any build issues with the website image.

@ingwinlu thank you for the maintenance work!

Unfortunately the WebUI stage fails quite reliably, e.g. 321 or 320 (failed even earlier but also when pulling WebUI?).

I am more and more convinced that the canceling of jobs on master is a bad idea. We have nearly no successful builds on master because the builds are either canceled or fail due to network problems. In the commit history it is hard to tell what happened because either way they are simply shown as failure.

As a workaround I disabled the unreliable stages in c3b59ecef95287ffc33b094b37e03d0ec6b5710f, but I hope we can enable them again soon!

Should a7-debian-stretch still be offline for the benchmarks? (taken offline since Feb 21, 2019 10:47:56 AM)

Thank you for reporting! Seems like Anton forgot to reactivate it. I activated the node again, and I also removed the old nodes (except mm, as they are still running).

There was a downtime of our server in the morning. Everything runs again but we got an offer that they will exchange the hardware. So most likely we will have another downtime of about an hour today.

The server is up again. Unfortunately we got the same hardware. If somebody has time for the installation/setup, we can upgrade the hardware.

Looks like the Jenkins builds are quite slow (multiple hours for a full build). As far as I can tell, only a7-debian-stretch and i7-debian are executing tests, while all the other nodes are idle. Is this expected behavior?

Thank you for reporting this problem!

No, this is not expected behavior. It seems like v2 is down. I'll reboot it asap.

v2 should now be up again

v2 is down and I am afraid it will stay like this until Monday. Builds will be very slow till then.

v2 is up again with a new mainboard

I'll upgrade all 3 build agents (i7, v2, a7, in that order) to Buster to avoid #2852

I'll try to keep the downtime as minimal as possible. Build jobs might fail, please restart them (after the agents are up again).

i7-debian-buster, formerly i7-debian-stretch, is online again

v2 is next

v2-debian-buster and a7-debian-buster are also online again

I restarted the previous successful build job on master to see if everything works again:
https://build.libelektra.org/jenkins/blue/organizations/jenkins/libelektra/detail/master/853/pipeline

Furthermore, I added the PR https://github.com/ElektraInitiative/libelektra/pull/2876 to enable buster build jobs.

v2 seems to be down (the ssh connection also fails); unfortunately I am not in Vienna. I hope our sysadmin will fix it tomorrow.

a7 is now also down and with it i7 (which is connected via bridge over a7).

So currently no builds can be performed. I contacted the administrator.

All servers are up again. Please restart the builds by either pushing new commits or writing "jenkins build libelektra please" as comment to your PR.

Technical note: a7 went down because of "watchdog bug soft lockup". I tried to add "nomodeset nmi_watchdog=0". Let us hope the machines are not as unstable again as they have already been.

a7 (and also v2 and i7, because they are connected via a bridge over a7) is down. I contacted our admin. Please try to avoid starting builds now, as it will only make the queue longer.

a7 is back online (it already was yesterday); v2 was not affected.

According to the build server status page the servers:

  • a7-debian-buster,
  • i7-debian-buster, and
  • v2-debian-buster

are down. The Jenkins data directory seems to be pretty full too. And while we are at it: it would be great if we could upgrade Jenkins and its plugins. I am interested in fixing these problems. There are two problems though.

  1. I do not have much (or really any) experience administering servers.
  2. I would probably need physical access to the machines, since they seem to be quite unstable.

a7 is a bridge to i7 and v2, so with a7 being down we do not know about i7 and v2.

The access is no problem; I can give it to you. But you should be aware that upgrading Jenkins is a large operation, as it usually requires reconfiguring Jenkins (according to the release notes, of which there are many, since we have not upgraded in a while) and fixing the Jenkinsfile (according to API changes in the plugins). Drop me an email if you want access.

a7 (and all others) are up again.

a7 is down again, I contacted the admin.

Technical note: a7 went down because of "watchdog bug soft lockup".

A quick search suggests that this might be a BIOS problem. Did anybody check if there are BIOS updates available?

a7 is a bridge to i7 and v2, so with a7 being down we do not know about i7 and v2.

That seems like a bad design. Is there no way around that?

A quick search suggests that this might be a BIOS problem. Did anybody check if there are BIOS updates available?

We installed a new BIOS, replaced the CPU, and upgraded the kernel (see messages above). The system was stable after that. Now, after the upgrade to Debian buster, it is unstable again.

Nevertheless, I asked the admin if there is a new BIOS available.

That seems like a bad design. Is there no way around that?

i7 and v2 are in a private network, as there are not enough IPv4 addresses. I asked our admin if maybe an IPv6 setup would be possible.

I asked our admin if maybe an IPv6 setup would be possible.

We don't really need IPv6; it would be enough to use another, more stable server as the bridge.

Most likely, v2 is as unstable as a7 (there was only one crash, but this does not say much, because as soon as a7 dies it takes the load off v2). We could use i7 as the bridge. But if v2 and a7 go down, i7 alone is not of much use; it would take many hours to finish a build job. Furthermore, i7 does not have enough space for the docker registry.

So this change would be a lot of effort with little gain.

To fix the actual problem of a7 and v2 is much more promising.

Furthermore, i7 does not have enough space for the docker registry.

In that case the only solution is to fix a7.

Unfortunately a7 is down again :cry:

I tried to boot with the old kernel but this did not help.

For the BIOS there are some other versions available, but according to their release notes there is little hope that they will fix this problem, and there is no way to downgrade again if it gets worse...

The BIOS for a7 is currently being upgraded. Furthermore, we will try to use a newer kernel from backports.

a7 will hopefully be up again soon.

The new BIOS did not help; now a7 crashed within minutes.

a7 is down again, I contacted the admin. The newer kernel from backports will be tried in the next reboot.

a7 is up again with the 5.2 kernel

I think it crashed again...

Do we still get the same error messages or is there at least some change?

Yes, a7 is down again; I reported it to the admin. He will tell us about the messages when restarting.

Does anyone have another idea? (We upgraded the BIOS and kernel already.)

Some sources suggest problems with the nouveau graphics driver and that we should try nouveau.modeset=0 (somehow this is different from nomodeset). Disabling "C-states" in the BIOS was also suggested.

Yes, a7 is down again; I reported it to the admin. He will tell us about the messages when restarting.

Does anyone have another idea? (We upgraded the BIOS and kernel already.)

Maybe disable a7 as a Jenkins slave to determine if the crashes only occur when 'real' load is on the machine.

@ingwinlu Thank you, good idea. I have now reduced a7 to one build job (it was 2). For the weekend (once the admin has left the office) I will disable the agent completely.

@kodebach: Thank you, I will forward the information to the admin.

Is there any timeline for when a7 will be up again?

a7 is up again, with hyperthreading turned off and only one concurrent build job.

We could also take some of the load off a7 by moving the alpine and ubuntu-xenial builds to Cirrus. Both of them are simple "build and test" runs. They aren't doing anything special like reporting coverage.

Cirrus allows 8 concurrent Linux builds per user. Currently the linkchecker build is the only Linux build on Cirrus.

In fact ubuntu-xenial is a bit redundant, since our Travis builds run on Ubuntu Xenial.

Thank you for the tips, but we do not plan to offload any Linux builds away from Jenkins. On the contrary, @Mistreated will work on improving our Jenkins infrastructure to be even more up-to-date and useful (e.g. by building more packages). The advantages of our Jenkins are:

  1. we have it fully under our control
  2. we can easily scale it up (only Java+Docker is required on build agents)
  3. the Jenkinsfile is very neat and (for the most parts) quite easily extensible

But of course, everyone is welcome to also extend Cirrus (or any other additional build system which is offered for free, see #1540).

It was just meant as a temporary solution, to counteract disabling hyperthreading and limiting to 1 concurrent job on a7.

a7 only builds a small part (about 2/5), so the reduction by half should be barely noticeable. Or are there any specific problems now? (At the moment, of course, it takes time to catch up with the many jobs from the downtime.)

a7 only builds a small part (about 2/5)

2/5 is 40%. I wouldn't consider that a small part.

Or are there any specific problems now?

No, in fact it seems to be working better than before.

Sorry, I meant about 2/6 (1/6 is i7, 3/6 is v2). And this part is not removed but only reduced.

No, in fact it seems to be working better than before.

Perfect!

Seems like a7 is down again 😭.

Thank you! I reported it.

In the future it would be excellent if whoever detects a problem first reports it directly to

herbert at complang.tuwien.ac.at

It is enough to say that "a7 ist leider nicht erreichbar" ("a7 is unfortunately not reachable").

And then also report it here, so that Herbert does not get several emails.

Surely there is a way to make the Jenkins master send such an email automatically and maybe also post to this GitHub issue. It would be very weird if there were no Jenkins plugin for such a simple task...

Yes, there is https://wiki.jenkins.io/display/JENKINS/Mail+Watcher+Plugin, but I am not sure that it does exactly what we want. It might also send emails when someone takes an agent offline on purpose. And a personal email is much more likely to be handled quickly by the admin.

If we automate something, it should be directly rebooting the PCs when they are unreachable (maybe they even have some kind of watchdog built in?).
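If the machines do have a hardware watchdog, letting systemd feed it would be one option, e.g. (a sketch, assuming systemd and an available /dev/watchdog):

# /etc/systemd/system.conf (sketch): let systemd feed the hardware watchdog;
# the machine reboots automatically if the kernel stops responding
[Manager]
RuntimeWatchdogSec=30s
ShutdownWatchdogSec=10min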

a7 has been restarted and the "global C-State control" disabled.

It is, however, not online as a build agent.

Let us see if it also crashes without load. v2 and i7 work again.

The admin (Herbert) is not available tomorrow, so I'll leave a7 off as a build agent for now.

My plan (if there are no protests and a7 does not crash before then) is to turn a7 on as a build agent tomorrow. If a7 then crashes again, Herbert can restart it on Friday. Is this okay?

If the queue isn't too long, I think we should keep the build agent on a7 disabled for a bit longer. The last crash happened after 3 days. If we enable it tomorrow, we won't know whether the build agent caused the crash or not, unless it crashes before then.

Ok, then let us see how the queue size looks.

I hope that "global C-State control" finally fixes the problem and I think we need high load to test it.

The queue was very long and the master builds were all hanging as they need a7 for website deployment.

So I started the a7 agent again.

Some of the recent Jenkins build jobs were canceled because of missing disk space on the main build server. I freed some space by removing logs of old build jobs. Please note that I might also have removed some log files of new build jobs. In some cases the Jenkins build for your latest commit might have failed and you now only see some message about a 404 error. In that case, please either

  • use jenkins build libelektra please in a comment below the PR to restart the Jenkins build, or
  • rewrite the last commit without changes (e.g. using git commit --amend) and do a force push (see the sketch below)

. Sorry for the inconvenience.
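For the second option, the commands would be (the branch name is a placeholder):

# rewrite the last commit without changes, then force-push (branch name is a placeholder)
git commit --amend --no-edit
git push --force-with-lease origin my-branch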

Thank you for maintaining it!

I marked node v2-debian-buster as temporarily offline, since it does not seem to work correctly. For more information, please take a look at issue #2995 (and issue #2994).

Thank you for looking after the infrastructure!

v2 was out of disk space. I executed docker system prune (Total reclaimed space: 58.37GB).

Then I rebooted v2 and made the agent connect again.

I now executed du -h | sort -h to find other files to be removed.
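For the record, limiting the depth makes the biggest consumers easier to spot, e.g. (depth and path are arbitrary):

# show the 20 largest directory subtrees up to two levels deep
du -h --max-depth=2 / 2>/dev/null | sort -h | tail -n 20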

I started v2 again with a new Docker version. Please report broken builds immediately.

I reinstalled docker, purged all configs, and removed /var/lib/docker. Hopefully this fixes it.

v2 will be included again. Please report broken builds immediately.

As suggested here I now executed

ethtool -K enp3s0 sg off # on v2
ethtool -K enp0s25 sg off # on i7
ethtool -K enp37s0 sg off # on a7 (internal network interface)

and I also restarted i7 (there were many docker network interfaces, they are gone now)

docker-ce is now at 5:19.03.1~3-0~debian-buster everywhere

Please report broken builds immediately.

Looks like the master build failed again, because of connection problems to v2-debian-buster (see also issue #2995).

I asked our admin to look at the switch between a7/v2/i7. I deactivated v2 and i7 for now.

I restarted libelektra/master and libelektra-daily.

We changed the ports for all 3 PCs.

Then I removed the jenkins home directory on v2/i7 and restarted the v2/i7 agents.

Looks like there is no more space available on v2-debian-buster:

ApplyLayer exit status 1 stdout: stderr: write /app/kdb/build/src/tools/kdb/CMakeFiles/kdb-objects.dir/gen/template.cpp.o: no space left on device

.

Thank you for reporting, I made (much) more space on v2.

The removal job finished:

Filesystem Size Used Avail Use% Mounted on
/dev/sda3 417G 227G 164G 58% /

The build server is down due to the migration (so that we get a consistent state in the backup of the new build server).

Load on the build server was 200 due to kernel errors during a backup; the server did not react anymore and needed to be reset.

Log messages were (examples):

[87400.120008]  [<ffffffff810be6a8>] ? find_get_page+0x1a/0x5f

[87372.120005]  [<ffffffff81357f52>] ? system_call_fastpath+0x16/0x1b
[87372.120005] Code: f6 87 d1 04 00 00 01 0f 95 c0 c3 50 e8 d7 36 29 00 65 48 8b 3c 25 c0 b4 00 00 e8 d0 ff ff ff 83 f8 01 19 c0 f7 d0 83 e0 fc 5a c3 <48> 8d 4f 1c 8b 57 1c eb 02 89 c2 85 d2 74 16 8d 72 01 89 d0 f0
[87372.120005] Call Trace:
[87372.120005]  [<ffffffff810be6cc>] ? find_get_page+0x3e/0x5f
[87372.120005]  [<ffffffffa016962f>] ? lock_metapage+0xc2/0xc2 [jfs]

[87400.110012] BUG: soft lockup - CPU#0 stuck for 22s! [cp:15356]

Hopefully we can migrate soon, at the beginning of next week (@Mistreated ?)

The server is currently resyncing its RAID, so expect it to be very slow.

Jenkins builds are not performed anymore, see #3035

Jenkins has now started again. Please repeat the Jenkins build jobs.

Looks like v2-debian-buster is offline:

Opening SSH connection to a7.complang.tuwien.ac.at:22221.
Connection refused (Connection refused)

.

Thank you, I contacted our admin but I am afraid he will be out of office already.

Herbert already restarted v2 yesterday. He disabled "simultaneous multithreading".

If a server (v2, i7, a7) crashes again, please also directly contact our admin via "herbert at complang.tuwien.ac.at". Please also report here, to avoid multiple emails.

I think there is something wrong with the Git repository for the master branch on v2-debian-buster:

git fetch --tags --progress https://github.com/ElektraInitiative/libelektra.git +refs/heads/master:refs/remotes/origin/master +refs/heads/*:refs/remotes/origin/* --prune" returned status code 128:
stdout: 
stderr: error: object file .git/objects/9c/0bc3ca6fcbc610abd845aeff5f666938d24117 is empty
error: object file .git/objects/9c/0bc3ca6fcbc610abd845aeff5f666938d24117 is empty
fatal: loose object 9c0bc3ca6fcbc610abd845aeff5f666938d24117 (stored in .git/objects/9c/0bc3ca6fcbc610abd845aeff5f666938d24117) is corrupt
fatal: the remote end hung up unexpectedly

. I already restarted the build three times, but Jenkins always fails with the same error.

Unfortunately v2 uses btrfs, which sometimes seems to corrupt files. We already had a similar problem with a failing docker pull. In the current case, the file 0bc3ca6fcbc610abd845aeff5f666938d24117 seems to be corrupted. When running md5sum on the occurrences of this file, I get:

b9303a311bc8083deb57d9e5c70cde20  ./workspace/libelektra_PR-3038-NAC3HXDHQFTZWU7UCEHHPY5AOGDLHXYBZKKVUYJHDQR3VY4E7S4A@2/.git/objects/9c/0bc3ca6fcbc610abd845aeff5f666938d24117
b9303a311bc8083deb57d9e5c70cde20  ./workspace/libelektra_PR-3038-NAC3HXDHQFTZWU7UCEHHPY5AOGDLHXYBZKKVUYJHDQR3VY4E7S4A@2/libelektra/.git/objects/9c/0bc3ca6fcbc610abd845aeff5f666938d24117
d41d8cd98f00b204e9800998ecf8427e  ./workspace/libelektra_master-Q2SIBK3KE2NBEMJ4WVGJXAXCSCB77DUBUULVLZDKHQEV3WNDXBMA@2/.git/objects/9c/0bc3ca6fcbc610abd845aeff5f666938d24117

I now removed the whole directory for master and restarted the build. See also #3054
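For future occurrences, corrupt or empty loose objects can be detected before deleting a whole workspace (standard git; run it inside the suspicious clone, e.g. one of the Jenkins workspaces listed above):

# verify connectivity and validity of all objects in the clone
git fsck --full
# empty loose objects show up as "error: object file ... is empty"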

As I am now not available for a few days, please contact "herbert at complang.tuwien.ac.at" on issues regarding unreachable a7/i7/v2. @Mistreated will be responsible for everything that is not related to rebooting servers. (Hopefully we will get a new build agent soon.)

Please also always report here, to avoid multiple emails and so that everyone has a good overview of what is going on.

Does the build server currently have a malfunction?

It seems like the main Jenkins build server is unable to connect to i7. I marked the node as temporarily offline.

The build fails in arbitrary cases:
Here it was interrupted without any reason
Here a test case fails which is unrelated to my PR (I just added a design decision without touching any code)

Here it was interrupted without any reason

I got the same interruption code 143 for two PRs and cannot explain them yet. I restarted the build and hope that it works now.

Here a test case fails which is unrelated to my PR

This should be fixed thanks to @sanssecours with #3103. Please rebase to master.

The new Jenkins node hetzner-jenkins1 does not seem to work correctly. I marked the node as temporarily offline.

I upgraded docker on i7 and restarted the machine. I hope this fixes the problem. The agent is online again. Please report problems here (and/or deactivate the agent).

Currently a job of #3065 is running on i7.

@Mistreated can you debug hetzner-jenkins1 please?

Is there a possibility to easily turn off a link check for some time?

This has been happening for the whole day:

doc/tutorials/snippet-sharing-rest-service.md:63:0 http://cppcms.com/wikipp/en/page/apt
doc/tutorials/snippet-sharing-rest-service.md:158:0 http://cppcms.com/wikipp/en/page/cppcms_1x_config
doc/news/2016-12-17_website_release.md:94:0 http://cppcms.com
doc/tutorials/snippet-sharing-rest-service.md:62:0 http://cppcms.com/wikipp/en/page/cppcms_1x_build

Other PRs (most recently #3115, #3113) are affected as well. According to downforeveryoneorjustme the links really are not available.

UPDATE: The website is still offline. I made a PR for this #3117.

Is there a possibility to easily turn off a link check for some time?

You can turn off individual links by adding them to tests/linkchecker.whitelist (as you already found out).
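An excerpt of tests/linkchecker.whitelist could then look like this (assuming the format is one URL per line; the entries are the links that failed above):

http://cppcms.com
http://cppcms.com/wikipp/en/page/apt
http://cppcms.com/wikipp/en/page/cppcms_1x_build
http://cppcms.com/wikipp/en/page/cppcms_1x_config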

I cannot rerun failed builds from Cirrus. See https://github.com/ElektraInitiative/libelektra/pull/3113
https://cirrus-ci.com/build/6562476467945472

The button does nothing. Is there a magic trick to get it working?

edit: either someone changed something or the x-th try finally worked. The build is running again! :)

Looks like the build agent a7-debian-buster is not able to relaunch:

…
[10/28/19 06:02:59] [SSH] Starting slave process: cd "/home/jenkins" && java  -jar slave.jar
<===[JENKINS REMOTING CAPACITY]===>channel started
Remoting version: 3.25
This is a Unix agent
Evacuated stdout

.

edit: either someone changed something or the x-th try finally worked. The build is running again! :)

After I saw your comment I did press the button “Restart Failed Build Jobs” too. As far as I can tell pressing the button did indeed restart the failing build jobs.

After I saw your comment I did press the button “Restart Failed Build Jobs” too. As far as I can tell pressing the button did indeed restart the failing build jobs.

It didn't work for me though, I will provide some gif next time!

I'll restart the build server and its nodes. Build jobs of #3121 and #3099 need to be restarted as they had jobs on dead agents.

It didn't work for me though, I will provide some gif next time!

You do not need to provide a GIF, since I already believe you 😊.

Seems like Jenkins has trouble stopping; I'll wait a bit before I forcibly kill all Java processes.

I also upgraded docker on all agents (on i7 it was already upgraded).

Jenkins is up again with the heartbeat interval as suggested in #2984. All nodes are connected.

Please restart all jobs as needed and report any troubles here.

v2 is down, I asked our admin to restart.

I enabled v2 again, since it seems to be up and our build backlog is adequately sized.

Is there a problem with the build again (hetzner-jenkins1)?

Yes. I disabled the node.

It just hit Disk quota exceeded; I did not want to go overboard with memory. I cleaned it up now. It's up again.

Node updated.

I increased hetzner-jenkins1 to 4 parallel builds. This is only a temporary measure as long as nothing else is running there.

I temporarily decreased the number of executors on hetzner-jenkins1 from 4 to 2, as the testsuite is timing out. I think this happens when too many jobs are compiling while tests are running. We might need to limit the resources available to a single container so that it does not interfere too much with other jobs (see the sketch after this comment).

Feel free to correct it if you think this was the wrong approach.

EDIT: decreased to 1 since tests are still timing out and constantly re-building wastes even more resources.
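Regarding limiting the resources of a single container: docker run supports this directly, so a sketch could be (the image name and command are hypothetical, the flags are standard docker):

# cap one build container so parallel compiles do not starve the test suite
docker run --cpus=2 --memory=4g --memory-swap=4g \
  hypothetical-build-image ./run-tests.sh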

@mpranj Thank you for fixing it!

@Mistreated did you maybe only assign a single CPU or similar? Can you assign more and turn the number of executors higher? The hardware should be stronger than v2.

I disabled i7-debian-buster because there is no disk space left, which leads to all builds failing. If someone has access, please clean something up and re-enable it.

@mpranj thank you for disabling!

Sorry, where I currently am, ssh is blocked (some application firewall; ssh on other ports does not work either). So I cannot give access or do any cleanup now.

As i7 is the weakest of the agents, it might not be a big deal anyway.

@Mistreated did you maybe only assign a single CPU or similar? Can you assign more and turn the number of executors higher? The hardware should be stronger than v2.

I have no idea how strong v2 is.
Currently jenkins1 uses 4 CPUs with 8 GB memory and 16 GB swap. I can increase it easily, I just don't know to which point you want me to increase it.

A note for future hardware decisions: Phoronix seems to do compilation tests in their CPU articles (e.g. the Ryzen 7 3700X / Ryzen 9 3900X test, towards the bottom of the article).

Seems like Hetzner recently added the AMD Ryzen 7 3700X to their AMD-based servers.

I have no idea how strong v2 is.

@ingwinlu wrote about this in his thesis (to be found in the abgaben repo, lukas_winkler).

Currently jenkins1 uses 4 CPUs with 8 GB memory and 16 GB swap. I can increase it easily, I just don't know to which point you want me to increase it.

As long as we do not have anything else running on the server you can allocate all resources. Later we can still scale down (when we move Jenkins).

I updated hetzner-jenkins1.
The error where the frontend runs out of memory is corrected.
Now it runs 2 parallel builds.

Looks like there is no space left on v2-debian-buster:

validation.cpp:69:1: fatal error: error writing to /tmp/cccJFleY.s: No space left on device

.

Thank you, I marked it offline as well, until somebody has access to clean it up.

hetzner-jenkins1 just failed my 3 PRs because the disk quota is exceeded. Here is one output:

Starting pull/hub.libelektra.org/build-elektra-alpine:201911-78555f42df1da5d02d2b9bb9c131790fcd98511c3dea33c6d1ecee06b45fae55/ on hetzner-jenkins1
/home/jenkins/workspace/libelektra_PR-3106-LB35J55FSRLFKFEU2WP6AWVLM3IH4JWI6C5B57NWB6DDARN4JDUA@tmp/ff803792-a127-4b8f-8588-439af982c8a4: Disk quota exceeded

Marked hetzner-jenkins1 as offline because disk quota was exceeded.

I cleaned up i7 and v2 (by removing /home/jenkins/workspace/* and by running docker system prune). Now we have:

  • i7: /dev/mapper/i7--vg-home 199G 152G 37G 81% /home
  • v2: /dev/sda3 417G 255G 147G 64% /

Then I restarted the agents.

@Mistreated can you please fix #3160 so that this does not reoccur so fast. Please also fix hetzner-jenkins1. There are lots of resources on this machine; it is really not necessary that it hits a resource limit every day.

I don't know if there is a nice way to resize the disk down. That's why I'm not giving the node everything at once. hetzner is up again.

v2 was again out of space, I cleaned up: /dev/sda3 417G 315G 102G 76% /

@Mistreated https://build.libelektra.org/jenkins/computer/hetzner-jenkins1/log does not start up

I think the build system is currently completely stuck, PRs are not being built.

Thank you for reporting, I will restart Jenkins and cleanup some files as the disc is full.

@Mistreated https://build.libelektra.org/jenkins/computer/hetzner-jenkins1/log does not start up

Agent successfully connected and online

I assume everything is fine now with hetzner-jenkins1?

Thank you so much! Seems like Jenkins is still not reacting to builds. v2 and i7 now both fail with: java.io.IOException: Could not copy slave.jar into '/home/jenkins' on slave.

Jenkins is up again, please restart the jobs.

java.io.IOException: Could not copy slave.jar into '/home/jenkins' on slave.

fixed (was also out of space)

@Mistreated please fix the jenkins-daily job, as it does the cleanup tasks we now always need to do manually!

@Mistreated "jenkins build libelektra please" still is broken, is this related to the changes of the webhooks?

Maybe try today to change to the new Jenkins, but if you are not able to do it, please make the old instance work again!

I triggered a repository scan; now "jenkins build libelektra please" seems to work again.

Unfortunately. I marked v2 as offline until it is resolved.

Thank you for reporting!

@mpranj I gave you access, can you try to cleanup please?

Thank you! It seems to me that there is an abundance of resources wasted by old docker images lying around. Additionally it seems that btrfs + docker are buggy. Docker creates btrfs subvolumes for each container and does not clean them up properly afterwards. The docker system prune -f command does not free up the space either.
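For reference, a rough manual cleanup for those leftover subvolumes could look like this (a destructive sketch; it assumes docker's data root is /var/lib/docker with the btrfs storage driver, and that docker is stopped first):

systemctl stop docker
# every leftover container layer is a btrfs subvolume under this directory
for sv in /var/lib/docker/btrfs/subvolumes/*; do
  btrfs subvolume delete "$sv"
done
systemctl start docker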

I took v2 and a7 down for maintenance to free the resources and balance the btrfs.

docker login failed

Builds can't pull docker images. Something going on with docker hub?

Yes, sorry, with a7 I also took down the docker hub. I will post a message here when it's up again.

a7 including the docker hub is up again. I left v2 offline because it cannot log in to the hub to pull images?!? I don't know what's wrong there, I did not change any credentials or so, and the other nodes can log in. Any ideas?

Btrfs is still balancing in the background, a7 may be slower for another hour or so.

@mpranj thank you for fixing this! Which commands were needed for btrfs re-balancing?

Unfortunately, I do not know the credentials, I hope @ingwinlu can help us out.

The commands I used to re-balance were:

Fix a maybe-bug with btrfs:

btrfs balance start -dusage=0 -musage=0 /mountpoint

Really re-balance the fs; this takes a long, long time. The usage parameter can/should be tuned; this is what worked today:

btrfs balance start -dusage=80 /

The credentials can be changed easily, but we'd have to log in again with the new credentials on all jenkins agents which connect to the docker hub.

The bigger problem was that some docker containers were still running and docker system prune didn't do much. Therefore I took the agents down and freed everything up while they were down. There were TONS of containers just lying around.
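With the agent disconnected, a minimal force-cleanup can be done with standard docker commands (a sketch; --volumes is deliberately aggressive, so only run it when nothing on the host must survive):

# force-remove every container, running or not
docker ps -aq | xargs -r docker rm -f
# then drop all unused images, networks, build caches and volumes
docker system prune -a -f --volumes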

Yes, unfortunately the containers are quickly recreated. I hope @Mistreated can fix the libelektra-daily job soon (it executes docker system prune).

I also did some quite involved digital forensics and stole the hub credentials. :laughing:

v2 is up and running again.

Thank you so much! :100: Please send us the credentials.

Sent. Btw I think a7 is probably slow only because of poor disk speeds, but it's good that it has enough space for the docker hub. Seems like much of the time the CPU is doing nothing there.

Another thought: maybe we can additionally run the critical cleanup jobs via cronjob to avoid situations like the one we had now.

Please send us the credentials.

@mpranj I think I was in this group of "us". I wasn't in some CC or something like that?

@Mistreated sorry, I sent it to markus and didn't have your email. On a7 you will find CREDENTIALS.txt in your homedir.

I need the hetzner-jenkins1 node to test the new jenkins-server. I'm gonna turn it off on the old server until tomorrow morning.

You can easily create a second hetzner-jenkins2 for tests. If it is only for this night, it should be okay though.

Another thought: maybe we can additionally run the critical cleanup jobs via cronjob to avoid situations like the one we had now.

libelektra-daily does these cleanup jobs but it fails now: #3160. If you have ideas to improve this job, please tell us.

I think I was in this group of "us". I wasn't in some CC or something like that?

Yes, sorry, I forgot to tell mpranj that "us" refers to you.

I hope that it's ok that I keep hetzner-jenkins1 for a little while; all builds are good now. I think I can make the server fully running tonight.

v2 is unreachable, I contacted the admin.

I hope that it's ok that I keep hetzner-jenkins1 for a little while; all builds are good now. I think I can make the server fully running tonight.

This would be great!

v2 is unreachable, I contacted the admin.

Thank you but I am afraid he will not respond before Monday.

v2 gets a new kernel (it just crashed).

i7 will also be restarted.

All 3 servers (v2, a7, i7) now have "Linux v2 5.2.0-0.bpo.2-amd64 #1 SMP Debian 5.2.9-2~bpo10+1 (2019-08-25) x86_64 GNU/Linux"

They are up and online, please restart jobs if needed.

Just a note:
I scanned the repo again with the new server. This could cause some errors on the old one.

Seems like master [1] was also built on the new server. It was not successful. When clicking on the status, a login page appears [2]. Please reconfigure Jenkins so that everything can be viewed without being logged in.

Hopefully we can switch to the new Jenkins soon. Seeing errors from two different Jenkins instances does not make the situation easier :wink:

[1] https://github.com/ElektraInitiative/libelektra/commits/master#
[2] http://95.217.75.163:8080/login?from=%2Fjob%2Flibelektra%2Fjob%2Fmaster%2F1%2Fdisplay%2Fredirect

@markus2330 can a7 / v2 be rebooted remotely after an upgrade or are there some pitfalls?

Usually it works, but if it is not urgent it is better to wait until the admin is there. I can reboot on Tuesday, if this is okay for you?

Thank you! Nothing urgent, just a general question. It came up because Debian 10.2 was just released. I'll wait a bit with the upgrades.

You can do the upgrade nevertheless (only without reboot). Then, in the case of a crash, we will already have the 10.2 kernel when the admin presses the reset button :wink:

@mpranj can you maybe add a cronjob that purges the old snapshots? Or is this not possible without stopping docker?

https://docs.docker.com/storage/storagedriver/btrfs-driver/ recommends also rebalancing btrfs in a cronjob.

can you maybe add a cronjob that purges the old snapshots? Or is this not possible without stopping docker?

I can add a cronjob without stopping all docker containers. This might not clean up everything, but we can try it. Like I said, sometimes containers keep running forever/until the machine crashes. The complete cleanup requires us to temporarily disable the build agent, though; then we can force-stop all containers.

I can also add the rebalance as a cronjob.
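A sketch of what such a crontab could look like (the schedules and the usage threshold are assumptions; the commands mirror the ones used earlier in this thread):

# /etc/cron.d/docker-cleanup (hypothetical)
# nightly: remove unused docker data (add -a to also drop unused images)
30 3 * * * root docker system prune -f
# weekly: re-balance the btrfs filesystem
15 4 * * 0 root btrfs balance start -dusage=80 /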

Thank you, let us try it.

Master is out of memory. I wanted to run Scan Repository on the old Jenkins because a7 and i7 get the following error when pulling docker images:

docker login failed

I got v2 and hetzner-jenkins1 running on the new server now.

Master is out of memory.

Thanks for reporting. I removed some old coverage data and enabled the master node again. For everyone with open pull requests: Please restart your Jenkins builds with jenkins build libelektra please. Sorry for the inconvenience.

In #3234 @raphi011 suggested:

imo this is really urgent, the flakiness and slowness of the tests make it hard if not impossible to do any changes if you have to wait this long to verify them.

I agree it is really urgent but @Mistreated already does what he can.

So maybe we can use the build server more sparingly and only build if we really think the PR is to be merged soon. Unnecessary builds should be canceled.

Or what about (temporarily) stopping the automatic building of PRs on pushing of changes (so the build starts only with jenkins build libelektra please)? @Mistreated do you know how to reconfigure Jenkins to do so (I did not find the option)?

I also have the feeling that jenkins build libelektra please does not work at the moment, at least it didn't work for this build: https://github.com/ElektraInitiative/libelektra/pull/3073 I had to push an empty commit to start the pipeline.

Cleanup cronjob implemented and backport kernel upgraded on a7 and v2. There's quite a changelog for the kernel, it will be active on next reboot.

Thank you very much! Is the "old" backport kernel still installed so that we have a fallback if it doesn't boot?

Yes, it can be removed after a successful boot into the new kernel.

Starting pull/hub.libelektra.org/build-elektra-alpine:201911-78555f42df1da5d02d2b9bb9c131790fcd98511c3dea33c6d1ecee06b45fae55/ on i7-debian-buster

docker login failed

https://build.libelektra.org/jenkins/blue/organizations/jenkins/libelektra/detail/PR-3244/1/pipeline

I disabled i7 for a manual cleanup, kernel and docker upgrade. Somebody enabled i7 while I was working on it. Everything is up and running again.

@Piankero I restarted your build now.

I also have the feeling that jenkins build libelektra please does not work at the moment, at least it didn't work for this build: #3073 I had to push an empty commit to start the pipeline.

works now

@Mistreated do you know how to reconfigure Jenkins to do so (I did not find the option)?

I added the following to Jenkins Configuration:

Suppress automatic SCM triggering

Note to everyone: Use of "jenkins build libelektra please" is now mandatory, build jobs do not start by simply pushing. We will inform here, when we revert this setting.

@Mistreated thank you! Let us see if this is enough. I hope pushes to master still trigger the master builds?

Having the hetzner node would be very good, nevertheless. Are there any problems if the node is used by two build servers at the same time? Or if it is a problem: isn't it very easy to simply clone the CT?

@Mistreated thank you! Let us see if this is enough. I hope pushes to master still trigger the master builds?

master branch is now an exception to the following rule:

Suppress automatic SCM triggering

As for

Having the hetzner node would be very good, nevertheless. Are there any problems if the node is used by two build servers at the same time? Or if it is a problem: isn't it very easy to simply clone the CT?

I added a new CT (hetzner-jenkinsNode3).

could not clone the repo: https://build.libelektra.org/jenkins/blue/organizations/jenkins/libelektra/detail/PR-3073/8/pipeline/634/

maybe this has something to do with the new node? (wild guess)

maybe this has something to do with the new node? (wild guess)

this error is on a7

this error is on a7

soooo .. retry?

soooo .. retry?

yeah, I don't think it will happen again.

I will rerun it for you.

@Mistreated I think we can start automatic builds again. But please look at #3160 first.

any idea why this would fail?

go: github.com/google/uuid@v1.1.1: Get https://proxy.golang.org/github.com/google/uuid/@v/v1.1.1.mod: net/http: TLS handshake timeout

https://build.libelektra.org/jenkins/blue/organizations/jenkins/libelektra/detail/PR-2827/8/pipeline/648

any idea why this would fail?

Sounds like it was a temporary problem, the URL is currently accessible from the build agents.

In the long run it would be great to set up those kinds of dependencies in the docker images, to avoid downloading them for each build repeatedly. That should also prevent build failures due to temporarily unavailable packages like the one you mentioned above.

Well then.. jenkins build libelektra please the third

Yes, we have all dependencies directly in the docker images exactly for this reason. I created #3251

I took v2 and a7 offline for rebooting.

@markus2330 if you get a chance, enable hyperthreading on a7.

v2 is up again, on a7 there is still a buildjob.

I took the nodes and added them to the new server. I am gonna let it run overnight. Tomorrow I will return the nodes if there are further errors on the new Jenkins server.

hetzner-jenkinsNode3 will still run on the old Jenkins.

Tomorrow I will return the nodes if there are further errors on the new Jenkins server.

Small build errors are not a reason to switch back. At some point we need to fix the errors; the going back and forth is very time consuming.

What might be a show-stopper, however, is that the new server is not reachable (neither http://95.217.75.163:8066 nor ssh). I pressed the power button, let us see if the machine restarts. We should investigate what the problem was, though.

http://95.217.75.163:8066

If there is time, please enable TLS using letsencrypt, so we don't leak credentials and expose ourselves to various other problems.

Thank you for the input! I would suggest we do this immediately when we switch build.libelektra.org. Otherwise we would duplicate efforts.

Is this error known? Caught the following exception: null

Looks like the formatting check failed and the other builds were terminated.

you're right. what a crappy error message though :P

@Mistreated can you please activate automatic building of PRs again? Due to the many agents the server is now sleeping most of the time.

@Mistreated can you please activate automatic building of PRs again? Due to the many agents the server is now sleeping most of the time.

Done.

I am gonna borrow hetzner-jenkins1 and v2 again for the new server.

You do not need to give them back; I hope we can do the switch today.

Tip: when doing these kinds of switches, it's good to decrease the TTL of the DNS entry to something unusually low (e.g. 60 instead of the current 21599 for build.libelektra.org). After the change has propagated, it lets us switch the DNS entry within a minute instead of hours. If it's too late, it can help to clear the DNS caches of Google and OpenDNS, but some people will inevitably see the old resource until the cached entries expire globally.

EDIT: after the change the TTL should obviously be reverted to some sane value to put less load on the DNS.
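In BIND zone-file syntax the switch could look like this sketch (the addresses are placeholders; only the TTL handling is the point):

; before the migration: lower the TTL so the change propagates quickly
build.libelektra.org.  60     IN  A  198.51.100.1  ; placeholder (old server)
; after the migration has settled: raise the TTL again
build.libelektra.org.  21600  IN  A  198.51.100.2  ; placeholder (new server)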

Even though it is now maybe too late, I switched to $TTL 3600 (in case we need several changes until everything works).

www-new and build-new already exist, pointing to the new server.

I now switched doc.libelektra.org. @Mistreated will fix the publishing. I will look into www-new.libelektra.org

https://build-new.libelektra.org/ and https://www-new.libelektra.org/home should work now.

I'll change all DNS entries now.

All DNS entries are changed.

Unfortunately, certbot fails as it seems to speak with the old server, but this seems to only affect download and community (lesser-used URLs).

So hopefully, during/after the weekend everyone sees the updated DNS names.

@Mistreated please update the publishing of all artifacts: also for the website. Please create a PR to make sure everything is working properly.

The old build server is now shut down.

I need to restart the new server (new kernel and network bridge added).

Server is up again with Linux pve 5.0.21-5-pve.

I scheduled a rescan of all PRs.

server offline due to misconfiguration/bug in pve (/etc/network/interfaces was deleted by GUI?).

The bug was that the renaming of network devices (caused by my action in the GUI) led to a kernel OOPS:

Nov 23 21:32:08 pve kernel: [ 1682.138250] veth4d0199f: renamed from eth0
Nov 23 21:32:19 pve kernel: [ 1693.378374]  __x64_sys_newlstat+0x16/0x20
Nov 23 21:32:19 pve kernel: [ 1693.378380] Code: Bad RIP value.
Nov 23 21:32:19 pve kernel: [ 1693.378382] RDX: 00007fa58b238e20 RSI: 00007fa58b238e20 RDI: 00007fa58ba50d24
Nov 23 21:32:19 pve kernel: [ 1693.378383] R13: 0000000000000294 R14: 00007fa58ba50cc8 R15: 00007ffe65c2b158
Nov 23 21:34:20 pve kernel: [ 1814.210370]  request_wait_answer+0x133/0x210
Nov 23 21:34:20 pve kernel: [ 1814.210374]  fuse_simple_request+0xdd/0x1a0
Nov 23 21:34:20 pve kernel: [ 1814.210378]  ? fuse_permission+0xcf/0x150
Nov 23 21:34:20 pve kernel: [ 1814.210381]  path_lookupat.isra.47+0x6d/0x220
Nov 23 21:34:20 pve kernel: [ 1814.210385]  ? strncpy_from_user+0x57/0x1c0
Nov 23 21:34:20 pve kernel: [ 1814.210388]  __do_sys_newlstat+0x3d/0x70
Nov 23 21:34:20 pve kernel: [ 1814.210392]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Server should be up and running again.

Previous problems remain, though (the build-trigger phrase is not working, see #3268)

@Mistreated master also does not seem to build automatically anymore; I am triggering it manually now.

I am collecting urgent errors in #3268. It would be good if you could also test whether everything works as described in doc/BUILDSERVER.md.

Thank you for reporting!

@Mistreated could you please install Naginator+Plugin #2967 as already discussed many times? (Please make a snapshot before changes to Jenkins.)

hetzner-jenkins1: disk quota exceeded

@Mistreated could you please install Naginator+Plugin #2967 as already discussed many times? (Please make a snapshot before changes to Jenkins.)

Gonna do it today

hetzner-jenkins1: disk quota exceeded

@mpranj I added the new VM as a build agent while hetzner-jenkins1 is down.

I cleaned up some space on hetzner-jenkins1 by running docker system prune -a and enabled it again.

Seems like there is again a problem that lots of stuff is not cleaned up by docker system prune -f. This time the storage driver was not btrfs but vfs. :confused:

I added the new VM as a build agent while hetzner-jenkins1 is down.

The idea now is that we do not use the container anymore but only the VM instead.

I cleaned up some space on hetzner-jenkins1 by running docker system prune -a and enabled it again.

Thank you very much! Can you also make a cronjob there? (on the VM, not on the container).

make a cronjob there?

Done.

@Mistreated could you please install Naginator+Plugin #2967 as already discussed many times? (Please make a snapshot before changes to Jenkins.)

We have to move from pipeline to freestyle jobs if we want the naginator plugin. I am gonna look for alternatives.

Builds on the VM jenkinsNode3VM currently fail. They are unable to docker pull:

unexpected EOF
script returned exit code 1

I disabled it for now until someone can fix the problem.

[cronjob] Done.

Thank you!

We have to move from pipeline to freestyle jobs if we want the naginator plugin. I am gonna look for alternatives.

Yes, good idea. Maybe it is best to simply code that in our Jenkinsfile, so that if a problematic build job/stage fails, it is retried, and so that these docker pulls are tried at least twice, as they are one of the most frequent problems.
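Until that lands in the Jenkinsfile, the retry idea in shell form might look like this (a hypothetical wrapper, not the actual pipeline code):

#!/bin/sh
# retry a flaky docker pull up to three times before giving up
image="$1"   # e.g. hub.libelektra.org/build-elektra-alpine:<tag>
for attempt in 1 2 3; do
  docker pull "$image" && exit 0
  echo "docker pull failed (attempt $attempt), retrying in 30s..." >&2
  sleep 30
done
exit 1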

Builds on the VM jenkinsNode3VM currently fail. They are unable to docker pull:

@Mistreated please fix this.

Builds on the VM jenkinsNode3VM currently fail. They are unable to docker pull

I fixed the docker image that could not be pulled.

Thank you very much! It is always helpful if you write what was wrong and how you fixed it.

I have no idea what was wrong; I built the image manually on the agent, since the agent could not pull it.

As for the Dockerfile (scripts/docker/debian/stretch): my Visual Studio Code says it has 2 empty lines at the end, but vim says it's only one. I don't know whether it has something to do with the mistake above; maybe it's just my VS Code.

Seems like we have problems with our docker registry (#3316 docker pull fails with unexpected EOF).

Since the dust after the release has settled and there are no builds going on, I would suggest stopping everything and trying to clear the registry completely. After that, all images should be re-built, hopefully cleanly. I would back up the registry data before I start, just to make sure, but I hope that a clean start gets rid of some errors we were having.

I'll wait for comments on whether there's anything against that before I start.

I think it's a problem in the

(scripts/docker/debian/stretch)

image, since it is the only one failing.

I've built it manually again, but there certainly is something wrong with the image in the registry.

Jenkins reports: jenkinsNode3VM (offline)

@Mistreated it would be great if you could set up some way of monitoring.

a7 (and therefore v2 and i7) are down. I contacted the admin.

EDIT markus2330: up again

Because of a planned power outage on 08.07.2020 at TU Wien, our admin plans to shut down all build servers on the day before (Tuesday 7.7). Builds will be very slow during that time, so please only push on that day in urgent cases.

Servers are online again, except for i7. I notified the admin.

There was another power outage (unplanned) yesterday in the evening, thus all the build servers are down now. The admin is working on it.

EDIT 30 min later: everything, including i7, is up again :rocket:

@markus2330 it seems that v2 and i7 have lost internet access recently (possibly during the power outage?). Are you aware of any config changes we should have made, since the interfaces are configured statically?

I do not know about any change, only that the computers were turned on again after these two power outages (one planned, the other unplanned).

But you are right, I also see that these two (but not a7) do not have an Internet connection anymore, although they are reachable. I asked our admin about that.

Maybe we disconnect them from Jenkins until this problem is fixed?

Thanks for contacting the admin! I disconnected both i7 and v2 until the problem can be resolved. (Builds don't work anyway because they cannot pull docker images.)

Someone changed something with the router about one week ago. The person was notified and hopefully will fix it soon.

Let's keep them disconnected for now.

The Internet problem is resolved now and I also installed the security updates on these machines.

@mpranj can you please switch them on again?

Thank you @markus2330.

All nodes are now back online.

I rebooted the build server for the latest PVE kernel. Jenkins should be up shortly.

I moved

  • [ ] make sending emails if build fail more reliable
  • [ ] Docker image without Jenkins user
  • [ ] centOS/fedora/arch docker images
  • [ ] centOS packages
  • [ ] freebsd/openbsd/solaris build agents

to #3519 and linked to #3519 above.

@robaerd now also has access to a7/v2/i7 and can contact the admin in the case of troubles.

Just a quick report of build times (main pipeline libelektra):

  • with a7 enabled: 2h 29m 24s
  • with a7 disabled: 1h 35m 45s

Why does Jenkins display that it is going to be shut down? It would be good to always read about it here in advance :wink:

Jenkins is going to be shut down because of a full system backup and a reformat of the filesystems from btrfs to ext4.

Jenkins is up again

Jenkins CI will be offline for maintenance from around 11:15 CET today.

We will perform some backup and cleanup tasks and try to improve performance of a7.

You will be notified again when the maintenance is over.

Jenkins CI and all build servers are up again. a7 should perform a lot better now, but has less storage capacity.

Please report if you experience any errors.

Jenkins CI and the build agents will be offline for a short maintenance/updates. Should be done in a matter of minutes.

EDIT: updates are done.

The server is down, I investigate.

The server is up again.

Official statement about the cause: "There was an issue with the PSU in the adjacent server which have caused the server to shut down; it has now been corrected."

The ssd in a7 is full, causing all builds on it to fail.

I will try to free up some space. Is it safe to disconnect the a7 build agent from Jenkins for now?

Thank you for looking into it!

I will try to free up some space. Is is safe to disconnect the a7 build agent from jenkins for now?

Of course. On the contrary, it would be unsafe to keep it connected if it makes all builds fail.

Running docker system prune -a cleaned up roughly 50% of the space again. Maybe we need to adapt the existing cronjob to add the -a flag?

The jenkins home is using lots of space as well.

The master build (full pipeline with deb packages and website building) is somehow still failing, even though everything looks green. Any ideas?

The upload of the focal deb packages to community fails on the file elektra_0.9.3.orig.tar.gz. It's probably a permission issue on the file. I will remove it from the directory for now and let it be recreated in the next run.

Somehow the sshPublisher does not set the stage to red when it fails.

Maybe we need to adapt the existing cronjob to add the -a flag?

Was there any reason why you didn't do it like this? If not, it sounds like a good idea.

Jenkins CI and the registry on a7 will be offline for the migration of the watchtower image to a new version. v2 will also be cleaned since the disk is full. Should only take a couple of minutes.

EDIT: Updates are done and Jenkins CI is up again

Jenkins and the agents will be down briefly for an update. Everything should be up again in a couple of minutes and this post will be updated/edited.

EDIT: everything was updated and is up&running again. I needed to correct a problem where a7 was using Debian stretch docker packages instead of buster. I also cleaned up some space.

Builds are failing as there is no space on a7.

Build infrastructure will be unavailable for a few minutes for maintenance. Everything should be up&running again within 10 minutes.

Build infrastructure is available again.
