Libseccomp: BUG: gen_bpf_generate() doesn't properly handle failure

Created on 28 May 2020  ·  15 Comments  ·  Source: seccomp/libseccomp

Hi,

First off, thanks for libseccomp -- we've been happily using it in production for several years now, and haven't run into any issues (until now). I'm not sure if this is a bug in our code, a misunderstanding of the documentation, or something else -- but I've spent the past month trying to track this down to no avail.

We recently upgraded packages in our Docker containers, which included an upgrade from libseccomp 2.3.3 (version in Debian stable repos) to 2.4.3. There were other system packages that also got upgraded, but I did not record them. Our kernel was not upgraded, and is version 4.19.0-8-amd64.

We use SCMP_ACT_TRACE, and build a filter consisting only of SCMP_ACT_ALLOW rules that are added using native syscall numbers, rather than libseccomp's pseudo-numbers. We fork off a 64-bit helper process that builds and loads the seccomp filter before exec-ing another 64-bit binary.

For reference, this is the entirety of our seccomp initialization routine, using similar error-checking to the seccomp_rule_add man page.

However, our call to seccomp_load has started returning -EINVAL, roughly once per 100,000 process initializations. (Not being able to reliably reproduce it has made this tedious to debug.) There were no code changes to our application during this time. The syscalls added to the filter are identical across all runs.

Any ideas on what could be going wrong (or even how to dig further into what's going wrong), or if this is expected in some way? There aren't a whole lot of dynamic moving parts, and I couldn't find anything in the documentation about why this might be happening.

bug  ·  priority/medium

All 15 comments

Hi @Xyene,

There aren't a lot of places that return -EINVAL in the seccomp_load() code path. Based on a quick examination of the libseccomp v2.4.3 code it looks like it is either due to an invalid scmp_filter_ctx or the kernel complaining about the prctl(...) call which loads the filter.

Considering that v2.4.3 generally works, and you haven't changed your kernel, it seems doubtful that the prctl(...) call is the cause, which leaves us with an invalid filter context. Have you noticed any other odd behavior in your program since the upgrade? I'm wondering if there is a memory corruption problem elsewhere that is causing the problem.

While it's always possible the fault is with libseccomp, we do run each release through a series of checks which include valgrind runs for all of our regression tests as well as static analysis using both clang and Coverity.

Also, while this doesn't help for v2.4.3, one of the improvements we are targeting for the almost ready v2.5.0 release is improved documentation and handling of error codes.

> We recently upgraded packages in our Docker containers, which included an upgrade from libseccomp 2.3.3 (version in Debian stable repos) to 2.4.3. There were other system packages that also got upgraded, but I did not record them. Our kernel was not upgraded, and is version 4.19.0-8-amd64.

Thank you for verifying that your code and the underlying kernel haven't changed. That should help to track down the problem.

> For reference, this is the entirety of our seccomp initialization routine, using similar error-checking to the seccomp_rule_add man page.

Your filter looks reasonable to me.

> Any ideas on what could be going wrong (or even how to dig further into what's going wrong), or if this is expected in some way? There aren't a whole lot of dynamic moving parts, and I couldn't find anything in the documentation about why this might be happening.

I looked through the v2.4.3 seccomp_load() code, and I think there are only two places where libseccomp generates a return code of -EINVAL:

Both of the above errors are effectively caused by an invalid filter. That seems unlikely to me based upon your filter code.

It's worth noting that the kernel's default return value in seccomp_set_mode_filter() is -EINVAL, so it's possible that something else on the system changed, leading us to fall down that path. You mention that you're running in Docker; are you disabling the default Docker seccomp filter?

I would be tempted to add some more debugging to your code inside the if after seccomp_load() fails. For example, we could output the PFC and/or the BPF of the filter itself to verify it looks reasonable. See seccomp_export_pfc() and seccomp_export_bpf().

> I looked through the v2.4.3 seccomp_load() code, and I think there are only two places where libseccomp generates a return code of -EINVAL:

Keep in mind that any failures found in gen_bpf_generate(...), or below, are effectively combined into -ENOMEM by sys_filter_load(...) on src/system.c:267.

I hate falling back to "memory corruption!" so quickly, but it looks like that may be the case here.

Thanks for the quick and detailed replies! They've generated several avenues of exploration :slightly_smiling_face:

> Have you noticed any other odd behavior in your program since the upgrade? I'm wondering if there is a memory corruption problem elsewhere that is causing the problem.

No, just this. Our unit and integration tests continue to pass, and apart from this very rare EINVAL, no errors are being logged in prod. This certainly makes it puzzling; I also suspected memory corruption, but haven't been able to find any evidence to support it :slightly_frowning_face:

For a bit more context:

  • the program is a Python app calling into some C++ via Cython (the GIL is held during this time, so Python isn't making allocs on other threads)
  • the C++ side forks, and in the child sets up the seccomp filter prior to exec
  • all the memory allocs that happen post-fork, pre-exec are from libseccomp itself, in seccomp_init etc.
  • there are precisely 4 array mallocs between calling into C++ code and forking off, they're all in Cython and have appropriate ranges when writing to them (line 435 calls the previously-linked seccomp code) -- all other allocations/writes/etc. are within the Python interpreter

While typing this up I had an idea: I have heard horror stories about malloc being unsafe to use after forking, and we do have some within libseccomp itself. The Python app itself _is_ multithreaded, but we always hold the GIL while in native code so this should be safe(?). I've only heard about deadlocks happening through malloc-after-fork, though. (I guess this makes the next order of business moving seccomp_init et al. before the fork, only calling seccomp_load post-fork, and seeing if the errors keep happening.)

> I would be tempted to add some more debugging to your code inside the if after seccomp_load() fails.

Thanks for the suggestion! I've added a call to seccomp_export_pfc, as well as dumping the contents of the input to the filter (config->syscall_whitelist). I'll follow up the next time this fails.

Hi @Xyene - since it's been about a week, I just wanted to check and see if there was anything new that you've found?

Not yet, unfortunately. After adding in a patch to seccomp_export_pfc, it's been silent. Yesterday I pushed that patch to all our VMs (rather than just a test one) in hopes of capturing the issue when it does eventually occur.

I do find the silence odd, but for now I'm chalking it up to coincidence since all the debugging/exporting logic happens _after_ failing seccomp_load, so it shouldn't be affecting the failure itself.

Progress!

Turns out the reason it's been silent is that seccomp_export_bpf was segfaulting (should it, if called after seccomp_load?), and that was being reported elsewhere and not where I was looking for seccomp failures. More importantly, I've run into a case where I can reliably reproduce the issue in ~150 invocations, so with some plumbing work I should be able to extract some core dumps.

Alright, I pulled a coredump out, and this was the trace: https://gist.github.com/Xyene/920f1cb098784a031f53c66a2f49d167

This was a bit suspect, since it's crashing inside jemalloc's realloc routine. Furthermore, using glibc malloc instead resolves the issue (unfortunately, it's not a long-term option in this case due to fragmentation issues).

Next, I pulled in jemalloc, compiled it with -O0 and debugging symbols, and reran the reproduction. This time it crashed in seccomp_load, rather than after! I've uploaded that trace here: https://gist.github.com/Xyene/5da56168bcea337da85b2cd30704d12e

A snippet of that trace:

#9  0x00007ff962698495 in free (ptr=0x5a5a5a5a5a5a5a5a) at src/jemalloc.c:2867
No locals.
#10 0x00007ff96062d087 in _program_free (prg=prg@entry=0x7ff95e963010) at gen_bpf.c:511
No locals.
#11 0x00007ff96062f605 in gen_bpf_release (program=program@entry=0x7ff95e963010) at gen_bpf.c:1986
No locals.
#12 0x00007ff96062c04f in sys_filter_load (col=col@entry=0x7ff95e9a5000) at system.c:293
        rc = -1
        prgm = 0x7ff95e963010
#13 0x00007ff96062b666 in seccomp_load (ctx=ctx@entry=0x7ff95e9a5000) at api.c:286
        col = 0x7ff95e9a5000

Searching jemalloc, it looks like 0x5a is the byte it uses to fill freed memory, with the specific intent of crashing code that tries to use or free something that has already been freed.

gen_bpf.c:511 in v2.4.3 is: https://github.com/seccomp/libseccomp/blob/1dde9d94e0848e12da20602ca38032b91d521427/src/gen_bpf.c#L505-L513

But, this doesn't make much sense since the program's lifetime is just the body of sys_filter_load:

https://github.com/seccomp/libseccomp/blob/1dde9d94e0848e12da20602ca38032b91d521427/src/system.c#L260-L296

I think I've spotted at least one issue. In gen_bpf_generate:

https://github.com/seccomp/libseccomp/blob/1dde9d94e0848e12da20602ca38032b91d521427/src/gen_bpf.c#L1963-L1966

state.bpf = prgm so long as zmalloc didn't fail. Next, _gen_bpf_build_bpf is called, and state.bpf is reset to NULL only if its rc is 0.

https://github.com/seccomp/libseccomp/blob/1dde9d94e0848e12da20602ca38032b91d521427/src/gen_bpf.c#L1968-L1971

Considering the case where rc != 0, state.bpf is still set to prgm at the time of the call to _state_release. This will cause the memory pointed to by prgm to be freed.

https://github.com/seccomp/libseccomp/blob/1dde9d94e0848e12da20602ca38032b91d521427/src/gen_bpf.c#L539

Next, gen_bpf_generate will return prgm, which despite having been freed, is still a nonzero pointer.

https://github.com/seccomp/libseccomp/blob/1dde9d94e0848e12da20602ca38032b91d521427/src/gen_bpf.c#L1971-L1974

Back in sys_filter_load, gen_bpf_generate returns, and prgm is non-NULL so it continues.

https://github.com/seccomp/libseccomp/blob/1dde9d94e0848e12da20602ca38032b91d521427/src/system.c#L265-L267

Finally, at the end of sys_filter_load, gen_bpf_release is called on the already-freed prgm.

https://github.com/seccomp/libseccomp/blob/1dde9d94e0848e12da20602ca38032b91d521427/src/system.c#L292-L295

This doesn't address the concern of why _gen_bpf_build_bpf would fail in the first place, but does seem like something bad that could happen if it did.

Edit: actually, this looks like it probably was fixed as a side-effect of https://github.com/seccomp/libseccomp/commit/3a1d1c977065f204b96293cccfe7d3e5aa0d7ace.
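The ownership hand-off described above can be modeled in a few lines of C. This is a simplified sketch, not libseccomp's source: the struct layouts, the `release_count` counter, the `build()` placeholder, and the `fail` parameter are all invented for illustration. It shows the shape of a fix in which the failure path returns NULL, so the caller never releases memory that the state cleanup has already freed.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for libseccomp's internal types (not the real layouts). */
struct program {
    int *blks;
};

struct state {
    struct program *bpf;
};

/* Illustration only: counts releases so an extra free would show up in the count. */
int release_count = 0;

void program_free(struct program *prg)
{
    if (prg == NULL)
        return;
    free(prg->blks);
    free(prg);
    release_count++;
}

/* Frees whatever the state still owns, mirroring what the thread says
 * _state_release does with state.bpf. */
void state_release(struct state *st)
{
    program_free(st->bpf);
    st->bpf = NULL;
}

/* Placeholder for the build step; `fail` simulates an allocation failure. */
int build(struct state *st, int fail)
{
    (void)st;
    return fail ? -1 : 0;
}

struct program *generate(int fail)
{
    struct state st = { NULL };
    struct program *prgm = calloc(1, sizeof(*prgm));

    if (prgm == NULL)
        return NULL;
    st.bpf = prgm;

    if (build(&st, fail) == 0)
        st.bpf = NULL;  /* success: ownership passes to the caller */
    else
        prgm = NULL;    /* failure: state_release() frees it, so return NULL */

    state_release(&st);
    return prgm;
}
```

Dropping the `prgm = NULL` assignment reproduces the pattern analyzed above: the caller receives a dangling non-NULL pointer and later frees it a second time.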

> Considering the case where rc != 0, state.bpf is still set to prgm at the time of the call to _state_release. This will cause the memory pointed to by prgm to be freed.

Ah ha! Good catch @Xyene!

I think we need to fix this beyond 3a1d1c977065f204b96293cccfe7d3e5aa0d7ace, let me think on this for a minute ... I don't think the fix will be too hard ... and see if I can come up with a PR.

> I think we need to fix this beyond 3a1d1c9, let me think on this for a minute ... I don't think the fix will be too hard ... and see if I can come up with a PR.

Ooops, I was looking at old code when I wrote that; yes, I believe that 3a1d1c9 does fix this for us, but we will need a patch for the release-2.4 branch. I'll work on that now.

_(Meta: I'm going to continue updating this message with my findings as I go along, so I have somewhere to write them down without email spamming you guys :)_

Alright, back on 2.4.3 with the patch applied, I've been able to pull out the filter that was failing: link.

The reported cause is now ENOMEM instead of EINVAL, which I guess is expected given that _gen_bpf_build_bpf is failing and returning a NULL program. The PFC prints out fine, though. Modifying seccomp code to report the return value of _gen_bpf_build_bpf shows EFAULT as the cause.

As a quick hack I ran :%s/return -EFAULT/abort() over src/gen_bpf.c, and was able to extract this stack trace:

EFAULT stacktrace

(gdb) bt full
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
        set = {__val = {0, 140084028365964, 140083248439464, 140083248438968, 140083248431088, 140084028368143, 28659884033, 140083965300736, 
            140083248439464, 140083248438968, 140083248431088, 140084028351031, 140084019988760, 140083248439624, 140083248431200, 140084028372597}}
        pid = <optimized out>
        tid = <optimized out>
        ret = <optimized out>
#1  0x00007f67daa4d55b in __GI_abort () at abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x7f67d6f3eec0, sa_sigaction = 0x7f67d6f3eec0}, sa_mask = {__val = {140083965300736, 
              140083965300736, 0, 0, 140083248438968, 140083248438968, 140083248439464, 140083248431504, 140084028417173, 140083964793344, 
              140083965300736, 140083248431552, 140083994791895, 140083248431552, 140083994787642, 140083965300736}}, sa_flags = -1404894496, 
          sa_restorer = 0x0}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00007f67d8bfd455 in _gen_bpf_build_bpf (state=0x7f67ac4302e0, col=0x7f67d6f63040) at gen_bpf.c:1943
        rc = 0
        iter = 1
        h_val = 1425818561
        res_cnt = 0
        jmp_len = 0
        arch_x86_64 = 0
        arch_x32 = -1
        instr = {op = 32, jt = {tgt = {imm_j = 0 '\000', imm_k = 0, hash = 0, db = 0x0, blk = 0x0, nxt = 0}, type = TGT_NONE}, jf = {tgt = {
              imm_j = 0 '\000', imm_k = 0, hash = 0, db = 0x0, blk = 0x0, nxt = 0}, type = TGT_NONE}, k = {tgt = {imm_j = 4 '\004', imm_k = 4, 
              hash = 4, db = 0x4, blk = 0x4, nxt = 4}, type = TGT_K}}
        i_iter = 0x7f67d6fdcb60
        b_badarch = 0x7f67d6fd9000
        b_default = 0x7f67d6fd9060
        b_head = 0x7f67d6fda1a0
        b_tail = 0x7f67d6fd9000
        b_iter = 0x0
        b_new = 0x7f67d6fe3300
        b_jmp = 0x0
        db_secondary = 0x0
        pseudo_arch = {token = 0, token_bpf = 0, size = ARCH_SIZE_UNSPEC, endian = ARCH_ENDIAN_LITTLE, syscall_resolve_name = 0x0, 
          syscall_resolve_num = 0x0, syscall_rewrite = 0x0, rule_add = 0x0}
#3  0x00007f67d8bfd560 in gen_bpf_generate (col=0x7f67d6f63040) at gen_bpf.c:1971
        rc = 0
        state = {htbl = {0x0 <repeats 256 times>}, attr = 0x7f67d6f63044, bad_arch_hsh = 889798935, def_hsh = 742199527, arch = 0x7f67ac4301e0, 
          bpf = 0x7f67d6f64010}
        prgm = 0x7f67d6f64010
#4  0x00007f67d8bf64a7 in sys_filter_load (col=0x7f67d6f63040) at system.c:265
        rc = 32615
        prgm = 0x0
#5  0x00007f67d8bf4f10 in seccomp_load (ctx=0x7f67d6f63040) at api.c:287
        col = 0x7f67d6f63040

That corresponds with line 1943:

https://github.com/seccomp/libseccomp/blob/1dde9d94e0848e12da20602ca38032b91d521427/src/gen_bpf.c#L1935-L1943

Given the nature of the replacement, I think we can exclude any EFAULT in any helper function, since those would have aborted first.

After this, I tried reproducing the same with HEAD -- it still happens. Next, :%s/goto build_bpf_free_blks/abort() and repeat. The cause was:

https://github.com/seccomp/libseccomp/blob/34bf78abc9567b66c72dbe67e7f243072162a25f/src/gen_bpf.c#L2219-L2220

Thankfully this function was short, and had only a handful of failure points. Another round of abort insertions later:

Trace

(gdb) bt full
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
        set = {__val = {0, 140050183343588, 0, 448, 140049402494880, 140049402509040, 140049402494832, 140050183342988, 140049402495088, 
            140049402509040, 140049402494896, 140050183343588, 4294967296, 140049402509040, 140049402509040, 140049402509040}}
        pid = <optimized out>
        tid = <optimized out>
        ret = <optimized out>
#1  0x00007f5ff953055b in __GI_abort () at abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x7f5ff595d260, sa_sigaction = 0x7f5ff595d260}, sa_mask = {__val = {139642271694862, 
              140050119389792, 0, 0, 140049402502840, 0, 140049402503336, 140049402502888, 140049402502840, 112, 384, 140049402502840, 140050149861504, 
              140049402495328, 140050149857273, 392}}, sa_flags = 448, sa_restorer = 0x7f5ff595d240}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00007f5ff76edee5 in _bpf_append_blk (prg=0x7f5ff5964010, blk=0x7f5ff59df1a0) at gen_bpf.c:452
        rc = -12
        i_new = 0x0
        i_iter = 0x7f5ff59fa178
        old_cnt = 48
        iter = 1
#3  0x00007f5ff76f3716 in _gen_bpf_build_bpf (state=0x7f5fcae302d0, col=0x7f5ff59c5000) at gen_bpf.c:2223
        rc = 0
        iter = 1
        h_val = 1425818561
        res_cnt = 0
        jmp_len = 0
        arch_x86_64 = 0
        arch_x32 = -1
        instr = {op = 32, jt = {tgt = {imm_j = 0 '\000', imm_k = 0, hash = 0, db = 0x0, blk = 0x0, nxt = 0}, type = TGT_NONE}, jf = {tgt = {
              imm_j = 0 '\000', imm_k = 0, hash = 0, db = 0x0, blk = 0x0, nxt = 0}, type = TGT_NONE}, k = {tgt = {imm_j = 4 '\004', imm_k = 4, 
              hash = 4, db = 0x4, blk = 0x4, nxt = 4}, type = TGT_K}}
        i_iter = 0x7f5ff59e1b60
        b_badarch = 0x7f5ff59de000
        b_default = 0x7f5ff59de060
        b_head = 0x7f5ff59df1a0
        b_tail = 0x7f5ff59de000
        b_iter = 0x7f5ff59df1a0
        b_new = 0x7f5ff59e8300
        b_jmp = 0x7f5ff59df0e0
        db_secondary = 0x0
        pseudo_arch = {token = 0, token_bpf = 0, size = ARCH_SIZE_UNSPEC, endian = ARCH_ENDIAN_LITTLE, syscall_resolve_name = 0x0, 
          syscall_resolve_num = 0x0, syscall_rewrite = 0x0, rule_add = 0x0}
#4  0x00007f5ff76f3874 in gen_bpf_generate (col=0x7f5ff59c5000, prgm_ptr=0x7f5fcae30b40) at gen_bpf.c:2270
        rc = 0
        state = {htbl = {0x0, 0x7f5ff593ef80, 0x7f5ff593efe0, 0x7f5ff593efc0, 0x0, 0x7f5ff595d000, 0x7f5ff593ef60, 0x7f5ff593ef00, 
            0x0 <repeats 248 times>}, attr = 0x7f5ff59c5004, bad_arch_hsh = 889798935, def_hsh = 742199527, bpf = 0x7f5ff5964010, 
          arch = 0x7f5fcae301c0, b_head = 0x7f5ff59e8300, b_tail = 0x7f5ff59de120, b_new = 0x7f5ff59e8300}
        prgm = <optimized out>
#5  0x00007f5ff76eb275 in sys_filter_load (col=0x7f5ff59c5000, rawrc=false) at system.c:307
        rc = 0
        prgm = 0x0
#6  0x00007f5ff76e9505 in seccomp_load (ctx=0x7f5ff59c5000) at api.c:386
        col = 0x7f5ff59c5000
        rawrc = false

https://github.com/seccomp/libseccomp/blob/34bf78abc9567b66c72dbe67e7f243072162a25f/src/gen_bpf.c#L449-L452

So it's realloc failing again, and _bpf_append_blk is returning -ENOMEM that gets masked by _gen_bpf_build_bpf and turned into -EFAULT. This isn't a big deal, but since you said better error reporting is a target of 2.5, figured I'd mention it since this looks in-scope :slightly_smiling_face:
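To make the masking concrete, here is a minimal sketch contrasting the two behaviors. The function names and the `oom` flag are hypothetical and only mimic the call shape described above; this is not the code from libseccomp or from any patch.

```c
#include <errno.h>

/* Hypothetical stand-in for _bpf_append_blk(); `oom` simulates realloc failing. */
int append_blk(int oom)
{
    return oom ? -ENOMEM : 0;
}

/* Masks the cause: every callee failure is reported as -EFAULT. */
int build_masking(int oom)
{
    if (append_blk(oom) < 0)
        return -EFAULT;
    return 0;
}

/* Propagates the callee's code so the caller can tell ENOMEM from EFAULT. */
int build_propagating(int oom)
{
    int rc = append_blk(oom);

    if (rc < 0)
        return rc;
    return 0;
}
```

With the propagating variant, an out-of-memory condition surfaces as -ENOMEM at the top of the call chain, which would have pointed at the allocator much earlier in this investigation.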

Some poking with GDB:

(gdb) f 2
#2  0x00007f5ff76edee5 in _bpf_append_blk (prg=0x7f5ff5964010, blk=0x7f5ff59df1a0) at gen_bpf.c:452
452         abort();
(gdb) info args
prg = 0x7f5ff5964010
blk = 0x7f5ff59df1a0
(gdb) print prg->blks
$4 = (bpf_instr_raw *) 0x7f5ff59fa000
(gdb) x/32bx &prg->blks
0x7f5ff5964018: 0x00    0xa0    0x9f    0xf5    0x5f    0x7f    0x00    0x00
0x7f5ff5964020: 0x5a    0x5a    0x5a    0x5a    0x5a    0x5a    0x5a    0x5a
0x7f5ff5964028: 0x5a    0x5a    0x5a    0x5a    0x5a    0x5a    0x5a    0x5a
0x7f5ff5964030: 0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
(gdb) print ((prg)->blk_cnt * sizeof(*((prg)->blks)))
$5 = 392
(gdb) print prg->blk_cnt
$6 = 49

This one really does start to look like an allocator failure...

Aha, this story has finally reached its _thrilling_ conclusion — I've figured out what's been happening, and verified a fix :slightly_smiling_face:

Since it might make for an interesting story, here it is:

The main process that forks off the worker usually sits at ~80 MB RSS. After the fork, the child restricts its memory usage via rlimit, sometimes to 64 MB. This puts it in a position where its current memory usage exceeds its limit, which rlimit allows. _Most_ of the time, the memory allocator will have enough free memory lying around to service libseccomp's initialization routines without requesting more from the kernel. But when it _doesn't_, and needs to request space for an extra arena or something, the kernel won't provide it since the process is already over its limit.

In 2.4.3, this failure to obtain memory manifested in EINVAL and a double-free. In master post-https://github.com/seccomp/libseccomp/commit/3a1d1c977065f204b96293cccfe7d3e5aa0d7ace, EFAULT is reported instead. With https://github.com/seccomp/libseccomp/pull/257 applied, ENOMEM is correctly reported.

The reason this happens so rarely then becomes obvious: it's entirely reliant on whether the allocator has enough memory lying around to build the BPF program without requesting more from the kernel. glibc's allocator is looser about allowing fragmentation to build up, so this never happened with it in place. jemalloc places tighter bounds, and leads to an increased probability of needing to request memory during seccomp_load -- just enough to notice the resulting failures, but still be infuriating to track down.

The fix, then, is simply to move all setrlimit calls to _after_ seccomp_load. In doing so, realloc no longer fails in _bpf_append_blk, and the filter loads successfully. This does mean that the filter needs to allow setrlimit, but in my case this was acceptable. More generally, I think this issue would be resolved by something like https://github.com/seccomp/libseccomp/issues/123.
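The ordering can be sketched as a small helper that loads the filter first and applies the limit afterwards. Everything named here is an assumption for illustration: `load_fn`, `noop_load`, and `load_then_limit` are hypothetical, and in real code the callback would wrap `seccomp_load(ctx)`. The sketch avoids linking against libseccomp so that only the ordering is on display.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/resource.h>

/* Hypothetical callback type; in real code this would wrap seccomp_load(ctx). */
typedef int (*load_fn)(void *ctx);

/* Placeholder loader used only to exercise the ordering; always succeeds. */
int noop_load(void *ctx)
{
    (void)ctx;
    return 0;
}

/* Load the filter while memory is still unrestricted, then apply the limit. */
int load_then_limit(load_fn load, void *ctx, int resource,
                    const struct rlimit *lim)
{
    int rc = load(ctx);

    if (rc < 0)
        return rc;
    if (setrlimit(resource, lim) < 0)
        return -errno;
    return 0;
}
```

As noted above, the trade-off is that the filter has to permit the limit-setting call, since it now runs after the filter is active.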

@pcmoore, @drakenclimber -- thanks again for all your help in debugging this issue! I'm glad I can put it behind me now, but your pointers were invaluable in getting there :smiley:
