Libseccomp: BUG: don't generate BPF for pseudo-syscalls

Created on 15 Jun 2020  ·  20 comments  ·  Source: seccomp/libseccomp

See issue #249, specifically this comment:

> Why check for non-existent syscall 4294957285 in actually generated bpf?

We shouldn't, and we didn't used to do that, but it looks like current libseccomp has a bug here.
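For readers wondering where 4294957285 comes from: it is what a negative syscall number looks like when read as an unsigned 32-bit value, which is how a BPF comparison treats its operand. A minimal sketch of that arithmetic follows; the value -10011 is inferred from the number quoted above rather than stated in the thread, and `as_bpf_operand` is a hypothetical helper, not a libseccomp function.

```c
#include <stdint.h>

/* Hypothetical helper: reinterpret a (possibly negative) syscall number
 * as the unsigned 32-bit operand a BPF comparison would use. */
static uint32_t as_bpf_operand(int32_t syscall_num)
{
	return (uint32_t)syscall_num;
}
```

With this helper, `as_bpf_operand(-10011)` evaluates to 4294957285, since 4294967296 - 10011 = 4294957285.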

Credit to @vt-alt for reporting this bug.

bug priority/medium

All 20 comments

REPRODUCER REMOVED IN FAVOR OF IMPROVED DRAFT BELOW

To be clear, we should still emit the pseudo-syscalls in the PFC, for debugging purposes, but we shouldn't emit a (useless) BPF filter rule.

Thanks for creating the issue from my report.

> To be clear, we should still emit the pseudo-syscalls in the PFC, for debugging purposes, but we shouldn't emit a (useless) BPF filter rule.

Please rethink this. I believe this will only complicate and obscure things if PFC will not reflect BPF.

> Please rethink this. I believe this will only complicate and obscure things if PFC will not reflect BPF.

If the syscall is missing from the PFC we will receive a number of bogus bug reports talking about how the library is failing to add a filter for (non-existent) syscalls.

For those who understand the concept of pseudo-syscalls, it is a trivial exercise to remove them from the PFC output. It's also worth mentioning that the PFC output is not intended to be an exact copy of the BPF output; it's there simply as a debugging tool and an easy way to visualize the generated filter code.

> easy way to visualize the generated filter code

But, then it will not visualize _generated_ filter code since in the generated code pseudo syscalls should be absent.

> If the syscall is missing from the PFC we will receive a number of bogus bug reports talking about how the library is failing to add a filter for (non-existent) syscalls.

It's much easier to understand that pseudo-syscall checks should not be present in the code (as in 'optimized out', because 'there are no such syscalls for the arch') than to understand a logic difference (the presence or absence of conditionals) between the visualization of the code and the actual code.

I would rather have the PFC reflect the BPF than use scmp_bpf_disasm, since PFC output is much easier to read.

I understand your concern @vt-alt, and perhaps in a future release we do that with PFC, but it is my opinion that removing the pseudo-syscalls from the PFC output is a mistake.

@drakenclimber do you have any strong opinions on this?

Revised reproducer for x86_64:

#include <stdlib.h>
#include <errno.h>

#include <seccomp.h>

#include "util.h"

int main(int argc, char *argv[])
{
        int rc;
        struct util_options opts;
        scmp_filter_ctx ctx = NULL;

        rc = util_getopt(argc, argv, &opts);
        if (rc < 0)
                goto out;

        ctx = seccomp_init(SCMP_ACT_KILL);
        if (ctx == NULL)
                return ENOMEM;

        rc = seccomp_arch_add(ctx, SCMP_ARCH_X32);
        if (rc < 0)
                goto out;

        rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(access), 0);
        if (rc < 0)
                goto out;
        rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(arm_fadvise64_64), 0);
        if (rc < 0)
                goto out;

        rc = util_filter_output(&opts, ctx);
        if (rc)
                goto out;

out:
        seccomp_release(ctx);
        return (rc < 0 ? -rc : rc);
}

UPDATED: fixed a problem with TSKIP

Not yet fully tested, but this may be a fix - can you verify this is reasonable for the balanced tree optimization algorithm @drakenclimber?

diff --git a/src/arch-arm.c b/src/arch-arm.c
index 3465111..4dd4b63 100644
--- a/src/arch-arm.c
+++ b/src/arch-arm.c
@@ -54,7 +54,7 @@ int arm_syscall_resolve_name_munge(const char *name)
        if (sys == __NR_SCMP_ERROR)
                return sys;

-       return sys + __SCMP_NR_BASE;
+       return (sys | __SCMP_NR_BASE);
 }

 /**
@@ -68,7 +68,7 @@ int arm_syscall_resolve_name_munge(const char *name)
  */
 const char *arm_syscall_resolve_num_munge(int num)
 {
-       return arm_syscall_resolve_num(num - __SCMP_NR_BASE);
+       return arm_syscall_resolve_num(num & (~__SCMP_NR_BASE));
 }

 const struct arch_def arch_def_arm = {
diff --git a/src/arch-x32.c b/src/arch-x32.c
index 7b97fb3..3890968 100644
--- a/src/arch-x32.c
+++ b/src/arch-x32.c
@@ -43,7 +43,7 @@ int x32_syscall_resolve_name_munge(const char *name)
        if (sys == __NR_SCMP_ERROR)
                return sys;

-       return sys + X32_SYSCALL_BIT;
+       return (sys | X32_SYSCALL_BIT);
 }

 /**
@@ -57,7 +57,7 @@ int x32_syscall_resolve_name_munge(const char *name)
  */
 const char *x32_syscall_resolve_num_munge(int num)
 {
-       return x32_syscall_resolve_num(num - X32_SYSCALL_BIT);
+       return x32_syscall_resolve_num(num & (~X32_SYSCALL_BIT));
 }

 const struct arch_def arch_def_x32 = {
diff --git a/src/gen_bpf.c b/src/gen_bpf.c
index 55a7958..ae9c3f4 100644
--- a/src/gen_bpf.c
+++ b/src/gen_bpf.c
@@ -1555,6 +1555,10 @@ static int _gen_bpf_syscalls(struct bpf_state *state,
        for (s_iter = s_tail; s_iter != NULL; s_iter = s_iter->pri_prv) {
                if (!s_iter->valid)
                        continue;
+               /* skip pseudo-syscalls */
+               if ((s_iter->num & 0x80000000) &&
+                   (state->attr->api_tskip == 0 || s_iter->num != -1))
+                       continue;

                if (*bintree_levels > 0 &&
                    ((syscall_cnt + empty_cnt) % SYSCALLS_PER_NODE) == 0)
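The skip condition in the gen_bpf.c hunk above can be read as a standalone predicate. The sketch below restates it under a hypothetical name (`skip_in_bpf` is not a libseccomp function): a number with the sign bit set is treated as a pseudo-syscall and skipped, except for -1 when the TSKIP attribute is enabled, since that attribute exists to allow rules for syscall -1.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical restatement of the patch's condition:
 *   (num & 0x80000000) && (api_tskip == 0 || num != -1)
 * Returns true when no BPF should be generated for this syscall number. */
static bool skip_in_bpf(int num, uint32_t api_tskip)
{
	bool negative = ((uint32_t)num & 0x80000000u) != 0;
	bool tskip_exempt = (api_tskip != 0 && num == -1);

	return negative && !tskip_exempt;
}
```

Under this reading, `skip_in_bpf(-1, 1)` is false (the rule is kept), while `skip_in_bpf(-1, 0)` and `skip_in_bpf(-10011, 1)` are both true.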

> I understand your concern @vt-alt, and perhaps in a future release we do that with PFC, but it is my opinion that removing the pseudo-syscalls from the PFC output is a mistake.
>
> @drakenclimber do you have any strong opinions on this?

Not really, but I think the root of the problem is that the PFC output is being used in multiple, divergent ways:

  1. As mentioned above, it's an easy way for new(er) users to roughly verify their filter
  2. More advanced users are expecting it to be a close approximation of the actual BPF filter

Perhaps we could add a --no-pseudo-syscalls flag to the PFC logic? Then it can stay the same for beginner users, but advanced users can get a better approximation of the BPF.

> _UPDATED: fixed a problem with TSKIP_
>
> Not yet fully tested, but this may be a fix - can you verify this is reasonable for the balanced tree optimization algorithm @drakenclimber?

Will do. I want to see if I can make an automated test that will reproduce this scenario.

> > _UPDATED: fixed a problem with TSKIP_
> > Not yet fully tested, but this may be a fix - can you verify this is reasonable for the balanced tree optimization algorithm @drakenclimber?
>
> Will do. I want to see if I can make an automated test that will reproduce this scenario.

And of course I'll start with the reproducer test you have above. Thanks!

> _UPDATED: fixed a problem with TSKIP_
>
> Not yet fully tested, but this may be a fix - can you verify this is reasonable for the balanced tree optimization algorithm @drakenclimber?

The binary tree precalculates when the accumulator needs to be modified, and this is how it knows to insert the jge logic. Yanking out syscalls (like the proposal above) while building the filter breaks this logic.

I haven't fully tested my changes yet either :), but I'm pretty sure we will need something like this to make the binary tree work with the removal of pseudo-syscalls. I generated a BPF binary tree using this and it looks reasonable. As we get closer to a production-ready solution, I'll fully verify it.

@@ -1532,11 +1532,31 @@ static int _gen_bpf_syscalls(struct bpf_state *state,
                _sys_sort(db_secondary->syscalls, &s_head, &s_tail, optimize);

        if (optimize == 2) {
+               /* since pseudo-syscalls are removed from the filter, we need
+                * to calculate the syscall count by hand
+                */
+               for (s_iter = s_tail; s_iter != NULL; s_iter = s_iter->pri_prv) {
+                       if (!s_iter->valid)
+                               continue;
+
+                       /* skip pseudo-syscalls */
+                       if ((s_iter->num & 0x80000000) &&
+                           (state->attr->api_tskip == 0 || s_iter->num != -1))
+                               continue;
+
+                       syscall_cnt++;
+               }
+
                rc = _gen_bpf_init_bintree(&bintree_hashes, &bintree_syscalls,
-                                          bintree_levels, db->syscall_cnt,
+                                          bintree_levels, syscall_cnt,
                                           &empty_cnt);
                if (rc < 0)
                        goto out;
+
+               /* reset the syscall_cnt variable because later in this
+                * function it's used as a counter
+                */
+               syscall_cnt = 0;
        }

This could be made smarter/faster if we have a variable in the db structure tracking the "valid" syscall count.
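One way to picture that suggestion is a second counter maintained as entries are added, so the generator does not need a counting pass. This is only a sketch under assumptions: `sys_entry`, `sys_db`, `emit_cnt`, `emits_bpf` and `db_add` are hypothetical, simplified stand-ins and do not reflect libseccomp's real internal structures.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified syscall list entry */
struct sys_entry {
	int num;
	bool valid;
	struct sys_entry *next;
};

/* Hypothetical db: syscall_cnt counts every entry, emit_cnt counts only
 * the entries that would actually produce BPF */
struct sys_db {
	struct sys_entry *head;
	unsigned int syscall_cnt;
	unsigned int emit_cnt;
};

/* An entry produces BPF if it is valid and is not a pseudo-syscall
 * (with the same -1/TSKIP exception as the patch above) */
static bool emits_bpf(const struct sys_entry *e, bool tskip)
{
	if (!e->valid)
		return false;
	if (((uint32_t)e->num & 0x80000000u) && (!tskip || e->num != -1))
		return false;
	return true;
}

/* Add an entry and keep both counters up to date */
static void db_add(struct sys_db *db, struct sys_entry *e, bool tskip)
{
	e->next = db->head;
	db->head = e;
	db->syscall_cnt++;
	if (emits_bpf(e, tskip))
		db->emit_cnt++;
}
```

The tree setup could then be sized from `emit_cnt` directly. A real implementation would also have to adjust the counter when entries are removed or invalidated, which this sketch leaves out.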

> Perhaps we could add a --no-pseudo-syscalls flag to the PFC logic?

Where would this flag go? I don't think we want this as a build time option. I suppose we could add a filter option, but I'm not really excited about that. I'd vote for sticking with the pseudo-syscalls in the PFC but dropping it from the BPF (obviously) for now, and if we need to augment this at some point in the future we can.

> I haven't fully tested my changes yet either :), but I'm pretty sure we will need something like this to make the binary tree work with the removal of pseudo-syscalls. I generated a BPF binary tree using this and it looks reasonable. As we get closer to a production-ready solution, I'll fully verify it.

I suspected this was going to break the tree optimization.

@drakenclimber considering this impacts the tree sorting much more than the standard optimization, do you want to take this issue? Feel free to steal as much or as little of the code I copy and pasted above as makes sense.

One thing I think we should do regardless are the changes in "arch-arm.c" and "arch-x32.c" as it makes more sense.
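A short illustration of why the bitwise form in those two files is arguably the more natural one: setting and clearing a flag bit is idempotent, while adding and subtracting the same constant is not. The sketch uses unsigned arithmetic to keep the wraparound well defined (the patch itself operates on `int`), takes 0x40000000 as the x32 ABI bit, and all four helper names are hypothetical.

```c
#include <stdint.h>

#define X32_SYSCALL_BIT 0x40000000u

/* Arithmetic tagging, as in the lines removed by the patch */
static uint32_t tag_add(uint32_t sys)    { return sys + X32_SYSCALL_BIT; }
static uint32_t untag_sub(uint32_t num)  { return num - X32_SYSCALL_BIT; }

/* Bitwise tagging, as in the lines added by the patch */
static uint32_t tag_or(uint32_t sys)     { return sys | X32_SYSCALL_BIT; }
static uint32_t untag_mask(uint32_t num) { return num & ~X32_SYSCALL_BIT; }
```

Tagging an already tagged number with `tag_or` leaves it unchanged, and `untag_mask` on an untagged number is a no-op. By contrast, `tag_add(tag_add(21))` yields 0x80000015 and `untag_sub(21)` wraps to 0xC0000015.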

> > Perhaps we could add a --no-pseudo-syscalls flag to the PFC logic?
>
> Where would this flag go? I don't think we want this as a build time option. I suppose we could add a filter option, but I'm not really excited about that. I'd vote for sticking with the pseudo-syscalls in the PFC but dropping it from the BPF (obviously) for now, and if we need to augment this at some point in the future we can.

I admit that I typed before I thought. ;)

Yeah, it would have to be a filter option and that feels wrong. I agree; let's do as you outlined above. If we continue to get questions, then we can revisit.

> > I haven't fully tested my changes yet either :), but I'm pretty sure we will need something like this to make the binary tree work with the removal of pseudo-syscalls. I generated a BPF binary tree using this and it looks reasonable. As we get closer to a production-ready solution, I'll fully verify it.
>
> I suspected this was going to break the tree optimization.
>
> @drakenclimber considering this impacts the tree sorting much more than the standard optimization, do you want to take this issue? Feel free to steal as much or as little of the code I copy and pasted above as makes sense.
>
> One thing I think we should do regardless are the changes in "arch-arm.c" and "arch-x32.c" as it makes more sense.

Sure. I can own this one.

@drakenclimber

> As mentioned above, it's an easy way for new(er) users to roughly verify their filter

A "pseudo-syscall" does not exist in the kernel, and it's not a well-known concept either. It's purely an invention of libseccomp, and it is not explained here. It is only stated that they are negative numbers and that they appear when a given syscall does not exist for the architecture. How is this different from a non-existent syscall? For what purpose are they negative? The pseudo-syscall concept is really confusing for new users.

I can't believe new(er) users want to see non-existent syscalls checked in their filters.

Let me add one more point. All of this is in the field of security, where only careful and detailed understanding works. You are creating new, obscure concepts (pseudo-syscalls) and a difference between representations (BPF and PFC). Is this really intended for new users, only to confuse them more?

@vt-alt thank you for expressing your concern, but for the v2.5.0 release we are going to suppress the pseudo-syscalls from the BPF filter, and continue to display them in the PFC filter. I recognize that you may disagree with this decision but I kindly ask you to respect this decision. In future releases we will do a better job explaining the purpose behind the pseudo-syscalls (it's for multiple ABI support) and documenting it in the manpages; I've created an issue for that (link below), you are welcome and encouraged to participate in that discussion and any PRs that result. It's also possible we revise our approach to the PFC filter in the future, but I don't want to promise anything here.

Once again, thank you for bringing the BPF problem to our attention; you've helped to improve the next libseccomp release!

Closing via #264.
