-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance impact measurement tool #24
Comments
I don't think that adding tracepoints is a good solution here. Possibly, it would be better to use kprobes. However, they have some issues compared to tracepoints:
What do you think? [1] https://lore.kernel.org/all/20230111065930.1494-1-cuiyunhui@bytedance.com/
|
I think kprobes/kretprobes would be fine from a performance point of view, but that may not be enough to identify requests to specific kernel objects (e.g. the performance impact of opening a specific file). That may not be an issue for benchmark if we can scope the probes to only the benchmarked process though. I think tracepoints are useful when we need to increment a counter or print variables according to code within a function or with a specific semantic (not necessarily passed as function argument). The kernel API change and especially the Landlock changes should not be a big issue because the benchmark tool would mostly live with the kernel source code and could be part of the Kselftests and be updated with the kernel code. |
Hello @sm1ling-knight -- just for coordination; did you start working on this bug already? |
Hello! Yep, I'm currently working on it. |
Good! We should explicit say when we started or want to work on a task to avoid duplicated work. 😉 |
@l0kod, do you have any cases of what events would be useful to trace in Landlock? I currently have an idea of tracing I also want to add CPU profiling to measure overhead of Landlock changes. What do you think? |
I expect On top of that, timing each LSM hook (including the Landlock hook implementations) called per syscall should give a good overview of Landlock's impact on the kernel's execution time.
That would be very useful! |
For reference, Flame Graphs are a visualization tool (relying on kernel features such as perf) useful to see a performance impact and to debug, but not to quickly compare two implementations nor to measure according to a context (e.g., depth of a file path) in an automatic (and deterministic as much as possible) way. |
In my mind, the percentage of time spent in Landlock would be an interesting metric to capture; In particular, the entry point functions are interesting, which are hit during program execution; that is: (a) the three syscalls, and (b) the LSM hooks which Landlock overrides. This is already recorded when doing
It might be wiser to replace /bin/ls with something more long running. With The result can then be plotted with Brendan Gregg's Flame graph scripts; I'm using this:
or it can be analyzed with This is just how far I got with the obvious part. I figured that
|
I don't think that perf is suitable for our case. Execution time of hooks may be too short for perf to provide sufficient accuracy. On my system execution of 'file_open' hook takes about 4e-5 seconds (in average) for sandboxed I got these values using bpftrace tool and implementing #define call_void_hook(FUNC, ...) \
do { \
struct security_hook_list *P; \
\
hlist_for_each_entry(P, &security_hook_heads.FUNC, list) { \
trace_lsm_hook_enter(P->lsmid->id, #FUNC); \
P->hook.FUNC(__VA_ARGS__); \
trace_lsm_hook_exit(P->lsmid->id, #FUNC); \
} \
} while (0) Tracepoints were used to gather total execution time and total number of calls for 'file_open' hook and openat syscall. You can see bpftrace script similar to the one I used for gathering at the end of the message It seems that sampling frequency should be at least on the order of 10^-5 seconds for adequate accuracy. From Brendan Gregg's blog: It turns out that the sampling frequency that does not involve overhead is on the order of 10^-2. Using perf would be unacceptable in order to measure time taken by hook accurately enough. Moreover, using perf involves problem with tracing static symbols in fs.c, net.c (such as hooks) and extracting CPU percentages (as you mentioned). I suggest using bpftrace tool for profiling. It seems like quite convenient way to measure execution time for our case with almost zero overhead. Bpftrace allows to add gathering instructions for every required tracing blob (tracepoint, kprobe, uprobe..). To measure time in this way, following steps are required:
|
Because we can change the Landlock/kernel code, I don't think we should be limited by any sampling frequency. Your bpftrace script looks good! I guess we should not need more dependency than this tool.
We indeed need all benchmarks to be simply run, eventually by a CI. However, it would help to also be able to create a flamegraph from these benchmarks, but that could be another iteration of the patches.
We should have both microbenchmarks and real-world benchmarks. It looks like bpftrace scripts should be good for both cases. The main intensive and real-world scenarios that come to mind are:
|
Hello! Can i send the implementation as landlock-test-tools pull request? (when it's ready) |
As for the seccomp benchmark tool, it would help to keep similar tools in the kernel source tree e.g., to easily run performance tests with a kernel CI, and to add reference to performance improvements in commit messages. Moreover, the benchmark scripts may need to evolve along with future kernel versions (e.g. if an LSM hooks change). I think higher level tooling (e.g. generate flame graphs, compare performance of two kernel versions) would make sense in landlock-test-tools though. |
Do you agree that adding tracepoints |
Yes, that looks reasonable to me. |
Btw, i decided to try bcc to solve this issue. It's a bit more verbose compared to bpftrace, but it makes it easier to work with command line arguments and custom output (for manual usage). I think it doesn't require a lot of dependencies, so I'll try to make the first patch using bcc. It should be easy to switch into bpftrace, if something goes wrong. |
FYI, patches fixing an issue in perf with Landlock and BTF: |
commit 4a058b3 upstream. KASAN reported a null-ptr-deref issue when executing the following command: # echo ts2020 0x20 > /sys/bus/i2c/devices/i2c-0/new_device KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017] CPU: 53 UID: 0 PID: 970 Comm: systemd-udevd Not tainted 6.12.0-rc2+ #24 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009) RIP: 0010:ts2020_probe+0xad/0xe10 [ts2020] RSP: 0018:ffffc9000abbf598 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffffc0714809 RDX: 0000000000000002 RSI: ffff88811550be00 RDI: 0000000000000010 RBP: ffff888109868800 R08: 0000000000000001 R09: fffff52001577eb6 R10: 0000000000000000 R11: ffffc9000abbff50 R12: ffffffffc0714790 R13: 1ffff92001577eb8 R14: ffffffffc07190d0 R15: 0000000000000001 FS: 00007f95f13b98c0(0000) GS:ffff888149280000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000555d2634b000 CR3: 0000000152236000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ts2020_probe+0xad/0xe10 [ts2020] i2c_device_probe+0x421/0xb40 really_probe+0x266/0x850 ... The cause of the problem is that when using sysfs to dynamically register an i2c device, there is no platform data, but the probe process of ts2020 needs to use platform data, resulting in a null pointer being accessed. Solve this problem by adding checks to platform data. Fixes: dc245a5 ("[media] ts2020: implement I2C client bindings") Cc: <stable@vger.kernel.org> Signed-off-by: Li Zetao <lizetao1@huawei.com> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 146b6f1 ] Under certain kernel configurations when building with Clang/LLVM, the compiler does not generate a return or jump as the terminator instruction for ip_vs_protocol_init(), triggering the following objtool warning during build time: vmlinux.o: warning: objtool: ip_vs_protocol_init() falls through to next function __initstub__kmod_ip_vs_rr__935_123_ip_vs_rr_init6() At runtime, this either causes an oops when trying to load the ipvs module or a boot-time panic if ipvs is built-in. This same issue has been reported by the Intel kernel test robot previously. Digging deeper into both LLVM and the kernel code reveals this to be a undefined behavior problem. ip_vs_protocol_init() uses a on-stack buffer of 64 chars to store the registered protocol names and leaves it uninitialized after definition. The function calls strnlen() when concatenating protocol names into the buffer. With CONFIG_FORTIFY_SOURCE strnlen() performs an extra step to check whether the last byte of the input char buffer is a null character (commit 3009f89 ("fortify: Allow strlen() and strnlen() to pass compile-time known lengths")). This, together with possibly other configurations, cause the following IR to be generated: define hidden i32 @ip_vs_protocol_init() local_unnamed_addr #5 section ".init.text" align 16 !kcfi_type !29 { %1 = alloca [64 x i8], align 16 ... 14: ; preds = %11 %15 = getelementptr inbounds i8, ptr %1, i64 63 %16 = load i8, ptr %15, align 1 %17 = tail call i1 @llvm.is.constant.i8(i8 %16) %18 = icmp eq i8 %16, 0 %19 = select i1 %17, i1 %18, i1 false br i1 %19, label %20, label %23 20: ; preds = %14 %21 = call i64 @strlen(ptr noundef nonnull dereferenceable(1) %1) #23 ... 23: ; preds = %14, %11, %20 %24 = call i64 @strnlen(ptr noundef nonnull dereferenceable(1) %1, i64 noundef 64) #24 ... } The above code calculates the address of the last char in the buffer (value %15) and then loads from it (value %16). Because the buffer is never initialized, the LLVM GVN pass marks value %16 as undefined: %13 = getelementptr inbounds i8, ptr %1, i64 63 br i1 undef, label %14, label %17 This gives later passes (SCCP, in particular) more DCE opportunities by propagating the undef value further, and eventually removes everything after the load on the uninitialized stack location: define hidden i32 @ip_vs_protocol_init() local_unnamed_addr #0 section ".init.text" align 16 !kcfi_type !11 { %1 = alloca [64 x i8], align 16 ... 12: ; preds = %11 %13 = getelementptr inbounds i8, ptr %1, i64 63 unreachable } In this way, the generated native code will just fall through to the next function, as LLVM does not generate any code for the unreachable IR instruction and leaves the function without a terminator. Zero the on-stack buffer to avoid this possible UB. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202402100205.PWXIz1ZK-lkp@intel.com/ Co-developed-by: Ruowen Qin <ruqin@redhat.com> Signed-off-by: Ruowen Qin <ruqin@redhat.com> Signed-off-by: Jinghao Jia <jinghao7@illinois.edu> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 4a058b3 upstream. KASAN reported a null-ptr-deref issue when executing the following command: # echo ts2020 0x20 > /sys/bus/i2c/devices/i2c-0/new_device KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017] CPU: 53 UID: 0 PID: 970 Comm: systemd-udevd Not tainted 6.12.0-rc2+ #24 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009) RIP: 0010:ts2020_probe+0xad/0xe10 [ts2020] RSP: 0018:ffffc9000abbf598 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffffc0714809 RDX: 0000000000000002 RSI: ffff88811550be00 RDI: 0000000000000010 RBP: ffff888109868800 R08: 0000000000000001 R09: fffff52001577eb6 R10: 0000000000000000 R11: ffffc9000abbff50 R12: ffffffffc0714790 R13: 1ffff92001577eb8 R14: ffffffffc07190d0 R15: 0000000000000001 FS: 00007f95f13b98c0(0000) GS:ffff888149280000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000555d2634b000 CR3: 0000000152236000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ts2020_probe+0xad/0xe10 [ts2020] i2c_device_probe+0x421/0xb40 really_probe+0x266/0x850 ... The cause of the problem is that when using sysfs to dynamically register an i2c device, there is no platform data, but the probe process of ts2020 needs to use platform data, resulting in a null pointer being accessed. Solve this problem by adding checks to platform data. Fixes: dc245a5 ("[media] ts2020: implement I2C client bindings") Cc: <stable@vger.kernel.org> Signed-off-by: Li Zetao <lizetao1@huawei.com> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 146b6f1 ] Under certain kernel configurations when building with Clang/LLVM, the compiler does not generate a return or jump as the terminator instruction for ip_vs_protocol_init(), triggering the following objtool warning during build time: vmlinux.o: warning: objtool: ip_vs_protocol_init() falls through to next function __initstub__kmod_ip_vs_rr__935_123_ip_vs_rr_init6() At runtime, this either causes an oops when trying to load the ipvs module or a boot-time panic if ipvs is built-in. This same issue has been reported by the Intel kernel test robot previously. Digging deeper into both LLVM and the kernel code reveals this to be a undefined behavior problem. ip_vs_protocol_init() uses a on-stack buffer of 64 chars to store the registered protocol names and leaves it uninitialized after definition. The function calls strnlen() when concatenating protocol names into the buffer. With CONFIG_FORTIFY_SOURCE strnlen() performs an extra step to check whether the last byte of the input char buffer is a null character (commit 3009f89 ("fortify: Allow strlen() and strnlen() to pass compile-time known lengths")). This, together with possibly other configurations, cause the following IR to be generated: define hidden i32 @ip_vs_protocol_init() local_unnamed_addr #5 section ".init.text" align 16 !kcfi_type !29 { %1 = alloca [64 x i8], align 16 ... 14: ; preds = %11 %15 = getelementptr inbounds i8, ptr %1, i64 63 %16 = load i8, ptr %15, align 1 %17 = tail call i1 @llvm.is.constant.i8(i8 %16) %18 = icmp eq i8 %16, 0 %19 = select i1 %17, i1 %18, i1 false br i1 %19, label %20, label %23 20: ; preds = %14 %21 = call i64 @strlen(ptr noundef nonnull dereferenceable(1) %1) #23 ... 23: ; preds = %14, %11, %20 %24 = call i64 @strnlen(ptr noundef nonnull dereferenceable(1) %1, i64 noundef 64) #24 ... } The above code calculates the address of the last char in the buffer (value %15) and then loads from it (value %16). Because the buffer is never initialized, the LLVM GVN pass marks value %16 as undefined: %13 = getelementptr inbounds i8, ptr %1, i64 63 br i1 undef, label %14, label %17 This gives later passes (SCCP, in particular) more DCE opportunities by propagating the undef value further, and eventually removes everything after the load on the uninitialized stack location: define hidden i32 @ip_vs_protocol_init() local_unnamed_addr #0 section ".init.text" align 16 !kcfi_type !11 { %1 = alloca [64 x i8], align 16 ... 12: ; preds = %11 %13 = getelementptr inbounds i8, ptr %1, i64 63 unreachable } In this way, the generated native code will just fall through to the next function, as LLVM does not generate any code for the unreachable IR instruction and leaves the function without a terminator. Zero the on-stack buffer to avoid this possible UB. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202402100205.PWXIz1ZK-lkp@intel.com/ Co-developed-by: Ruowen Qin <ruqin@redhat.com> Signed-off-by: Ruowen Qin <ruqin@redhat.com> Signed-off-by: Jinghao Jia <jinghao7@illinois.edu> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
We need tooling to measure the performance impact of kernel changes. Until now, we used simple scripts to get an idea of the worse case scenarios, and manual profiling.
It would be useful to have a way to measure in a predictive way the overhead of sandboxing on a set of scenarios (e.g. path walk) with synthetic benchmarks. Several tracing kernel features might be interesting: tracepoints, eBPF...
We might need kernel changes such as tracepoints.
This measurement tool should be simple to use manually and by a CI (see landlock-test-tools). It would be useful to measure performance improvement changes (#1 and #9).
See seccomp_benchmark.c and other benchmark programs in tools/testing/selftests.
The text was updated successfully, but these errors were encountered: