
cleanup(engines): detach per-cpu kernel metrics from global kernel metrics #2031

Merged
merged 6 commits into from
Sep 5, 2024

Conversation

Andreagit97
Member

@Andreagit97 Andreagit97 commented Aug 28, 2024

What type of PR is this?

/kind cleanup

Any specific area of the project related to this PR?

/area libscap-engine-bpf

/area libscap-engine-kmod

/area libscap-engine-modern-bpf

Does this PR require a change in the driver versions?

No

What this PR does / why we need it:

As explained in issue #2028, it is better to split the per-CPU counters from the global counters for verbosity reasons.
More in detail, when the per-CPU counters are enabled, libscap also enables the global counters under the hood. This avoids a double loop over all the CPUs and keeps the code simpler, without duplication. The idea behind this choice is that a user typically enables the per-CPU stats to obtain more insight on top of the global ones, so the global ones should already be enabled...
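
For illustration, here is a minimal caller-side sketch of how both sets of counters could be requested after this change (the scap.h include path, the SCAP_SUCCESS check and the printing loop are assumptions made for the example; the scap_get_stats_v2 signature and the flag names are the ones used by this PR):

#include <stdint.h>
#include <stdio.h>
#include <scap.h> /* assumed include path for the public libscap header */

void print_kernel_metrics(scap_t* handle)
{
	/* Request both the global and the per-CPU kernel counters explicitly. */
	uint32_t flags = METRICS_V2_KERNEL_COUNTERS | METRICS_V2_KERNEL_COUNTERS_PER_CPU;
	uint32_t nstats = 0;
	int32_t rc = 0;

	const struct metrics_v2* stats = scap_get_stats_v2(handle, flags, &nstats, &rc);
	if(stats == NULL || rc != SCAP_SUCCESS)
	{
		fprintf(stderr, "unable to collect metrics v2\n");
		return;
	}

	/* Each entry carries its own name. */
	for(uint32_t i = 0; i < nstats; i++)
	{
		printf("%s\n", stats[i].name);
	}
}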

Which issue(s) this PR fixes:

Fixes #2028

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

cleanup(engines): detach per-cpu kernel metrics from global kernel metrics


Please double check driver/API_VERSION file. See versioning.

/hold


codecov bot commented Aug 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.30%. Comparing base (bf3c89b) to head (6aaec6c).
Report is 26 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #2031   +/-   ##
=======================================
  Coverage   74.30%   74.30%           
=======================================
  Files         253      253           
  Lines       30966    30966           
  Branches     5397     5400    +3     
=======================================
  Hits        23010    23010           
- Misses       7932     7946   +14     
+ Partials       24       10   -14     
Flag Coverage Δ
libsinsp 74.30% <100.00%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


github-actions bot commented Aug 28, 2024

Perf diff from master - unit tests

     6.32%     -1.50%  [.] sinsp_evt::get_type
     5.83%     -1.31%  [.] next
     4.81%     +0.78%  [.] sinsp_parser::process_event
     0.82%     +0.73%  [.] 0x00000000000e93c0
     9.74%     +0.72%  [.] sinsp_parser::reset
     6.77%     +0.60%  [.] sinsp::next
     0.23%     +0.58%  [.] sinsp_parser::parse_rw_exit
     2.02%     +0.56%  [.] scap_event_decode_params
     4.21%     -0.44%  [.] gzfile_read
     0.50%     +0.42%  [.] sinsp_container_info::sinsp_container_info

Heap diff from master - unit tests

peak heap memory consumption: 0B
peak RSS (including heaptrack overhead): 0B
total memory leaked: 0B

Heap diff from master - scap file

peak heap memory consumption: 0B
peak RSS (including heaptrack overhead): 0B
total memory leaked: 0B

Benchmarks diff from master

Comparing gbench_data.json to /root/actions-runner/_work/libs/libs/build/gbench_data.json
Benchmark                                                         Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------
BM_sinsp_split_mean                                            +0.0297         +0.0299           145           150           145           150
BM_sinsp_split_median                                          +0.0316         +0.0317           145           150           145           150
BM_sinsp_split_stddev                                          -0.3063         -0.3074             1             1             1             1
BM_sinsp_split_cv                                              -0.3263         -0.3275             0             0             0             0
BM_sinsp_concatenate_paths_relative_path_mean                  +0.0053         +0.0055            42            42            42            42
BM_sinsp_concatenate_paths_relative_path_median                +0.0056         +0.0057            42            42            42            42
BM_sinsp_concatenate_paths_relative_path_stddev                -0.4610         -0.4609             0             0             0             0
BM_sinsp_concatenate_paths_relative_path_cv                    -0.4639         -0.4638             0             0             0             0
BM_sinsp_concatenate_paths_empty_path_mean                     -0.0223         -0.0222            17            17            17            17
BM_sinsp_concatenate_paths_empty_path_median                   -0.0185         -0.0184            17            17            17            17
BM_sinsp_concatenate_paths_empty_path_stddev                   -0.8604         -0.8600             0             0             0             0
BM_sinsp_concatenate_paths_empty_path_cv                       -0.8572         -0.8568             0             0             0             0
BM_sinsp_concatenate_paths_absolute_path_mean                  +0.0392         +0.0393            43            44            43            44
BM_sinsp_concatenate_paths_absolute_path_median                +0.0433         +0.0434            43            45            43            45
BM_sinsp_concatenate_paths_absolute_path_stddev                +3.9479         +3.9463             0             1             0             1
BM_sinsp_concatenate_paths_absolute_path_cv                    +3.7615         +3.7594             0             0             0             0
BM_sinsp_split_container_image_mean                            +0.0009         +0.0011           349           349           349           349
BM_sinsp_split_container_image_median                          -0.0026         -0.0025           350           349           349           349
BM_sinsp_split_container_image_stddev                          -0.1712         -0.1718             3             3             3             3
BM_sinsp_split_container_image_cv                              -0.1720         -0.1727             0             0             0             0

* The following `if` handles the case in which we want to get the metrics per CPU but not the global ones.
* It is an unusual case, but at the moment we support it.
*/
if ((flags & METRICS_V2_KERNEL_COUNTERS_PER_CPU) && !(flags & METRICS_V2_KERNEL_COUNTERS))
Member Author

[EARLY FEEDBACK]

Actually, I'm also managing this weird case in which we have the per-CPU stats enabled but not the global ones... I cannot think of a real use case for it, so I'm not sure we want to keep it. WDYT? @FedeDP @incertum

Contributor

@FedeDP FedeDP Aug 28, 2024

I'd say that if KERNEL_COUNTERS are disabled, KERNEL_COUNTERS_PER_CPU must be disabled too!

Member Author

I see 3 options:

  1. handling the 2 stats separately (so looping twice over all the CPUs if both are enabled)
  2. handling the 2 stats together, so METRICS_V2_KERNEL_COUNTERS_PER_CPU can be enabled only if METRICS_V2_KERNEL_COUNTERS is enabled. We create a dependency
  3. like case 1, but with some duplicated logic that allows us to loop just once if both metrics are enabled (implemented in this PR)

Contributor

@FedeDP FedeDP Aug 28, 2024

I'd go with 2: easier and expected, since both metric flags share the same prefix.

Member Author

Yep, makes sense; we just need to put a log somewhere that warns the user if they enable one flag without the other.

Contributor

We can also say that if only METRICS_V2_KERNEL_COUNTERS_PER_CPU is passed, we silently enable METRICS_V2_KERNEL_COUNTERS too.

Member Author

We can also say that if only METRICS_V2_KERNEL_COUNTERS_PER_CPU is passed, we silently enable METRICS_V2_KERNEL_COUNTERS too.

Great idea, I will go for it!

@FedeDP
Contributor

FedeDP commented Aug 28, 2024

/milestone 0.18.0

@poiana poiana added this to the 0.18.0 milestone Aug 28, 2024
@Andreagit97 Andreagit97 changed the title [WIP] cleanup(engines): detach per-cpu kernel metrics from global kernel metrics cleanup(engines): detach per-cpu kernel metrics from global kernel metrics Aug 28, 2024
@Andreagit97 Andreagit97 marked this pull request as ready for review August 28, 2024 16:44
@poiana poiana requested a review from leogr August 28, 2024 16:44
@@ -193,8 +193,8 @@ TEST(kmod, metrics_v2_check_per_CPU_stats)

ssize_t num_online_CPUs = sysconf(_SC_NPROCESSORS_ONLN);

// We want to check our CPUs counters
uint32_t flags = METRICS_V2_KERNEL_COUNTERS;
// By enabling `METRICS_V2_KERNEL_COUNTERS_PER_CPU` we also enable `METRICS_V2_KERNEL_COUNTERS`
Member Author

We should definitely unify the tests for the 3 engines in some way, because we are copying and pasting the same code 3 times for all the tests; in the end, the interface is the same... BTW I'm not doing it in this PR :/

Contributor

Agree!
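
Just to sketch that unification idea (purely hypothetical helper, not code from this PR; assumes gtest plus <unistd.h> for sysconf, and that each engine's test opens its own handle before calling it):

static void check_per_cpu_metrics(scap_t* handle)
{
	ssize_t num_online_CPUs = sysconf(_SC_NPROCESSORS_ONLN);

	// Requesting only the per-CPU counters: the global ones are enabled implicitly.
	uint32_t flags = METRICS_V2_KERNEL_COUNTERS_PER_CPU;
	uint32_t nstats = 0;
	int32_t rc = 0;
	const struct metrics_v2* stats_v2 = scap_get_stats_v2(handle, flags, &nstats, &rc);
	ASSERT_EQ(rc, SCAP_SUCCESS);
	ASSERT_NE(stats_v2, nullptr);
	// At least one metric per online CPU is expected.
	ASSERT_GE(nstats, (uint32_t)num_online_CPUs);
}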

userspace/libpman/src/stats.c
@@ -302,6 +302,12 @@ int32_t scap_get_stats(scap_t* handle, scap_stats* stats)
//
const struct metrics_v2* scap_get_stats_v2(scap_t* handle, uint32_t flags, uint32_t* nstats, int32_t* rc)
{
// If we enable per-CPU counters, we also enable kernel global counters by default.
Member Author

as suggested by @FedeDP
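
A minimal sketch of that flag coercion (illustrative, not the exact diff hunk):

/* If the caller asks only for the per-CPU counters, silently enable the
 * global kernel counters too, so a single pass over the CPUs fills both.
 */
if((flags & METRICS_V2_KERNEL_COUNTERS_PER_CPU) && !(flags & METRICS_V2_KERNEL_COUNTERS))
{
	flags |= METRICS_V2_KERNEL_COUNTERS;
}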

@FedeDP
Contributor

FedeDP commented Aug 29, 2024

Left a minor suggestion, otherwise LGTM!

@Andreagit97
Member Author

We should be fine now :)

@FedeDP
Contributor

FedeDP commented Aug 31, 2024

There is a "tmp" commit still 😄

@Andreagit97
Member Author

Andreagit97 commented Sep 2, 2024

There is a "tmp" commit still 😄

reworded it, thank you!

FedeDP
FedeDP previously approved these changes Sep 2, 2024
Contributor

@FedeDP FedeDP left a comment

/approve

@poiana
Contributor

poiana commented Sep 2, 2024

LGTM label has been added.

Git tree hash: 8b865d3317e02cb4c9dddb77f2c8fda7cfae22d6

Signed-off-by: Andrea Terzolo <[email protected]>
Co-authored-by: Melissa Kilby <[email protected]>
Signed-off-by: Andrea Terzolo <[email protected]>
Co-authored-by: Melissa Kilby <[email protected]>
switch(stat)
{
case RUN_CNT:
strlcat(stats[offset].name, bpf_libbpf_stats_names[RUN_CNT], sizeof(stats[offset].name));
Contributor

Follow-up here: shouldn't this stay here, because we concatenate the name according to the switch statement?
Also, we seem to use stat both for the loop and here for the switch statement. Perhaps let's use separate wording for clarity?

Member Author

The idea is to call strlcat just once with the generic variable stat

strlcat(stats[offset].name, bpf_libbpf_stats_names[stat], sizeof(stats[offset].name));

instead of repeating the same line three times with an explicit enum value

strlcat(stats[offset].name, bpf_libbpf_stats_names[RUN_CNT], sizeof(stats[offset].name));
strlcat(stats[offset].name, bpf_libbpf_stats_names[RUN_TIME_NS], sizeof(stats[offset].name));
strlcat(stats[offset].name, bpf_libbpf_stats_names[AVG_TIME_NS], sizeof(stats[offset].name));

Also, we seem to use stat both for the loop and here for the switch statement. Perhaps let's use separate wording for clarity?

I am not sure I got this; we are using stat (the index of the array) in the switch case to select the right metric.
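
To make the shape of that code concrete, here is an illustrative sketch of the pattern (the loop bounds, the value assignments and the bpf_prog_info field names are assumptions made for the example; the single strlcat indexed by stat is the point under discussion):

/* Loop over the libbpf per-program stats; `stat` doubles as the index into
 * bpf_libbpf_stats_names, so the name suffix is concatenated exactly once. */
for(int stat = RUN_CNT; stat <= AVG_TIME_NS; stat++)
{
	strlcpy(stats[offset].name, info.name, METRIC_NAME_MAX);
	strlcat(stats[offset].name, bpf_libbpf_stats_names[stat], sizeof(stats[offset].name));

	switch(stat)
	{
	case RUN_CNT:
		stats[offset].value.u64 = info.run_cnt;       /* field name assumed */
		break;
	case RUN_TIME_NS:
		stats[offset].value.u64 = info.run_time_ns;   /* field name assumed */
		break;
	case AVG_TIME_NS:
		stats[offset].value.u64 = info.run_cnt ? info.run_time_ns / info.run_cnt : 0;
		break;
	}
	offset++;
}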

Contributor

Looked again and yes, stat is the index into bpf_libbpf_stats... I suppose there was some confusion between stat and offset. Thanks for clarifying and also for working on this.

@@ -1849,22 +1856,20 @@ const struct metrics_v2* scap_bpf_get_stats_v2(struct scap_engine_handle engine,
{
strlcpy(stats[offset].name, info.name, METRIC_NAME_MAX);
}
strlcat(stats[offset].name, bpf_libbpf_stats_names[stat], sizeof(stats[offset].name));
Contributor

[nit] Re the comment here https://github.com/falcosecurity/libs/pull/2031/files#diff-12833abd4271488260dae0ba178c6ad3f0bc63642f793a20b06ab4eb10d02cf9L1839: libbpf stats were introduced with kernel 5.1, so folks with lower kernels can't reach this code, since we check for libbpf stats being enabled.

Member Author

You are right, but since many BPF features are usually backported I'm not so confident in removing it... I found this commit 957ab1c; unfortunately, I don't remember why I added it, but I bet I had found an issue on some old machines...

Contributor

Fair, yes the backports.
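
For context, a small sketch of one way to detect whether the kernel is collecting BPF run-time stats at all (the kernel.bpf_stats_enabled sysctl exists since Linux 5.1; the actual check performed by the BPF engine may differ):

#include <stdbool.h>
#include <stdio.h>

static bool bpf_stats_enabled(void)
{
	/* This sysctl gates whether run_cnt/run_time_ns in bpf_prog_info get populated. */
	FILE* f = fopen("/proc/sys/kernel/bpf_stats_enabled", "r");
	if(f == NULL)
	{
		return false; /* file missing: kernel too old or procfs not available */
	}
	int enabled = 0;
	if(fscanf(f, "%d", &enabled) != 1)
	{
		enabled = 0;
	}
	fclose(f);
	return enabled != 0;
}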

Contributor

@incertum incertum left a comment

/approve

@poiana
Contributor

poiana commented Sep 5, 2024

LGTM label has been added.

Git tree hash: 667ee5b9ecfd8e1c9ef82047cc3269bfe7a37aac

Contributor

@FedeDP FedeDP left a comment

/approve

@poiana
Contributor

poiana commented Sep 5, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Andreagit97, FedeDP, incertum

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [Andreagit97,FedeDP,incertum]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@FedeDP
Contributor

FedeDP commented Sep 5, 2024

/unhold

@poiana poiana merged commit b632379 into falcosecurity:master Sep 5, 2024
45 of 49 checks passed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Move detailed per CPU kernel stats under a new flag METRICS_V2_KERNEL_COUNTERS_PER_CPU
4 participants