use performMajorGC #257
Comments
Hm. I'm not quite sure what to make of this, since the choice of …
The original issue was that "fast" functions took an extremely long time to benchmark; the ticket should have some examples. If I remember correctly, this blog post served as the main source of inspiration: https://blog.janestreet.com/core_bench-micro-benchmarking-for-ocaml/ The gist of the argument is that the GC will always affect the numbers, and the goal is to collect enough stats to mitigate the effect.
In #187 both pre-benchmark … The claim in #187 that GC is performed 4 times per sample is a red flag. Only once should be enough for time measurement, and that one can be major. EDIT: There were two GCs in …
I'm also confused why the reported times in #187 are different before and after, if the claim is that GC affects only the performance of the GC suite, not the performance of the benchmarks themselves!

Before:
…

After:
…

What am I missing?
I also want to remind that setting the nursery size is very significant. For example, running … With a large nursery, virtually every major GC would come … Using major GC makes criterion spend more time in GC. I'm still puzzled why using major GC before measurements …

Default nursery size, with minor GC:
…

With major GC:
…

-A64m, with minor GC:
…

With major GC:
…
First benchmark issue: looks like if the first benchmark is of the "right" size, … EDIT: Doesn't seem to matter whether I do minor or major GC before the measurement. With …

(The shape of … ) With …
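The effect of the nursery size discussed above can be observed directly by counting collections around a fixed allocation loop. This is a standalone sketch, not part of criterion: run it with `+RTS -T` (so the RTS keeps stats), once with the default nursery and once with e.g. `-A64m`, and compare the counts.

```haskell
import Control.Monad (replicateM_)
import Data.IORef (modifyIORef', newIORef)
import GHC.Stats (gcs, getRTSStats, major_gcs)

main :: IO ()
main = do
  before <- getRTSStats
  ref <- newIORef (0 :: Int)
  -- A steady-allocation workload: each strict increment allocates a
  -- fresh boxed Int, so the loop fills the nursery at a constant rate.
  replicateM_ 1000000 (modifyIORef' ref (+ 1))
  after <- getRTSStats
  putStrLn ("GCs during loop:       " ++ show (gcs after - gcs before))
  putStrLn ("major GCs during loop: " ++ show (major_gcs after - major_gcs before))
```

With a larger `-A`, the minor-GC count for the same loop should drop sharply, which is exactly why the "right" nursery size depends on the benchmarked function's allocation rate.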
Yes, this definitely sounds correct. Although based on your samples it seems like even putting a major GC before every loop doesn't completely eliminate the issue.
I would guess that at the time I interpreted < 1ns as within the "noise" range of measurement.
To some extent, allocation + GC time are a component of a given function's performance. The benchmark captures that by letting GCs occur naturally and then smoothing out the curve to include them. The benchmark collects a certain threshold of acceptable measurements by progressively increasing the number of executions per measurement: criterion/Criterion/Measurement.hs, line 220 in 52ef4a7.
If your function takes 5ms to run, then the benchmark will only include measurements comprising ≥ 6 executions (> 30ms total). It will continue increasing the execution count until the sum of all the valid measurements is > 300ms (10 × 30ms). What this means is that above a certain ratio of allocations to time, you'll always trigger GCs. Increasing the nursery size raises the minimum ratio that triggers GCs, but the "right" value depends on your function.
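The sampling policy described above can be sketched as follows. This is a simplified, hypothetical model, not criterion's actual code (criterion grows the execution count more gradually); the 30ms floor and 300ms budget are the figures quoted above.

```haskell
import Control.Monad (replicateM_)
import System.CPUTime (getCPUTime)

-- Hypothetical stand-in for the function under test.
work :: IO ()
work = replicateM_ 1000 (pure ())

-- Time one measurement of n executions, in seconds of CPU time.
measure :: Int -> IO Double
measure n = do
  start <- getCPUTime
  replicateM_ n work
  end <- getCPUTime
  pure (fromIntegral (end - start) / 1e12)  -- getCPUTime is in picoseconds

-- Grow the per-measurement execution count, discard measurements below
-- the per-measurement floor (~30ms in criterion), and stop once the
-- kept measurements sum to the overall budget (~10x the floor).
collect :: Double -> Double -> IO [(Int, Double)]
collect floorSec budget = go 1 0 []
  where
    go n total kept
      | total >= budget = pure (reverse kept)
      | otherwise = do
          t <- measure n
          if t >= floorSec
            then go (n * 2) (total + t) ((n, t) : kept)
            else go (n * 2) total kept
```

The consequence for the GC discussion is visible in the structure: a fast, allocating function only produces "valid" measurements at large `n`, by which point the loop has allocated enough to force collections inside the timed region.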
Looks like I left a related comment when looking into the stats available from the GC result: #187 (comment). I think it is possible to extract mutator time + GC time to potentially eliminate GC from the measurement, but I'm not sure that gives the best picture of performance. I believe we've traded precision (consistent, mutator-only measurements) for accuracy (slightly less precise, but adding in a measured impact from GC). The original concern was that the lack of major GCs was skewing the numbers, but your recent results are showing that it might be an issue of earlier benchmarks affecting later ones? Is that the current state of the investigation?
I disagree.
Allocation is indeed a component of a function's performance, and I don't think it makes sense to try to isolate that (allocation rate and time are correlated). However, whether GC occurs or not is a global property of the program and the RTS settings. So I interpret this as …
I'm remembering now that the GC stats are presented as a diff since the last one. So the pre-loop GC acts as a baseline, and the post-loop GC is used to update GC stats. What's the benefit of skipping those minor GCs? Just the benchmark running faster, or are you expecting the results to be affected?
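The bracketing described here can be sketched like this. It is a minimal model, not criterion's actual code; it requires running with `+RTS -T` so that `getRTSStats` works, and uses minor GCs where criterion's choice of collection is the very point under debate.

```haskell
import Data.Word (Word32)
import GHC.Stats (gcs, getRTSStats, major_gcs)
import System.Mem (performMinorGC)

-- Count (all GCs, major GCs) attributed to an action: a GC before the
-- action establishes the baseline, a GC afterwards folds the action's
-- remaining garbage into the stats, and the result is the diff.
gcDiff :: IO () -> IO (Word32, Word32)
gcDiff act = do
  performMinorGC
  before <- getRTSStats
  act
  performMinorGC
  after <- getRTSStats
  pure (gcs after - gcs before, major_gcs after - major_gcs before)
```

Skipping the two bracketing collections would make the diff attribute some of the previous benchmark's garbage to the current one, which is the trade-off being discussed.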
I expect the results to be different, as GCs will become less deterministic, but that is good, isn't it? You said …

We shouldn't favor small inputs (which may run without needing a single minor GC), and should instead let them (non-deterministically) have some too.
Yeah, I suppose that is consistent with the rest of the thinking.
In 1.4.0.0, `performGC` was changed to `performMinorGC`. While that is true, it also "destroys" the accuracy of somewhat slower functions: when `measure` (which allocates a relatively big `Measure` object on each iteration) runs for enough iterations, it triggers a GC which must be major, and that run becomes a massive outlier. IMHO, it's simpler to just accept that GC has to be run; this is not the right place to optimize for performance.
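For reference, the three entry points from base's `System.Mem` that this whole thread revolves around; in current versions of base, `performGC` is a synonym for `performMajorGC`.

```haskell
import System.Mem (performGC, performMajorGC, performMinorGC)

main :: IO ()
main = do
  performMinorGC  -- collect only the youngest generation (the nursery)
  performMajorGC  -- full collection across all generations
  performGC       -- alias for performMajorGC
```

The 1.4.0.0 change discussed above swapped the first call in for the last in criterion's measurement loop, trading the cost of a full collection per sample for the outlier risk described here.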