Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: fix buffer release timing in processUnaryRPC #7998

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

lqs
Copy link

@lqs lqs commented Jan 10, 2025

Previously, the buffer d was held for the entire duration of the handler execution in processUnaryRPC, leading to unnecessary memory retention. This PR adjusts the code to release the buffer immediately after the decoding function df completes, ensuring memory is freed as soon as it is no longer needed.

As a result, the workaround in #7972 to adjust the buffer size is no longer required, as this fix resolves the underlying issue.

RELEASE NOTES:

  • grpc: The buffer used for decoding messages in unary RPCs is released after decoding. Earlier it was held till RPC completion.

Copy link

codecov bot commented Jan 10, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.14%. Comparing base (d118866) to head (018e208).
Report is 6 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7998      +/-   ##
==========================================
+ Coverage   81.94%   82.14%   +0.19%     
==========================================
  Files         381      381              
  Lines       38539    38548       +9     
==========================================
+ Hits        31582    31664      +82     
+ Misses       5626     5579      -47     
+ Partials     1331     1305      -26     
Files with missing lines Coverage Δ
server.go 82.71% <100.00%> (+0.26%) ⬆️

... and 21 files with indirect coverage changes

Copy link
Contributor

@arjan-bal arjan-bal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice catch!

server.go Outdated
df := func(v any) error {
defer d.Free()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified that the last reference to d is in the closure, so the change is safe.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dfawley, the decoding func is called by generated code. The present version of the generated code always calls this in the first couple of lines.

Do you think there is a chance that it may not get called? If so, should we wrap d.Free() in a OnceFunc and defer call it outside the closure to be extra cautious?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can't guarantee that the decode function would be called by someone using the generic API to register their own handler, instead of the generated code. They could have their own interceptor-like pattern that checks headers and doesn't call the decoder if that fails.

So yes, I think for correctness reasons we must defer a OnceFunc-wrapped d.Free(). It's not clear this would be a win for all cases. @lqs can you run our benchmarks before/after that change with 1-byte request/response sizes so that we can be sure it doesn't cause any surprises?

go run benchmark/benchmain/main.go -benchtime=10s -workloads=all \
-compression=gzip -maxConcurrentCalls=1 -trace=off \
-reqSizeBytes=1,1048576 -respSizeBytes=1,1048576 -networkMode=Local \
-cpuProfile=cpuProf -memProfile=memProf -memProfileRate=10000 -resultFile=result
As a suggestion, when creating a branch, you can run this benchmark and save the result
file "-resultFile=basePerf", and later when you at the middle of the work or finish the
work, you can get the benchmark result and compare it with the base anytime.
Assume there are two result files names as "basePerf" and "curPerf" created by adding
-resultFile=basePerf and -resultFile=curPerf.
To format the curPerf, run:
go run benchmark/benchresult/main.go curPerf
To observe how the performance changes based on a base result, run:
go run benchmark/benchresult/main.go basePerf curPerf

Something like this would run just the set of conditions that should matter here:

go run benchmark/benchmain/main.go -benchtime=60s -workloads=unary \ 
 	  -compression=off -maxConcurrentCalls=100 -trace=off \ 
 	  -reqSizeBytes=1 -respSizeBytes=1 -networkMode=Local -resultFile=[BEFORE/AFTER]

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are the benchmark results, taken on a Ryzen 9 7945HX CPU. Each test group was run multiple times, and the fastest result was selected.

before

	df := func(v any) error {
		defer d.Free()
$ ~/sdk/go1.23.4/bin/go run benchmark/benchmain/main.go -benchtime=60s -workloads=unary -compression=off -maxConcurrentCalls=100 -trace=off -reqSizeBytes=1 -respSizeBytes=1 -networkMode=Local -resultFile=before
go1.23.4/grpc1.71.0-dev
unary-networkMode_Local-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_100-reqSize_1B-respSize_1B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_nil-sharedWriteBuffer_false:
50_Latency: 530.3120µs	90_Latency: 747.7040µs	99_Latency: 1405.5230µs	Avg_Latency: 581.5540µs	Bytes/op: 10153.135931747744	Allocs/op: 178.97030961211047
Histogram (unit: µs)
Count: 10295689  Min: 110.7  Max: 41317.2  Avg: 581.55
------------------------------------------------------------
[     110.655000,      110.656000)         1    0.0%    0.0%
[     110.656000,      110.662016)         0    0.0%    0.0%
[     110.662016,      110.704228)         0    0.0%    0.0%
[     110.704228,      111.000400)         0    0.0%    0.0%
[     111.000400,      113.078425)         0    0.0%    0.0%
[     113.078425,      127.658446)         0    0.0%    0.0%
[     127.658446,      229.956072)       164    0.0%    0.0%
[     229.956072,      947.705647)   9740824   94.6%   94.6%  #########
[     947.705647,     5983.643190)    554125    5.4%  100.0%  #
[    5983.643190,    41317.230000)       574    0.0%  100.0%
[   41317.230000,   289227.841760)         1    0.0%  100.0%
Number of requests:  10295689	Request throughput:  1.3727585333333334e+06 bit/s
Number of responses: 10295689	Response throughput: 1.3727585333333334e+06 bit/s

after

	dataFree := grpcsync.OnceFunc(d.Free)
	defer dataFree()
	df := func(v any) error {
		defer dataFree()
$ ~/sdk/go1.23.4/bin/go run benchmark/benchmain/main.go -benchtime=60s -workloads=unary -compression=off -maxConcurrentCalls=100 -trace=off -reqSizeBytes=1 -respSizeBytes=1 -networkMode=Local -resultFile=after
go1.23.4/grpc1.71.0-dev
unary-networkMode_Local-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_100-reqSize_1B-respSize_1B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_nil-sharedWriteBuffer_false:
50_Latency: 533.2280µs	90_Latency: 744.5700µs	99_Latency: 1415.7970µs	Avg_Latency: 584.1710µs	Bytes/op: 10225.500515690499	Allocs/op: 181.97781262126145
Histogram (unit: µs)
Count: 10246456  Min:  64.9  Max: 39638.3  Avg: 584.17
------------------------------------------------------------
[      64.876000,       64.877000)         1    0.0%    0.0%
[      64.877000,       64.882985)         0    0.0%    0.0%
[      64.882985,       64.924788)         0    0.0%    0.0%
[      64.924788,       65.216775)         0    0.0%    0.0%
[      65.216775,       67.256256)         0    0.0%    0.0%
[      67.256256,       81.501688)         3    0.0%    0.0%
[      81.501688,      181.003628)       142    0.0%    0.0%
[     181.003628,      876.007898)   9584488   93.5%   93.5%  #########
[     876.007898,     5730.495526)    661038    6.5%  100.0%  #
[    5730.495526,    39638.273000)       783    0.0%  100.0%
[   39638.273000,   276478.380825)         1    0.0%  100.0%
Number of requests:  10246456	Request throughput:  1.3661941333333333e+06 bit/s
Number of responses: 10246456	Response throughput: 1.3661941333333333e+06 bit/s

comparison

$ ~/sdk/go1.23.4/bin/go run benchmark/benchresult/main.go before after
unary-networkMode_Local-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_100-reqSize_1B-respSize_1B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_nil-sharedWriteBuffer_false
               Title       Before        After Percentage
            TotalOps     10295689     10246456    -0.48%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     10153.14     10225.50     0.71%
           Allocs/op       178.97       181.98     1.68%
             ReqT/op   1372758.53   1366194.13    -0.48%
            RespT/op   1372758.53   1366194.13    -0.48%
            50th-Lat    530.312µs    533.228µs     0.55%
            90th-Lat    747.704µs     744.57µs    -0.42%
            99th-Lat   1.405523ms   1.415797ms     0.73%
             Avg-Lat    581.554µs    584.171µs     0.45%
           GoVersion     go1.23.4     go1.23.4
         GrpcVersion   1.71.0-dev   1.71.0-dev

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please update this PR with the OnceFunc changes (which I believe we need for "correctness"*)? The results you shared aren't bad but also aren't great, but I'd like to experiment a bit.

* - Technically there is no correctness problem per-se, just re-use of memory that is lost. In theory we could do the change without the OnceFunc, but then the memory wouldn't get reused if the buffer wasn't decoded, which feels worse.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated with the OnceFunc in the latest commit.

@arjan-bal arjan-bal added the Type: Performance Performance improvements (CPU, network, memory, etc) label Jan 10, 2025
@arjan-bal arjan-bal added this to the 1.70 Release milestone Jan 10, 2025
@arjan-bal arjan-bal added Type: Bug and removed Type: Performance Performance improvements (CPU, network, memory, etc) labels Jan 10, 2025
@arjan-bal arjan-bal requested a review from dfawley January 10, 2025 10:01
@dfawley dfawley assigned lqs and unassigned dfawley Jan 10, 2025
@arjan-bal arjan-bal assigned dfawley and unassigned lqs Jan 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants