Getting Consistent Unexpected Resource Exhausted Errors #614

xanderdunn · 2023-03-22T17:59:48Z

Thank you for supplying this fantastic library, it's been very useful to me. I'm using grpc-rs 0.12.1. Rust 1.68.0.

I have a single client, the "Controller," and I have 64 servers, the "Responders." These 64 Responders are across 8 different AWS VMs. These binaries will need to run for months on end performing their tasks. To test uptime, I'm running a simple test function in a loop, waiting for each test to finish before moving on to the next test. However, I consistently see it fail at approximately the same uptime and the same test step:

2023-03-22T12:48:07.640632Z  INFO controller: TEST RUN 2180 SUCCESS!
2023-03-22T12:48:07.738202Z  INFO controller: RAM usage: 0.57% (6900016 bytes) | Disk usage of `/` mounted disk: 34.64%
2023-03-22T12:48:07.745171Z  INFO controller: Uptime: 00d:05h:41m:06s
2023-03-22T12:48:07.745223Z  INFO controller: TEST RUN 2181...
2023-03-22T12:48:07.846348Z  INFO controller::world: 0 / 64 ranks have succeeded, 0 have failed...
2023-03-22T12:48:10.849490Z ERROR responder::responder_service: failed to reply job_id: "dba1ebe5-1ea8-494a-aa56-c156175d28fc": RemoteStopped
2023-03-22T12:48:10.849738Z ERROR controller::world: RPC call failed with error: RpcFailure(RpcStatus { code: 8-RESOURCE_EXHAUSTED, message: "Received RST_STREAM with error code 11", details: [], debug_error_string: "{\"created\":\"@1679489290.849549587\",\"description\":\"Error received from peer ipv4:127.0.0.1:5004\",\"file\":\"/home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.12.1+1.46.5-patched/grpc/src/core/lib/surface/call.cc\",\"file_line\":952,\"grpc_message\":\"Received RST_STREAM with error code 11\",\"grpc_status\":8}" })
2023-03-22T12:48:10.854610Z ERROR controller::world: RPC call failed with error: RpcFailure(RpcStatus { code: 8-RESOURCE_EXHAUSTED, message: "Received RST_STREAM with error code 11", details: [], debug_error_string: "{\"created\":\"@1679489290.854198401\",\"description\":\"Error received from peer ipv4:172.31.34.235:5003\",\"file\":\"/home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.12.1+1.46.5-patched/grpc/src/core/lib/surface/call.cc\",\"file_line\":952,\"grpc_message\":\"Received RST_STREAM with error code 11\",\"grpc_status\":8}" })
2023-03-22T12:48:10.870606Z ERROR controller::world: RPC call failed with error: RpcFailure(RpcStatus { code: 8-RESOURCE_EXHAUSTED, message: "Received RST_STREAM with error code 11", details: [], debug_error_string: "{\"created\":\"@1679489290.870476443\",\"description\":\"Error received from peer ipv4:172.31.43.85:5006\",\"file\":\"/home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.12.1+1.46.5-patched/grpc/src/core/lib/surface/call.cc\",\"file_line\":952,\"grpc_message\":\"Received RST_STREAM with error code 11\",\"grpc_status\":8}" })
2023-03-22T12:48:10.874474Z ERROR controller::world: RPC call failed with error: RpcFailure(RpcStatus { code: 8-RESOURCE_EXHAUSTED, message: "Received RST_STREAM with error code 11", details: [], debug_error_string: "{\"created\":\"@1679489290.874308616\",\"description\":\"Error received from peer ipv4:172.31.37.200:5003\",\"file\":\"/home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.12.1+1.46.5-patched/grpc/src/core/lib/surface/call.cc\",\"file_line\":952,\"grpc_message\":\"Received RST_STREAM with error code 11\",\"grpc_status\":8}" })
2023-03-22T12:48:10.875536Z ERROR controller::world: RPC call failed with error: RpcFailure(RpcStatus { code: 8-RESOURCE_EXHAUSTED, message: "Received RST_STREAM with error code 11", details: [], debug_error_string: "{\"created\":\"@1679489290.875404491\",\"description\":\"Error received from peer ipv4:172.31.37.200:5004\",\"file\":\"/home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.12.1+1.46.5-patched/grpc/src/core/lib/surface/call.cc\",\"file_line\":952,\"grpc_message\":\"Received RST_STREAM with error code 11\",\"grpc_status\":8}" })
2023-03-22T12:48:10.876781Z ERROR controller::world: RPC call failed with error: RpcFailure(RpcStatus { code: 8-RESOURCE_EXHAUSTED, message: "Received RST_STREAM with error code 11", details: [], debug_error_string: "{\"created\":\"@1679489290.876578371\",\"description\":\"Error received from peer ipv4:172.31.37.200:5005\",\"file\":\"/home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.12.1+1.46.5-patched/grpc/src/core/lib/surface/call.cc\",\"file_line\":952,\"grpc_message\":\"Received RST_STREAM with error code 11\",\"grpc_status\":8}" })
2023-03-22T12:48:10.877992Z ERROR controller::world: RPC call failed with error: RpcFailure(RpcStatus { code: 8-RESOURCE_EXHAUSTED, message: "Received RST_STREAM with error code 11", details: [], debug_error_string: "{\"created\":\"@1679489290.877910972\",\"description\":\"Error received from peer ipv4:172.31.37.200:5007\",\"file\":\"/home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.12.1+1.46.5-patched/grpc/src/core/lib/surface/call.cc\",\"file_line\":952,\"grpc_message\":\"Received RST_STREAM with error code 11\",\"grpc_status\":8}" })

The gRPC status code documentation says RESOURCE_EXHAUSTED indicates "Some resource has been exhausted, perhaps a per-user quota, or perhaps the entire file system is out of space." The message itself says "Received RST_STREAM with error code 11". Error code 11 (0xb) seems to be the HTTP/2 ENHANCE_YOUR_CALM error, which is supposed to mean that the server is being overloaded with requests?
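For anyone decoding a similar RST_STREAM message, here is a minimal sketch of the HTTP/2 error code table from RFC 7540, section 7. The function name `http2_error_name` is hypothetical and not part of grpc-rs; it only illustrates how code 11 maps to ENHANCE_YOUR_CALM.

```rust
/// Map an HTTP/2 error code (as carried in RST_STREAM and GOAWAY frames)
/// to its name in RFC 7540, section 7.
/// `http2_error_name` is an illustrative helper, not a grpc-rs API.
fn http2_error_name(code: u32) -> &'static str {
    match code {
        0x0 => "NO_ERROR",
        0x1 => "PROTOCOL_ERROR",
        0x2 => "INTERNAL_ERROR",
        0x3 => "FLOW_CONTROL_ERROR",
        0x4 => "SETTINGS_TIMEOUT",
        0x5 => "STREAM_CLOSED",
        0x6 => "FRAME_SIZE_ERROR",
        0x7 => "REFUSED_STREAM",
        0x8 => "CANCEL",
        0x9 => "COMPRESSION_ERROR",
        0xa => "CONNECT_ERROR",
        0xb => "ENHANCE_YOUR_CALM",
        0xc => "INADEQUATE_SECURITY",
        0xd => "HTTP_1_1_REQUIRED",
        // Codes outside the RFC 7540 table are reported as unknown here.
        _ => "UNKNOWN",
    }
}

fn main() {
    // The logs above report "RST_STREAM with error code 11".
    println!("error code 11 = {}", http2_error_name(11));
}
```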

My previous run failed with the same errors here:

2023-03-22T05:58:58.927404Z  INFO controller: Uptime: 00d:05h:40m:47s
2023-03-22T05:58:58.927457Z  INFO controller: TEST RUN 2181...

You can see it failed at the exact same test step, 2181, and only seconds apart in terms of total uptime. It's very suspicious that it always fails at almost exactly the same point in time.
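To put a number on "only seconds apart", here is a small sketch that parses the two uptime strings from the logs above and compares them. The helper `uptime_to_secs` is hypothetical and assumes the `DDd:HHh:MMm:SSs` format shown in the Controller output.

```rust
/// Parse an uptime string such as "00d:05h:41m:06s" into total seconds.
/// Returns None if the string does not match the expected
/// `DDd:HHh:MMm:SSs` layout. (Illustrative helper, not from the
/// original Controller code.)
fn uptime_to_secs(s: &str) -> Option<u64> {
    // Expected unit suffix and its length in seconds, in field order.
    let units = [('d', 86_400u64), ('h', 3_600), ('m', 60), ('s', 1)];
    let parts: Vec<&str> = s.split(':').collect();
    if parts.len() != units.len() {
        return None;
    }
    let mut total = 0u64;
    for (part, (suffix, factor)) in parts.iter().zip(units.iter()) {
        // Strip the unit letter, then parse the remaining digits.
        let digits = part.strip_suffix(*suffix)?;
        total += digits.parse::<u64>().ok()? * factor;
    }
    Some(total)
}

fn main() {
    let latest = uptime_to_secs("00d:05h:41m:06s").unwrap();
    let previous = uptime_to_secs("00d:05h:40m:47s").unwrap();
    println!(
        "failures at {latest}s and {previous}s of uptime, {}s apart",
        latest.abs_diff(previous)
    );
}
```

With the two uptimes quoted in this post, the failures land 19 seconds apart after roughly 5 hours 40 minutes of uptime.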

So, is there a resource being exhausted? Each of the 8 VMs has 1.10T of RAM, and you can see above that only 0.57% (6900016 bytes) is in use, which includes the Controller and 8 Responders. I checked disk usage on each machine, which has a 134GB root drive:

Filesystem       Size  Used Avail Use% Mounted on
/dev/root        134G   42G   92G  32% /

They all have plenty of free space.

The 64 Responder servers each start like this:

    let num_threads = 8;
    let env = Arc::new(Environment::new(num_threads));
    let world = ResponderWorld::new();
    let service = create_responder(ResponderService {
        world: Arc::new(Mutex::new(world)),
    });
    let addr = format!("0.0.0.0:{}", args.port);

    let quota = ResourceQuota::new(Some("ServerQuota")).resize_memory(1024 * 1024);
    let ch_builder = ChannelBuilder::new(env.clone())
        .set_resource_quota(quota)
        .max_concurrent_stream(8);

    let mut server = ServerBuilder::new(env)
        .register_service(service)
        .channel_args(ch_builder.build_args())
        .requests_slot_per_cq(1)
        .build()
        .unwrap();
    server
        .add_listening_port(addr.clone(), ServerCredentials::insecure())
        .unwrap();
    server.start();
    info!("listening on {addr} with {num_threads} threads...");
    loop {
        // run forever
        thread::park();
    }

The Controller client creates its 64 connections, one to each Responder, like this:

    let env = Arc::new(EnvBuilder::new().cq_count(2).build());
    let machine_id = machine_id as u32;
    let base_port = ports.get(&ip).map(|&port| port + 1).unwrap_or(5000);
    ports.insert(ip.clone(), base_port + num_devices - 1);

    let clients: Vec<(ResponderClient, u32)> = (0..num_devices)
        .map(|device_id| {
            let port = base_port + device_id;
            let address = format!("{}:{}", ip, port);
            info!("Connecting to {address} with {num_devices} devices...");
            let ch = ChannelBuilder::new(env.clone())
                .enable_retry(true)
                .connect(address.as_str());
            (ResponderClient::new(ch), port)
        })
        .collect();

I never expect more than 2 simultaneous requests to any given Responder: one to run a test and one to request the status of a particular job_id.

I don't expect this is enough information to fully diagnose the issue, but I would greatly appreciate any ideas, thoughts, or potential debugging steps from anyone who has more experience with grpc-rs. Thank you!

Answered by BusyJay

BusyJay · 2023-03-23T11:39:16Z

Maintainer

So why did you set the memory quota to 1024 * 1024 = 1 MiB? This quota is shared by all channels and calls. I suggest either removing the quota or increasing it to a larger value, then retrying.

3 replies

xanderdunn Mar 23, 2023
Author

Thanks! There's no good reason for that quota. I just copy-pasted the code from the helloworld example. I'll remove this and retry. In the absence of any defined quota, is the server's memory usage unbounded, or is there a default limit?

BusyJay Mar 24, 2023
Maintainer

There is a default quota, and it is almost unbounded.

xanderdunn Mar 24, 2023
Author

That's all it was! My test runs indefinitely now, with no issues with sudden failures around 5hr:40min of uptime. Thank you!
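For reference, here is a sketch of what the Responder setup from the original post might look like with the explicit 1 MiB `ResourceQuota` removed. It assumes grpcio 0.12 and reuses only the calls that appear in the post. The wrapper function `build_responder_server` and its parameters are hypothetical, and the `service` argument stands for whatever `create_responder(...)` returns in the generated code.

```rust
use std::sync::Arc;

use grpcio::{ChannelBuilder, Environment, Server, ServerBuilder, ServerCredentials, Service};

/// Hypothetical helper: the same server setup as in the original post, but
/// without `set_resource_quota(...)`, so gRPC uses its default quota.
fn build_responder_server(
    service: Service,
    port: u16,
    num_threads: usize,
) -> grpcio::Result<Server> {
    let env = Arc::new(Environment::new(num_threads));

    // Channel arguments: only the stream limit is kept. The 1 MiB memory
    // quota that was copied from the helloworld example is gone.
    let ch_builder = ChannelBuilder::new(env.clone()).max_concurrent_stream(8);

    let mut server = ServerBuilder::new(env)
        .register_service(service)
        .channel_args(ch_builder.build_args())
        .requests_slot_per_cq(1)
        .build()?;
    server.add_listening_port(format!("0.0.0.0:{port}"), ServerCredentials::insecure())?;
    Ok(server)
}
```

The caller would still need to call `server.start()` and keep the returned `Server` alive, as in the original loop with `thread::park()`. If a bound on memory is wanted, the other option mentioned above is to keep the quota but pass a much larger value to `resize_memory`.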
