Getting Consistent Unexpected Resource Exhausted Errors #614
Thank you for supplying this fantastic library; it's been very useful to me. I'm using grpc-rs 0.12.1 with Rust 1.68.0. I have a single client, the "Controller," and 64 servers, the "Responders." The 64 Responders are spread across 8 different AWS VMs. These binaries need to run for months on end performing their tasks. To test uptime, I'm running a simple
From here, my previous run failed with the same errors here:
You can see it failed at the exact same test step, 2181, and only seconds apart in terms of total uptime. It's very suspicious that it always fails at almost exactly the same point in time. So, is there a resource being exhausted? Each of the 8 VMs has 1.10T of RAM, and you can see above that only
They all have plenty of free space.

The 64 Responder servers each start like this:

    let num_threads = 8;
    let env = Arc::new(Environment::new(num_threads));
    let world = ResponderWorld::new();
    let service = create_responder(ResponderService {
        world: Arc::new(Mutex::new(world)),
    });
    let addr = format!("0.0.0.0:{}", args.port);
    let quota = ResourceQuota::new(Some("ServerQuota")).resize_memory(1024 * 1024);
    let ch_builder = ChannelBuilder::new(env.clone())
        .set_resource_quota(quota)
        .max_concurrent_stream(8);
    let mut server = ServerBuilder::new(env)
        .register_service(service)
        .channel_args(ch_builder.build_args())
        .requests_slot_per_cq(1)
        .build()
        .unwrap();
    server
        .add_listening_port(addr.clone(), ServerCredentials::insecure())
        .unwrap();
    server.start();
    info!("listening on {addr} with {num_threads} threads...");
    loop {
        // run forever
        thread::park();
    }

The Controller client creates its 64 connections, one to each Responder, like this:

    let env = Arc::new(EnvBuilder::new().cq_count(2).build());
    let machine_id = machine_id as u32;
    let base_port = ports.get(&ip).map(|&port| port + 1).unwrap_or(5000);
    ports.insert(ip.clone(), base_port + num_devices - 1);
    let clients: Vec<(ResponderClient, u32)> = (0..num_devices)
        .map(|device_id| {
            let port = base_port + device_id;
            let address = format!("{}:{}", ip, port);
            info!("Connecting to {address} with {num_devices} devices...");
            let ch = ChannelBuilder::new(env.clone())
                .enable_retry(true)
                .connect(address.as_str());
            (ResponderClient::new(ch), port)
        })
        .collect();

I never expect more than 2 simultaneous requests to any given Responder: one to run a test and one to request the status of a particular job_id. I don't expect this is enough information to fully diagnose the issue, but I would greatly appreciate any ideas, thoughts, or potential debugging steps from anyone who has more experience with grpc-rs. Thank you!
Replies: 1 comment 3 replies
So why did you set the memory quota to 1024 * 1024 = 1 MiB? This quota is shared by all channels and calls on the server. I suggest either removing the quota or increasing it to a larger value, and then retrying.
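
To give a feel for how small 1 MiB is once it is shared, here is a rough back-of-the-envelope sketch in plain Rust (no grpc-rs calls). It assumes the quota is split evenly among in-flight calls, which is a simplification for illustration only and not a description of how gRPC actually accounts for memory. The names `MIB` and `per_call_budget` are hypothetical helpers, not part of any library.

```rust
/// One mebibyte in bytes; `1024 * 1024` is the value passed to
/// `resize_memory` in the Responder setup quoted in the question.
const MIB: usize = 1024 * 1024;

/// Rough per-call share of a shared memory quota.
/// ASSUMPTION: an even split among in-flight calls. gRPC's real accounting
/// is more involved; this only illustrates the order of magnitude.
fn per_call_budget(quota_bytes: usize, concurrent_calls: usize) -> usize {
    // Treat zero in-flight calls as one so the division is always defined.
    quota_bytes / concurrent_calls.max(1)
}

fn main() {
    // 2 = simultaneous requests the poster expects per Responder,
    // 8 = the `max_concurrent_stream(8)` setting from the question.
    for calls in [1usize, 2, 8] {
        let kib = per_call_budget(MIB, calls) / 1024;
        println!("{calls} concurrent call(s): about {kib} KiB each");
    }
}
```

Under that assumption, even the two expected simultaneous requests leave each call with only about half a mebibyte for buffers and metadata, which is why raising or removing the quota is a reasonable first thing to try.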