- Author(s): erikjoh & fabric
- Approver: markdroth
- Status: Implementation in progress
- Implemented in: Java in progress
- Last updated: 2021-12-02
- Discussion at: https://groups.google.com/g/grpc-io/c/4qycdcFfMUs
Add support for least request load balancing configured via xDS.
In multi-tenancy environments like for example Kubernetes, some upstream endpoints may perform much slower than others. When using plain round-robin load balancing, this becomes a problem since all endpoints get an equal amount of RPCs even when some endpoints perform worse. The result is increased tail latencies and in extreme cases even increased error rates. As an alternative to round-robin, Envoy provides a "least request" load balancing policy which uses per-endpoint outstanding/active request counters as a heuristic for determining load on each endpoint. The Envoy least request implementation uses two different algorithms depending on whether endpoints have non-equal weights or not. In the "all weights equal" case, Envoy samples N random available hosts and selects the one with fewest outstanding requests. In the "all weights not equal" case, Envoy switches to a weighted round robin schedule where weights are dynamically adjusted based on the outstanding requests.
This proposal builds on earlier work described in the following gRFCs:
Support LEAST_REQUEST
LB policy and configuration of it using xDS.
Specifically, gRPC will only support the "all weights equal" least request algorithm as
defined by Envoy.
The xDS Cluster
resource specifies the load balancing policy to use
when choosing the endpoint to which each request is sent. The policy to
use is specified in the lb_policy
field.
Prior to this proposal, gRPC supported two values for this field,
ROUND_ROBIN
and RING_HASH
. With this proposal, we will add support for an
additional policy, LEAST_REQUEST
.
The configuration for the Least Request LB policy is the
least_request_lb_config
field.
The field is optional; if not present, defaults will be assumed for all
of its values. gRPC will support the
choice_count
field the same way that Envoy does with the exception of allowed values
(More details further down in LB Policy Config).
The
active_request_bias
field will be ignored as it is only used in the "all weights not equal" (weighted least request)
case which will not be supported as part of this proposal.
The
slow_start_config
field will also be ignored as it requires the weighted least request algorithm.
As a result of
gRFC A42: xDS Ring Hash LB Policy,
the xds_cluster_resolver
policy implementation will need additional logic to understand the setup for LEAST_REQUEST
in addition to ROUND_ROBIN
and RING_HASH
.
For LEAST_REQUEST
, the policy name will be "LEAST_REQUEST"
, and the config will be the one for the least_request_experimental
LB Policy described below.
When the xds_cluster_resolver
policy sees "LEAST_REQUEST"
, it will also assume that the
locality-picking policy is weighted_target
and the endpoint-picking policy is least_request_experimental
.
We will implement a least_request_experimental
LB policy that uses the same
algorithm as Envoy's implementation. However, this policy will be
implemented in a non-xDS-specific way, so that it can also be used without
xDS in the future.
The least_request_experimental
policy will have the following config:
message LeastRequestLoadBalancingConfig {
uint32 choice_count = 1; // Optional, defaults to 2.
}
The choice_count
determines the number of randomly sampled subchannels which should be compared
for each pick by the picker.
Envoy currently supports any choice_count
value greater than or equal to two.
In the least_request_experimental
policy however, only an effective choice_count
in the
range [2, 10]
will be supported. If a LeastRequestLoadBalancingConfig
with a choice_count > 10
is received, the least_request_experimental
policy will set choice_count = 10
. If choice_count < 2
, the
config will be rejected.
The subchannel connectivity management implemented in the round_robin
policy will be re-used
for least_request_experimental
.
Implementations may break out the connectivity management logic and have round_robin
and least_request_experimental
inherit it instead. This would be similar to the "base balancer" already implemented in grpc-go
here.
Just like the round_robin
policy, the least_request_experimental
policy will attempt to keep a subchannel connected
to every address in the address list at all times. If a subchannel becomes disconnected, the policy will trigger
re-resolution and will attempt to reconnect to the address, subject to connection back-off.
The least_request_experimental
policy will use the same heuristic for determining the aggregated
connectivity state as the round_robin
policy.
The general rules in order of precedence are:
- If there is at least one subchannel with the state
READY
, the aggregated connectivity state isREADY
. - If there is at least one subchannel with the state
CONNECTING
orIDLE
, the aggregated connectivity state isCONNECTING
. - If all subchannels have the state
TRANSIENT_FAILURE
, the aggregated connectivity state isTRANSIENT_FAILURE
.
For the purpose of evaluating the aggregated connectivity state in the rules above, a subchannel that enters
TRANSIENT_FAILURE
should be considered to stay in TRANSIENT_FAILURE
until it reports READY
. If a subchannel is
bouncing back and forth between TRANSIENT_FAILURE
and CONNECTING
while it attempts to re-establish a connection,
we will consider it to be in state TRANSIENT_FAILURE
for the purpose of evaluating the aggregation rules above.
The least_request_experimental
policy will associate each subchannel to an outstanding request counter
(active_requests
). Counters will be local to each instance of the least_request_experimental
policy and
thereby not shared in cases where subchannels are used across multiple LB instances.
Implementations may either wrap subchannels or utilize custom subchannel attributes to provide counter access
in the picker.
These counters should additionally share the lifecycle with its corresponding subchannel.
The counter for a subchannel should be atomically incremented by one after it has been successfully
picked by the picker. Due to language-specifics, the increment may occur either before or during stream creation.
In the PickResult
for each language-specific implementation (
Java,
C++,
Go
), the picker should add a
callback for atomically decrementing the subchannel counter once the RPC finishes (regardless of Status
code).
This approach closely resembles the implementation for outstanding request counters already present in
the cluster_impl_experimental
policy (
Java,
C++,
Go
).
This approach entails some degree of raciness which will be discussed later
(See Outstanding request counter raciness).
The picker should be given a list of all subchannels with the READY
state.
With this list, the picker should pick a subchannel based on the following pseudo-code:
candidate = null;
for (int i = 0; i < choiceCount; ++i) {
sampled = subchannels[random_integer(0, subchannels.length)];
if (candidate == null) {
candidate = sampled;
continue;
}
if (sampled.active_requests < candidate.active_requests) {
candidate = sampled;
}
}
return candidate;
This pseudo-code is mirroring the Envoy implementation
In gRPC, if an address is present more than once in an address list, that is intended to indicate that the address should be more heavily weighted. However, it should not result in more than one connection to that address.
In keeping with this, the least_request_experimental
policy will de-duplicate the addresses in the address list
it receives so that it does not create more than one connection to a given address. However, because the policy
will not initially support the weighted least request algorithm, we will ignore the additional weight given to
the address. If at some point in the future we do add support for the weighted least request algorithm, then an
address that is present multiple times in the list will be given appropriate weight in that algorithm.
During initial development, this feature will be enabled via the
GRPC_EXPERIMENTAL_ENABLE_LEAST_REQUEST
environment variable. This
environment variable protection will be removed once the feature has
proven stable.
While also supporting the "all weights not equal" (Weighted Least Request) case
from Envoy
would enable more advanced use cases, there are some missing pieces in gRPC that
are required for that to be technically feasible.
For example, gRPC ignores the per-endpoint load_balancing_weight
field as a result
of the implementation described by
gRFC A27: xDS-based Global Load Balancing.
Later in the future when ORCA
is supported, gRPC may be able to locally provide per-endpoint weights that could be
used in a weighted least request policy implementation.
One possible downside of the Envoy implementation, is that the random selection will allow a subchannel to be sampled multiple times and thereby be unavoidably picked with some probability. This means that even a frozen endpoint would never fully be excluded by the picker. There exists a closed PR to Envoy that attempted to solve this. In that PR, there were some questions around whether there would be any significant improvement for >5 endpoints.
While reading the outstanding request counters of samples in the picker, previously read values may become outdated.
For cases where a high choice_count
relative the number of endpoints is used, and/or cases with high/bursty traffic,
there may be sub-optimal behaviors where certain endpoints may get an unfair amount of traffic.
This raciness is currently accepted in the Envoy implementation.
The fix for this raciness would most likely have to be a limitation on picker concurrency.
Since such a limitation could have a serious negative impact on more normal load pattern situations,
it may be argued that it's better to accept some degree of raciness in the picker instead.
In gRPC, there are cases where LB config cannot be trusted. This means that config may contain an unreasonable
(for example UINT_MAX
) value for the choice_count
setting.
Thereby, it would be beneficial for gRPC to introduce a max value for the setting.
The default value choice_count = 2
comes from the P2C (power of two choices) paper
(Mitzenmacher et al.
which shows that such a configuration nearly is as good as an O(n) full scan.
With that in mind, picking a choice_count > 2
would probably not add much benefit in many cases.
By introducing a limit as 2 <= choice_count <= 10
, we disallow potentially dangerous LB configurations
while still allowing some degree of freedom that could be beneficial in some edge-cases.
Spotify is able to contribute towards a Java implementation. Implementation of remaining gRPC languages is left for respective gRPC team or any other contributors.