

async redis cluster should use initial startup nodes during reinitialization in case of failover #2472

Open
artesby opened this issue Nov 26, 2022 · 5 comments · May be fixed by #3447

@artesby commented Nov 26, 2022

Version:
redis-py: 4.3.3
redis: 7.0.2

Platform:
python 3.7

Description:
Async implementation of RedisCluster overwrites initial startup nodes during first initialization. This is caused by this line.

When the cluster is deployed in a dynamic environment like Kubernetes, your shards may occasionally change their IP addresses.
This can leave the Python client with a pool of invalid (outdated) startup nodes, unable to reinitialize, since NodesManager stores raw IP addresses.

For example, say we have a cluster with 3 masters in k8s and use the Python client like this:

from redis.asyncio import RedisCluster

client = RedisCluster.from_url('redis://redis-cluster-headless:6379/0')
print(client.nodes_manager.startup_nodes)

# outputs {'redis-cluster-headless:6379': [host=redis-cluster-headless, port=6379, name=redis-cluster-headless:6379, server_type=None]}

await client # initialization
print(client.nodes_manager.startup_nodes)

# outputs something like this
# {'<ip1>:6379': [host=<ip1>, port=6379, name=<ip1>:6379, server_type=primary],
#  '<ip2>:6379': [host=<ip2>, port=6379, name=<ip2>:6379, server_type=primary],
#  '<ip3>:6379': [host=<ip3>, port=6379, name=<ip3>:6379, server_type=primary]}

Now imagine your Python client sits idle for some time without requests, and during that period Kubernetes has recreated each Redis container for whatever reason (a crash, or a move to another node by the autoscaler, for example).

The next time the Python client performs a Redis call, it will find that all ClusterNodes are unavailable, try to reinitialize the NodesManager, and end up with no reachable node from which to fetch the new cluster configuration.
This would not be an issue if the initial headless URL were still in the pool of startup nodes.

The sync implementation of RedisCluster preserves the initial node address exactly as suggested above, so I believe the async implementation should behave the same way.

For now I use the following patch, but I can't call it an elegant workaround; the problem should properly be fixed inside NodesManager.

from redis.asyncio import RedisCluster

class AsyncRedisCluster(RedisCluster):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Remember the seed addresses passed by the caller, before
        # initialize() replaces them with discovered node IPs.
        self._initial_startup_nodes = dict(self.nodes_manager.startup_nodes)

    async def initialize(self):
        rv = await super().initialize()
        # Re-add the seed addresses so a stable hostname always remains
        # available for reinitialization after a full-cluster IP change.
        self.nodes_manager.startup_nodes.update(self._initial_startup_nodes)
        return rv

P.S. The sync implementation has the opposite problem: NodesManager doesn't drop unused addresses from the startup_nodes storage, so it may grow indefinitely as your containers change IPs.
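To make the intended behavior concrete, here is a minimal sketch (plain dicts standing in for ClusterNode objects; all names are hypothetical, not redis-py API) of what both clients arguably should do on reinitialization: replace previously discovered node IPs with the fresh topology, while always retaining the caller's seed addresses.

```python
def refresh_startup_nodes(startup_nodes, initial_nodes, discovered_nodes):
    """Build the new startup-node pool for a topology refresh.

    Stale discovered IPs are dropped (fixing the sync client's unbounded
    growth), while the initial seed addresses always survive (fixing the
    async client's loss of the stable hostname).
    """
    refreshed = dict(discovered_nodes)  # fresh topology only; stale IPs gone
    refreshed.update(initial_nodes)     # seed addresses are always kept
    return refreshed

initial = {"redis-cluster-headless:6379": "seed"}
old_pool = {**initial, "10.0.0.1:6379": "primary", "10.0.0.2:6379": "primary"}
new_topology = {"10.0.0.9:6379": "primary", "10.0.0.10:6379": "primary"}

pool = refresh_startup_nodes(old_pool, initial, new_topology)
# seed hostname kept, stale IPs dropped, new IPs present
```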

@utkarshgupta137 (Contributor)

I'll be happy to review any PRs with a fix.

@jiyeonseo

Hi! Is there any update on this? I hit the same issue using AWS MemoryDB.
I got the exception below (the same one mentioned above) when all the nodes went down and recovered.

Redis Cluster cannot be connected. Please provide at least one reachable node: None

I thought reinitializing would help, but it did not.

@shuyingz commented Feb 20, 2023

Hi! I got the same issue when using AWS Redis, with exactly the same exception as jiyeonseo posted:

"redis_error": "Redis errors out with exception: Redis Cluster cannot be connected. Please provide at least one reachable node: None"

Are there any updates on this issue? Thanks!

@thundergolfer

Heads up that the "this line" link in the original description pointed at main and has thus gone stale.

Given the state of the repository at the time the issue was created, the line in question is

self.set_nodes(self.startup_nodes, self.nodes_cache, remove_old=True)

@andr-c commented May 21, 2024

Adding dynamic_startup_nodes=False to the cluster constructor seems to solve the problem, at least for the sync client.
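For reference, a minimal sketch of that workaround (hostname and port are placeholders; the sync client connects on construction, so this assumes a reachable cluster):

```python
from redis import RedisCluster

# With dynamic_startup_nodes=False, NodesManager keeps the seed address
# given here instead of replacing it with the IPs discovered from the
# cluster topology, so reinitialization can fall back to the stable
# hostname after every node's IP has changed.
client = RedisCluster(
    host="redis-cluster-headless",  # placeholder seed hostname
    port=6379,
    dynamic_startup_nodes=False,
)
```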

@Kakadus linked a pull request Nov 28, 2024 that will close this issue.