running tictac_aae in non-native mode leads to eventual failure of node #1765
I'm going to have a look at this from tomorrow. Most of the volume testing I did was in native mode, not parallel, so the first aim is to look at what happens if I build up vnodes of >> 100GB of data with an eleveldb backend and then switch on tictacaae. I'm hoping to keep building up the volume until I hit issues with aae store rebuilds timing out, aae cache rebuilds timing out and maybe fetch_clocks queries.
Tested this with 200GB per vnode (running a leveldb backend, but with a small ring size), which had been built with AAE enabled but not tictac_aae. The intention was to test what happens when all the nodes are then restarted at the same time with tictac_aae enabled (without the standard transition precaution of increasing the exchange tick).

No crashes were observed in the aftermath of this event; however, there were warning signs of potential trouble ahead.

When the restart occurs, each vnode/aae_controller spots that there is a state mismatch between the vnode (non-empty) and the parallel aae store (empty). This prompts a rebuild of the aae keystore to resolve the mismatch, as expected. There is minimal spacing of these rebuilds, so all nodes become busy concurrently, doing all their rebuilds.

The busyness is primarily with write activity: in this environment there was no noticeable impact on read_await, but a significant impact on write_await. It is assumed this is because the vnode backend by default is greedy at allocating itself memory, so most of the reads as the rebuild folds over the leveldb store are from memory (or from the page cache reading ahead). The high volume of writes is an unavoidable consequence of the write amplification necessary to build an LSM store.

In this environment, the rebuilds of the aae store took about 1 hour. However, there were only 4 vnodes per node, so in a normal environment this could have taken much longer, as disk utilisation was at 80% with 4 concurrent rebuilds.

Once the aae store rebuilds are complete, an aae treecache rebuild is prompted for each vnode. This requires a fold over the new aae_keystore in order to build the aae treecache. These aae treecache rebuilds are prompted almost concurrently, but the work is scheduled using the vnode's "Best Effort" queue, and by default only one item of work from this queue is allowed to run at a time per node.

At this stage, exchanges are occurring, and completing with no deltas, as all treecaches are empty (none have been rebuilt). After 20-30 minutes, the first treecaches are built, and the next treecache build for the node is picked up off the queue. At this stage, some aae exchanges begin to see false deltas when comparing trees, where they are comparing a completed tree with a still-empty tree. These false deltas require aae_folds over the whole store (albeit an efficient fold which skips over slots based on the aae segment index). This adds more read activity to the keystore, and seems to further slow the next treecache rebuilds. Some of the aae_exchanges begin to time out, waiting 10 minutes for the fold_clocks query to complete.

As the treecaches begin to get built across the cluster, the number of false delta hunts first increases, then decreases, as does the number of exchange timeouts. Eventually (after about 4 hours), the treecaches are all built, and false delta hunts stop altogether. The cluster at this stage is behaving as expected.
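The rise and fall in false delta hunts described above can be illustrated with a small model. This is only a sketch under a simplifying assumption that is not taken from the test: each exchange pairs two vnodes chosen independently, and a false delta occurs exactly when one side has a rebuilt treecache and the other is still empty. The function name is hypothetical.

```python
def false_delta_fraction(rebuilt_fraction: float) -> float:
    """Fraction of exchanges that compare a rebuilt treecache with an empty one.

    Assumes (simplification) that the two sides of an exchange are rebuilt
    independently with probability `rebuilt_fraction`.
    """
    if not 0.0 <= rebuilt_fraction <= 1.0:
        raise ValueError("rebuilt_fraction must be between 0 and 1")
    p = rebuilt_fraction
    # Either side may be the rebuilt one, hence the factor of 2.
    return 2 * p * (1 - p)


if __name__ == "__main__":
    for pct in (0, 25, 50, 75, 100):
        share = false_delta_fraction(pct / 100)
        print(f"{pct:3d}% rebuilt: {share:.3f} of exchanges mismatched")
```

Under this model the mismatch rate is zero when no treecaches or all treecaches are built, and highest when half of them are built, which is consistent in shape with the behaviour reported in the test.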
There are some signs of trouble:
All of this could be mitigated with configuration for this transition scenario: a temporary increase in the exchange tick time, and also an increase in the size of the BE queue. However, the problem of overlapping exchanges timing out could occur outside of this scenario in a high-entropy cluster.
There are obvious improvements to make:
The problem of rebuild scheduling is hard. Not queueing rebuilds may increase pressure, and cause other issues. Scheduling them further apart would increase the window for false positives. Perhaps false positives caused by initial rebuilds could be handled by comparing rebuild state at startup. There are potential solutions here, but none without notable additional complexity or alternative risks.
I don't know what is happening wrt logging.
Some thoughts on an algorithm through which the system can back-off on exchanges, should exchanges start to take a long time.
Never calling |
Proposal is:
This should resolve the problem of cascading load on slow exchanges. The issue of queueing treecache rebuilds and snapshot timeouts still needs to be resolved.
https://github.com/basho/riak_kv/tree/mas-i1765-parallel Presently this only addresses the issue of needing to back off as the time to complete exchanges increases. This works well at the moment: as the exchange time goes up, at the peak where 50% of vnodes have rebuilt but 50% have not, the frequency of exchanges is reduced by just 30%. With this reduction, the maximum time to complete an exchange reduces by more than 50%, with no exchange getting within 3 minutes of the old threshold for timeout.
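To make the idea of backing off as exchange time grows concrete, here is a minimal sketch. It is not the algorithm in the branch, whose details are not given in this thread. The function name, the 240-second tick and the cap of 8 are all hypothetical values chosen for illustration.

```python
def exchanges_to_skip(last_duration_s: float,
                      tick_s: float = 240.0,
                      max_skip: int = 8) -> int:
    """Return how many upcoming exchange ticks to skip.

    Hypothetical rule: an exchange that completed within one tick causes
    no back-off; a slower one skips one tick per whole tick of duration,
    up to `max_skip`.
    """
    if last_duration_s < 0:
        raise ValueError("last_duration_s must be non-negative")
    if tick_s <= 0:
        raise ValueError("tick_s must be positive")
    return min(max_skip, int(last_duration_s // tick_s))


if __name__ == "__main__":
    for duration in (30, 300, 500, 10_000):
        print(duration, "s ->", exchanges_to_skip(duration), "ticks skipped")
```

The design point is that the back-off is driven by observed exchange duration rather than a fixed setting, so the exchange rate recovers on its own once exchanges become fast again.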
The branch has now been extended so that the taking of the rebuild snapshot is deferred until the rebuild work item is consumed from the queue - so as long as the rebuild of the trees completes within 12 hours, there will be no issues with the snapshot timing out during the rebuild. The scenario above was then re-tested with equivalent (although slightly improved) results.
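The deferral described above can be sketched generically as follows. This is an illustration of the pattern only, not the riak_kv code: `RebuildQueue`, `take_snapshot` and `rebuild` are hypothetical names. The queue stores only a vnode identifier, and the snapshot is taken when the item is consumed, so its lifetime covers the rebuild itself rather than the wait in the queue.

```python
from collections import deque
from typing import Callable, Deque


class RebuildQueue:
    """Queue of pending rebuilds that snapshots lazily (hypothetical sketch)."""

    def __init__(self, take_snapshot: Callable[[str], object]) -> None:
        self._take_snapshot = take_snapshot
        self._pending: Deque[str] = deque()

    def enqueue(self, vnode_id: str) -> None:
        # Only the identifier is queued; no snapshot is held while waiting.
        self._pending.append(vnode_id)

    def run_next(self, rebuild: Callable[[object], None]) -> bool:
        """Run the oldest pending rebuild; return False if nothing is queued."""
        if not self._pending:
            return False
        vnode_id = self._pending.popleft()
        # The snapshot is taken here, at consumption time.
        snapshot = self._take_snapshot(vnode_id)
        rebuild(snapshot)
        return True
```

With this shape, the time an item spends queued behind other rebuilds no longer counts against the snapshot timeout.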
Hi Martin, thanks for detailing the issue. Would allowing additional capacity to accommodate the initial treecache build alleviate this issue? Could you also please share the specs of the nodes used for testing? Yeah, logging in general for riak could use some work; I'll create an issue if one doesn't exist already.
The test rig is 8 x servers, each with 8 x large HDD (RAID 10) and 12 CPU cores. There is a FBWC on the RAID controller for improved write latency. Ran a more challenging version of the test, with 1.4TB per node of data, with about 4K ops per second (90% writes) of load during the test. This test also included a server failure mid-test (not deliberate, but a kernel panic). Everything went OK with the new code, until I tried recovering the failed server. Then the combination of hinted handoffs and the backlog of rebuilds on the server recreated the memory-leak problem. This exposed that there is an issue with the aae_runner in kv_index_tictactree having an unbounded queue. The issue with slow fetch_clocks queries is related more to queue time at the aae_runner than it is to query time. There is a need for a further fix to bound the queue on aae_runner, so as not to take snapshots of the store and place the snapshot-related work at the end of the queue.
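A bounded queue of the kind suggested above might look like the following sketch. It is a generic illustration, not the kv_index_tictactree implementation: `BoundedRunner`, `submit` and `run_one` are hypothetical names. The key property is that when the queue is full the work is refused before any snapshot is taken, so snapshots cannot pile up behind a backlog.

```python
from collections import deque
from typing import Any, Callable, Deque, Optional, Tuple


class BoundedRunner:
    """Work queue with a fixed capacity (hypothetical sketch)."""

    def __init__(self, max_queue: int) -> None:
        if max_queue < 1:
            raise ValueError("max_queue must be at least 1")
        self._max_queue = max_queue
        self._queue: Deque[Tuple[Any, Callable[[Any], Any]]] = deque()

    def submit(self,
               take_snapshot: Callable[[], Any],
               work: Callable[[Any], Any]) -> bool:
        """Queue work if there is room; return False when the queue is full."""
        if len(self._queue) >= self._max_queue:
            # Refuse before snapshotting, so no snapshot is held for
            # work that cannot be queued. The caller may retry later.
            return False
        self._queue.append((take_snapshot(), work))
        return True

    def run_one(self) -> Optional[Any]:
        """Run the oldest queued item, or return None if the queue is empty."""
        if not self._queue:
            return None
        snapshot, work = self._queue.popleft()
        return work(snapshot)
```

A caller that receives `False` could treat the exchange as skipped for this tick, which bounds both queue time and the number of live snapshots.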
It should be noted that the process of enabling tictacaae for the first time, on a large store, when under heavy write load, will consume significant amounts of memory. Each aae_keystore keeps a "change_queue" of Keys/Metadata of updates received since the rebuild of the store started. Since all rebuilds are running concurrently when tictacaae is enabled for the first time, this can require lots of memory. In the above test an additional 4GB per node was required by the end of the 3 to 4 hour period of the keystore rebuilds. To reduce the RAM required, enabling tictacaae must happen away from periods of heavy write load.
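As a rough planning aid, the memory held by the change_queue can be estimated from the write rate and the rebuild duration. This is a back-of-envelope sketch only: the function name is hypothetical, and the bytes-per-entry figure is an input you would need to measure, as it is not stated in this thread.

```python
def change_queue_bytes(writes_per_sec: float,
                       rebuild_seconds: float,
                       bytes_per_entry: float) -> float:
    """Estimate memory used by queued updates during a keystore rebuild.

    Assumes every write received during the rebuild is held in memory
    until the rebuild completes, at a constant size per entry.
    """
    if min(writes_per_sec, rebuild_seconds, bytes_per_entry) < 0:
        raise ValueError("inputs must be non-negative")
    return writes_per_sec * rebuild_seconds * bytes_per_entry


if __name__ == "__main__":
    # Hypothetical inputs: 1,000 writes/s reaching a node for one hour,
    # at 200 bytes per queued Key/Metadata entry.
    estimate = change_queue_bytes(1000, 3600, 200)
    print(f"{estimate / 1e9:.2f} GB")
```

Because the estimate scales linearly with both write rate and rebuild time, enabling tictacaae during a quiet period reduces the requirement on both counts.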
masleeds 3:21 AM
I mean tictac_aae in parallel mode - so leveldb or bitcask as storage_backend, but leveled as AAE backend. The other user had problems when they had a cluster with busy vnodes, and they switched tictac_aae on - and the rebuilds/repair cycle that got triggered as tictac_aae was trying to get up to speed went into a downward spiral.
https://postriak.slack.com/archives/C6R0LPH4N/p1593349532100600