Memory fragmentation #1820
Changing the single block carrier threshold to a lower value on the eheap_alloc has a very negative impact. This change does not help: erlang.eheap_memory.sbct = 128
|
The memory manipulation had an impact, but a relatively minor one. Most of the memory being used by the BEAM in this test is a result of leveled consuming memory for caches. Testing with a change to the leveled_sst cache to periodically release the blockindex_cache when not busy (and also to have more GC triggers within the code) shows that changing SST caching makes a bigger difference:

GET requests are unevenly spread around the cluster with leveled: the GET request will be directed to the node which responded fastest to the HEAD request. So by examining how the GET requests were distributed, we can see which nodes were performing better during the test:

Now looking at CPU by node, there is very little difference, so the "sst+ramv" node was doing more work and doing that work with relative efficiency:

Following this point in the test, one node had issues, and scanning of the full store for either intra or inter cluster AAE increased. Those nodes with the sst change then showed higher disk utilisation (as block index caches were fetched from disk), but maintained their memory advantage.

The hypothesis is that releasing and regaining the cache has disk costs, but reduces fragmentation. |
Wouldn't it be appropriate, in addition to exposing various parameters affecting various allocators (https://github.com/basho/riak/tree/mas-i1820-controls), to expose flags to enable/disable specific allocators as such? My reasoning is that, since
we can hope to minimize any adverse effects of having multiple allocators in active use simultaneously. I can imagine there exist such allocation patterns (say, of binaries vs whatever objects the VM thinks are 'short-lived') that will tend to induce fragmentation more easily/reliably. Disabling those special-purpose allocators will then make it easier for the default allocator (sys_alloc) to deal with fragmentation, as it will see a more complete map of allocated blocks and free space. The default allocator maps to libc malloc, which I think has seen sufficient polish and fragmentation proofing (but don't call me out on that). |
As well as experimenting with memory settings, I've also done some experimentation with scheduler settings. It was noticeable that with changes in scheduler settings I was getting variances in memory usage ... and sometimes improvements when compared to different memory settings.

The scheduler changes have shown that by default too many schedulers are starting, and this is causing a huge number of context switches, and consequently higher CPU (and presumably inefficient usage of memory caches). What is interesting here is that the non-default settings that Basho introduced to counter problems with scheduler collapse in R16 seem to be causing issues, and that we get better behaviour with current stock Erlang settings.

I have some more tests to run before I have clear answers, but currently I'm considering whether supporting changes here may cause operational overheads ... and whether overall we're better sticking to well-tested beam defaults for whatever version we're running. When I start looking at all factors combined (memory, CPU use, latency, CPU efficiency), the default settings start to look more desirable. Will post results as I get them.

@hmmr - so with regards to exposing disabling of allocators, I'm not sure. I might not expose/test that at this stage ... and in general, if it looks like moving towards rather than away from beam defaults is the right way to go, I would be cautious about doing it at all. |
One interesting thing that has been seen is that, in the test, 1 of the 4 nodes is put under greater pressure than the other nodes. In this case we get a strange situation where msacc shows all schedulers very busy, but the OS does not show the equivalent CPU busyness. So on the node, top reports the equivalent of about 14 busy CPUs:
But looking at microstate accounting at this time:
There are 18 schedulers that are only sleeping around 6% of the time, and 12 dirty_io schedulers that are only sleeping 20% of the time ... which amounts to more than 2600% CPU!!! When the node is in this state, although it is reporting a lot of CPU time spent on GC, it isn't clear GC is occurring, as the largest processes by memory usage have large amounts of collectable garbage. It is in this state that we can see the memory usage by the beam on the node increasing. |
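The arithmetic behind the 2600% figure can be checked with a short sketch. Python is used purely for illustration; the scheduler counts and sleep percentages are the ones quoted above, while the function and variable names are hypothetical.

```python
def busy_cpu_percent(groups):
    """Sum the non-sleeping share of each scheduler group.

    groups: iterable of (scheduler_count, sleep_fraction) pairs.
    Returns the total as a percentage of one CPU.
    """
    return sum(count * (1.0 - sleep) * 100.0 for count, sleep in groups)


# Figures quoted in the comment: 18 normal schedulers sleeping ~6% of the
# time, and 12 dirty_io schedulers sleeping ~20% of the time.
msacc_groups = [(18, 0.06), (12, 0.20)]
beam_reported = busy_cpu_percent(msacc_groups)

# top reported the equivalent of about 14 busy CPUs on the same node.
os_reported = 14 * 100.0

print(f"msacc-implied busy CPU: {beam_reported:.0f}%")
print(f"top-reported busy CPU:  {os_reported:.0f}%")
```

The msacc-implied total is above what a 24 vCPU node can physically supply (2400%), which is consistent with the schedulers being counted as busy while not actually executing on a CPU.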
During the same test, at the same time, on a "healthy" node in the cluster:
On the "healthy" node a lot of the msacc-reported CPU busyness is in other, which includes IO wait time, so accounting for that there seems to be a closer match between OS and beam reported CPU (these are 24 vCPU nodes). |
Prior to reducing the scheduler counts, the problem described above got much worse in some tests - i.e. schedulers busy, but reported CPU not as high, memory use increasing. In some cases some long-running tasks appeared to almost stop altogether, and memory usage increased rapidly to 40GB. |
A summary of the current understanding of memory issues.
The following improvements are therefore to be added to 3.0.10:
|
1 - 4 are implemented in Riak 3.0.10. Also in 3.0.10, another significant cause of memory issues in leveled was detected - martinsumner/leveled#377. With these fixes, fix (5) has been deferred - martinsumner/leveled#376. It does look very likely that the memory issues are not VM-related; they are all, in effect, leveled bugs. Also in 3.0.10 are a series of fixes to use disk-backed queues rather than pure memory queues, to avoid another area where memory can run out of control. |
Current issue on a production cluster running with 30GB in use according to The issue is focused on the single-block carrier threshold, with fragmentation running at < 20% across all (48 in this case) eheap allocators e.g.
Looking to experiment with https://www.erlang.org/doc/man/erts_alloc.html#M_rsbcst - and reverse out recommendations about lowering sbct. |
There have been reports from users of Riak suffering memory fragmentation with large clusters with leveled backends.
This appears to be an issue with both OTP 22.3 as well as OTP R16basho.
Primary indicator is:
recon_alloc:memory(usage).
Which can be between 50% and 80%.
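To give a sense of scale, here is a minimal sketch of what a usage figure in that range means. It assumes the value is the ratio of used to allocated memory (as recon's documentation describes it); the 30 GB allocation and the function name are illustrative assumptions, not measurements from an affected cluster.

```python
def unused_allocated_gb(allocated_gb, usage_ratio):
    """Memory held by the allocators but not in active use, in GB."""
    if not 0.0 <= usage_ratio <= 1.0:
        raise ValueError("usage_ratio must be between 0.0 and 1.0")
    return allocated_gb * (1.0 - usage_ratio)


# Illustrative figure only: a node whose allocators hold 30 GB.
allocated_gb = 30.0
for usage in (0.5, 0.8):
    held = unused_allocated_gb(allocated_gb, usage)
    print(f"usage {usage:.0%}: {held:.1f} GB allocated but unused")
```

Whatever is allocated but unused is memory the OS cannot give to the page cache.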
This is not universal; some leveled/Riak clusters do not see this. However, on clusters that do see it, there is limited mitigation other than restarting nodes. The excess memory takes from the Page Cache, and so can have a direct impact on performance. In extreme cases, unexpected and sudden memory exhaustion has been observed.
Horizontal scaling currently appears to offer better mitigation than vertical scaling.
The fragmentation is primarily with the eheap_alloc and binary_alloc, and is associated with inefficient use of multiblock carriers, primarily these carriers seeming never to transition to empty and be reclaimed.