Data Lag #572

Closed
akurniawan opened this issue Aug 2, 2016 · 9 comments

akurniawan commented Aug 2, 2016

Hi guys, I'm having a bit of difficulty understanding the root cause of the data lag in my Graphite cluster. Sometimes new data won't show up for a minute or so. I have already upgraded the instances in my cluster, so I don't think there is any hardware bottleneck at the moment.
Moreover, looking at my relay metrics, the queue size is quite high, around 6k.
What should I do to decrease the queue size? Is it related to the data lag I'm experiencing?
I tried to find documentation explaining every carbon metric, but couldn't find any. Does anybody have a reference for these metrics?

**STORAGE-SCHEMAS**
[carbon]
pattern = ^carbon\.
retentions = 60:90d

[default_1min_for_1day]
pattern = .*
retentions = 10s:6h,1m:30d,1d:90d

**STORAGE-AGGREGATION**
[default]
pattern = .*
xFilesFactor = 0.0
aggregationMethod = average

[min]
pattern = \.min$
xFilesFactor = 0.0
aggregationMethod = min

[max]
pattern = \..*(max|p90|p95)$
xFilesFactor = 0.0
aggregationMethod = max

[sum]
pattern = \..*(count|curr_op_\d*_s|faults|mongo_q[wr])$
xFilesFactor = 0.0
aggregationMethod = sum

[avg]
pattern = (\..*(avg|average|_mongo_write_lock)$|(.*cpu.*|.*load\.|\.bytes_per_second\.(write|read)$|\.memory\.free$))
xFilesFactor = 0.0
aggregationMethod = average

[last]
pattern = \.percent_bytes\.free$
xFilesFactor = 0.0
aggregationMethod = last

**CARBON.CONF**
[cache]
LOCAL_DATA_DIR = /whisper
ENABLE_LOGROTATION = True
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 50000
MAX_CREATES_PER_MINUTE = 10000
LINE_RECEIVER_INTERFACE = 127.0.0.1
ENABLE_UDP_LISTENER = False
PICKLE_RECEIVER_INTERFACE = 127.0.0.1
LOG_LISTENER_CONNECTIONS = False
USE_INSECURE_UNPICKLER = False
CACHE_QUERY_INTERFACE = 127.0.0.1
USE_FLOW_CONTROL = True
LOG_UPDATES = True
LOG_CACHE_HITS = True
LOG_CACHE_QUEUE_SORTS = True
CACHE_WRITE_STRATEGY = sorted
WHISPER_AUTOFLUSH = False
WHISPER_FALLOCATE_CREATE = True
WHISPER_LOCK_WRITES = False
CARBON_METRIC_PREFIX = carbon
CARBON_METRIC_INTERVAL = 60
[cache:0]
LINE_RECEIVER_PORT = 2013
PICKLE_RECEIVER_PORT = 2014
CACHE_QUERY_PORT = 7002
[cache:1]
LINE_RECEIVER_PORT = 2023
PICKLE_RECEIVER_PORT = 2024
CACHE_QUERY_PORT = 7102
[cache:2]
LINE_RECEIVER_PORT = 2033
PICKLE_RECEIVER_PORT = 2034
CACHE_QUERY_PORT = 7202
[cache:3]
LINE_RECEIVER_PORT = 2043
PICKLE_RECEIVER_PORT = 2044
CACHE_QUERY_PORT = 7302
[cache:4]
LINE_RECEIVER_PORT = 2053
PICKLE_RECEIVER_PORT = 2054
CACHE_QUERY_PORT = 7402
[cache:5]
LINE_RECEIVER_PORT = 2063
PICKLE_RECEIVER_PORT = 2064
CACHE_QUERY_PORT = 7502
[cache:6]
LINE_RECEIVER_PORT = 2073
PICKLE_RECEIVER_PORT = 2074
CACHE_QUERY_PORT = 7602
[cache:7]
LINE_RECEIVER_PORT = 2083
PICKLE_RECEIVER_PORT = 2084
CACHE_QUERY_PORT = 7702

[relay]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2004
LOG_LISTENER_CONNECTIONS = True
RELAY_METHOD = consistent-hashing
REPLICATION_FACTOR = 1
DIVERSE_REPLICAS = False
DESTINATIONS = 127.0.0.1:2014:0,127.0.0.1:2024:1,127.0.0.1:2034:2,127.0.0.1:2044:3,127.0.0.1:2054:4,127.0.0.1:2064:5,127.0.0.1:2074:6,127.0.0.1:2084:7
MAX_DATAPOINTS_PER_MESSAGE = 50000
MAX_QUEUE_SIZE = 2000000
QUEUE_LOW_WATERMARK_PCT = 0.8
USE_FLOW_CONTROL = True
CARBON_METRIC_PREFIX = carbon
CARBON_METRIC_INTERVAL = 60

[aggregator]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2023
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2024
LOG_LISTENER_CONNECTIONS = True
FORWARD_ALL = True
DESTINATIONS = 127.0.0.1:2014:0, 127.0.0.1:2024:1
REPLICATION_FACTOR = 1
MAX_QUEUE_SIZE = 10000
USE_FLOW_CONTROL = True
MAX_DATAPOINTS_PER_MESSAGE = 500
MAX_AGGREGATION_INTERVALS = 5

**WEB.CONF**
CARBONLINK_HOSTS = ["127.0.0.1:7002:0", "127.0.0.1:7102:1", "127.0.0.1:7202:2", "127.0.0.1:7302:3", "127.0.0.1:7402:4", "127.0.0.1:7502:5", "127.0.0.1:7602:6", "127.0.0.1:7702:7"]
CARBONLINK_TIMEOUT = 60

@obfuscurity (Member) commented:

If your relay queues are increasing, then it's most likely that the downstream caches aren't able to keep up. How are they doing with CPU? How is I/O behaving?
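
For reference, and hedging on exact paths since they differ a bit across Carbon versions: each carbon-cache instance publishes its own health metrics under the CARBON_METRIC_PREFIX configured above, which speak to both questions.

# CPU consumed by each cache daemon
carbon.agents.<host>-<instance>.cpuUsage
# average whisper write time, a useful I/O pressure signal
carbon.agents.<host>-<instance>.avgUpdateTime
# datapoints written per whisper update (batching efficiency)
carbon.agents.<host>-<instance>.pointsPerUpdate
# datapoints currently buffered in memory
carbon.agents.<host>-<instance>.cache.size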

akurniawan commented Aug 2, 2016

Can I assume that adding more carbon-cache processes would solve the growing relay queue?

As you can see below, CPU and I/O are all still below the instance's maximum thresholds.
These are the disk metrics from our current instances:
[screenshots: disk metrics, 2016-08-03 6:03 AM]
These are the instance metrics:
[screenshots: instance metrics, 2016-08-03 6:04 AM]

@obfuscurity (Member) commented:

Note that there's nothing particularly wrong with having lots of queues. As you can see in obfuscurity/synthesize#12 (comment), this system had over 600k queues (unique metric keys) in memory. What's important is that the number of datapoints in memory doesn't bloat over time; the metrics should eventually be flushed to disk (batch writes are efficient, represented by the pointsPerUpdate statistic) and the webapp should be able to fetch and render them in a reasonable amount of time. Play around with your MAX_UPDATES_PER_SECOND value; it should be some value less than your inbound datapoints volume. For example, in my aforementioned link I intentionally set it super-low to 200 (per cache) which forces the batch writes.
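
As an illustrative sketch only (assuming the carbon prefix from the config above; exact metric paths can differ by Carbon version), the relationship between inbound volume and batched writes can be graphed with render targets roughly like these:

# total datapoints received across all cache instances per reporting interval
target=sumSeries(carbon.agents.*.metricsReceived)
# whisper update operations actually performed, capped by MAX_UPDATES_PER_SECOND
target=sumSeries(carbon.agents.*.updateOperations)
# datapoints written per whisper update; higher means batching is working
target=averageSeries(carbon.agents.*.pointsPerUpdate)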

TL;DR it's all relative. If your box is performing fine, then the number of queues and datapoints in memory are irrelevant.

P.S. I just noticed that your MAX_CACHE_SIZE is set to infinite. You'll want to set that to a non-infinite value to force Carbon to use batch writes. Keep an eye on your caches' console.log files to see if they hit this limit; adjust as necessary.
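
A minimal sketch of the two knobs discussed above, with purely illustrative numbers rather than recommendations from this thread (both live in the [cache] section of carbon.conf, and MAX_CACHE_SIZE counts datapoints, not bytes):

[cache]
# finite cap so the cache is bounded; overflow will show up in console.log for tuning
MAX_CACHE_SIZE = 2000000
# keep this below the inbound datapoint rate so points accumulate and whisper writes are batched
MAX_UPDATES_PER_SECOND = 1000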

@akurniawan (Author) commented:

How do I know whether the number of datapoints in memory is bloating? Is watching the percentage of used memory enough?
Related to that, and to your suggestion to set MAX_CACHE_SIZE to something finite: we are currently using instances with 16 GB of RAM and memory usage is still quite low (around 50%). Is that expected? And what difference would it make if we changed the value from infinite to, say, the size of our RAM?

@obfuscurity (Member) commented:

You can track the carbon.agents.<instance>.size metrics, which measure the number of datapoints in memory per cache instance.

Setting MAX_CACHE_SIZE is a good habit and will, when tuned properly, allow you to benefit from batched writes. See #521 (comment) for more detail.
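
Illustrative only: with the eight cache instances defined in carbon.conf above, a render target along these lines tracks it (depending on the Carbon version the path segment may appear as .size or .cache.size):

# datapoints buffered in memory per cache instance; this should plateau rather than grow without bound
target=carbon.agents.*.size
# or summed across all instances on the box
target=sumSeries(carbon.agents.*.size)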

akurniawan commented Aug 31, 2016

Hi @obfuscurity, I tried the metric you suggested, but the value shows up as negative. Should I use the perSecond() function to display this datapoint properly, or is something wrong with my cluster?

@obfuscurity (Member) commented:

There have been some reports of the size metric going negative. #551 is one possible fix.

@akurniawan (Author) commented:

@obfuscurity, sorry, I don't quite follow the issue in #551; what's discussed there seems to be about locking, and I don't see anything about fixing this problem. Am I reading it wrong?

@obfuscurity (Member) commented:

You asked about the metric showing as negative. I was explaining that it's a bug and there's a potential fix.
