Data Lag #572

Closed
akurniawan opened this issue Aug 2, 2016 · 9 comments

akurniawan commented Aug 2, 2016

Hi guys, I'm having a bit of difficulty understanding the root cause of the data lag in my Graphite cluster. Sometimes new data won't show up for a minute or so. I have already upgraded the instances in my cluster, so I don't think there is any hardware bottleneck at the moment.
Moreover, looking at my relay metrics, the queue size is quite high, around 6k.
What should I do to decrease the queue size? Is it related to the data lag I'm experiencing?
I tried to find documentation explaining every carbon metric, but couldn't find any. Does anybody have a reference for these metrics?

**STORAGE-SCHEMAS**
[carbon]
pattern = ^carbon\.
retentions = 60:90d

[default_1min_for_1day]
pattern = .*
retentions = 10s:6h,1m:30d,1d:90d

**STORAGE-AGGREGATION**
[default]
pattern = .*
xFilesFactor = 0.0
aggregationMethod = average

[min]
pattern = \.min$
xFilesFactor = 0.0
aggregationMethod = min

[max]
pattern = \..*(max|p90|p95)$
xFilesFactor = 0.0
aggregationMethod = max

[sum]
pattern = \..*(count|curr_op_\d*_s|faults|mongo_q[wr])$
xFilesFactor = 0.0
aggregationMethod = sum

[avg]
pattern = (\..*(avg|average|_mongo_write_lock)$|(.*cpu.*|.*load\.|\.bytes_per_second\.(write|read)$|\.memory\.free$))
xFilesFactor = 0.0
aggregationMethod = average

[last]
pattern = \.percent_bytes\.free$
xFilesFactor = 0.0
aggregationMethod = last

**CARBON.CONF**
[cache]
LOCAL_DATA_DIR = /whisper
ENABLE_LOGROTATION = True
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 50000
MAX_CREATES_PER_MINUTE = 10000
LINE_RECEIVER_INTERFACE = 127.0.0.1
ENABLE_UDP_LISTENER = False
PICKLE_RECEIVER_INTERFACE = 127.0.0.1
LOG_LISTENER_CONNECTIONS = False
USE_INSECURE_UNPICKLER = False
CACHE_QUERY_INTERFACE = 127.0.0.1
USE_FLOW_CONTROL = True
LOG_UPDATES = True
LOG_CACHE_HITS = True
LOG_CACHE_QUEUE_SORTS = True
CACHE_WRITE_STRATEGY = sorted
WHISPER_AUTOFLUSH = False
WHISPER_FALLOCATE_CREATE = True
WHISPER_LOCK_WRITES = False
CARBON_METRIC_PREFIX = carbon
CARBON_METRIC_INTERVAL = 60
[cache:0]
LINE_RECEIVER_PORT = 2013
PICKLE_RECEIVER_PORT = 2014
CACHE_QUERY_PORT = 7002
[cache:1]
LINE_RECEIVER_PORT = 2023
PICKLE_RECEIVER_PORT = 2024
CACHE_QUERY_PORT = 7102
[cache:2]
LINE_RECEIVER_PORT = 2033
PICKLE_RECEIVER_PORT = 2034
CACHE_QUERY_PORT = 7202
[cache:3]
LINE_RECEIVER_PORT = 2043
PICKLE_RECEIVER_PORT = 2044
CACHE_QUERY_PORT = 7302
[cache:4]
LINE_RECEIVER_PORT = 2053
PICKLE_RECEIVER_PORT = 2054
CACHE_QUERY_PORT = 7402
[cache:5]
LINE_RECEIVER_PORT = 2063
PICKLE_RECEIVER_PORT = 2064
CACHE_QUERY_PORT = 7502
[cache:6]
LINE_RECEIVER_PORT = 2073
PICKLE_RECEIVER_PORT = 2074
CACHE_QUERY_PORT = 7602
[cache:7]
LINE_RECEIVER_PORT = 2083
PICKLE_RECEIVER_PORT = 2084
CACHE_QUERY_PORT = 7702

[relay]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2004
LOG_LISTENER_CONNECTIONS = True
RELAY_METHOD = consistent-hashing
REPLICATION_FACTOR = 1
DIVERSE_REPLICAS = False
DESTINATIONS = 127.0.0.1:2014:0,127.0.0.1:2024:1,127.0.0.1:2034:2,127.0.0.1:2044:3,127.0.0.1:2054:4,127.0.0.1:2064:5,127.0.0.1:2074:6,127.0.0.1:2084:7
MAX_DATAPOINTS_PER_MESSAGE = 50000
MAX_QUEUE_SIZE = 2000000
QUEUE_LOW_WATERMARK_PCT = 0.8
USE_FLOW_CONTROL = True
CARBON_METRIC_PREFIX = carbon
CARBON_METRIC_INTERVAL = 60

[aggregator]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2023
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2024
LOG_LISTENER_CONNECTIONS = True
FORWARD_ALL = True
DESTINATIONS = 127.0.0.1:2014:0, 127.0.0.1:2024:1
REPLICATION_FACTOR = 1
MAX_QUEUE_SIZE = 10000
USE_FLOW_CONTROL = True
MAX_DATAPOINTS_PER_MESSAGE = 500
MAX_AGGREGATION_INTERVALS = 5

**WEB.CONF**
CARBONLINK_HOSTS = ["127.0.0.1:7002:0", "127.0.0.1:7102:1", "127.0.0.1:7202:2", "127.0.0.1:7302:3", "127.0.0.1:7402:4", "127.0.0.1:7502:5", "127.0.0.1:7602:6", "127.0.0.1:7702:7"]
CARBONLINK_TIMEOUT = 60

@obfuscurity (Member) commented:

If your relay queues are increasing, then it's most likely that the downstream caches aren't able to keep up. How are they doing with CPU? How is I/O behaving?
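
For reference, and hedging on exact paths since they differ a bit across Carbon versions: each carbon-cache instance publishes its own health metrics under the CARBON_METRIC_PREFIX configured above, which speak to both questions.

# CPU consumed by each cache daemon
carbon.agents.<host>-<instance>.cpuUsage
# average whisper write time, a useful I/O pressure signal
carbon.agents.<host>-<instance>.avgUpdateTime
# datapoints written per whisper update (batching efficiency)
carbon.agents.<host>-<instance>.pointsPerUpdate
# datapoints currently buffered in memory
carbon.agents.<host>-<instance>.cache.size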

akurniawan commented Aug 2, 2016

Can I assume that adding more carbon-cache processes would solve the growing relay queue?

As you can see below, CPU and I/O are all still below the instance's maximum thresholds.
These are the disk metrics from our current instances:
[screenshots: disk metrics, 2016-08-03 6:03 AM]
These are the instance metrics:
[screenshots: instance metrics, 2016-08-03 6:04 AM]

@obfuscurity (Member) commented:

Note that there's nothing particularly wrong with having lots of queues. As you can see in obfuscurity/synthesize#12 (comment), this system had over 600k queues (unique metric keys) in memory. What's important is that the number of datapoints in memory doesn't bloat over time; the metrics should eventually be flushed to disk (batch writes are efficient, represented by the pointsPerUpdate statistic) and the webapp should be able to fetch and render them in a reasonable amount of time. Play around with your MAX_UPDATES_PER_SECOND value; it should be some value less than your inbound datapoints volume. For example, in my aforementioned link I intentionally set it super-low to 200 (per cache) which forces the batch writes.
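
As an illustrative sketch only (assuming the carbon prefix from the config above; exact metric paths can differ by Carbon version), the relationship between inbound volume and batched writes can be graphed with render targets roughly like these:

# total datapoints received across all cache instances per reporting interval
target=sumSeries(carbon.agents.*.metricsReceived)
# whisper update operations actually performed, capped by MAX_UPDATES_PER_SECOND
target=sumSeries(carbon.agents.*.updateOperations)
# datapoints written per whisper update; higher means batching is working
target=averageSeries(carbon.agents.*.pointsPerUpdate)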

TL;DR it's all relative. If your box is performing fine, then the number of queues and datapoints in memory are irrelevant.

P.S. I just noticed that your MAX_CACHE_SIZE is set to infinite. You'll want to set that to a non-infinite value to force Carbon to use batch writes. Keep an eye on your caches' console.log files to see if they hit this limit; adjust as necessary.
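
A minimal sketch of the two knobs discussed above, with purely illustrative numbers rather than recommendations from this thread (both live in the [cache] section of carbon.conf, and MAX_CACHE_SIZE counts datapoints, not bytes):

[cache]
# finite cap so the cache is bounded; overflow will show up in console.log for tuning
MAX_CACHE_SIZE = 2000000
# keep this below the inbound datapoint rate so points accumulate and whisper writes are batched
MAX_UPDATES_PER_SECOND = 1000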

@akurniawan (Author) commented:

How do I know whether the number of datapoints in memory is bloating? Is watching the percentage of used memory enough?
Related to that, and to your suggestion to set MAX_CACHE_SIZE to something finite: we are currently using instances with 16 GB of RAM and memory usage is still quite low (around 50%). Is that expected? And what difference would it make if we changed the value from infinite to, say, the size of our RAM?

@obfuscurity (Member) commented:

You can track the carbon.agents.<instance>.size metrics, which measure the number of datapoints in memory per cache instance.

Setting MAX_CACHE_SIZE is a good habit and will, when tuned properly, allow you to benefit from batched writes. See #521 (comment) for more detail.
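
Illustrative only: with the eight cache instances defined in carbon.conf above, a render target along these lines tracks it (depending on the Carbon version the path segment may appear as .size or .cache.size):

# datapoints buffered in memory per cache instance; this should plateau rather than grow without bound
target=carbon.agents.*.size
# or summed across all instances on the box
target=sumSeries(carbon.agents.*.size)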

akurniawan commented Aug 31, 2016

Hi @obfuscurity, I tried the metric you suggested, but the value shows up as negative. Should I use the perSecond() function to display this datapoint properly, or is something wrong with my cluster?

@obfuscurity (Member) commented:

There have been some reports of the size metric going negative. #551 is one possible fix.

@akurniawan (Author) commented:

@obfuscurity, sorry, I don't quite follow the issue in #551; what's discussed there seems to be about locking, and I don't see anything about fixing this problem. Am I reading it wrong?

@obfuscurity (Member) commented:

You asked about the metric showing as negative. I was explaining that it's a bug and there's a potential fix.
