Releases: grafana/metrictank
v1.1
query api
- Graphite pipe query syntax. #1854
- Implement /metrics/expand endpoint #1795
- Meta tags: avoid frequent write locks #1881
- Native aggregateWithWildcards (and sum / multiply / avg) #1863
- Native substr #1363
- Native timeShift #1873
- Tagged delete #1902
monitoring and limits
- Update profiletrigger: switch from vsz to rss and add threshold for heap as well #1914
- Monitoring of number of goroutines and other stats tweaks #1866
- Apply max-series-per-req to non-tagged queries #1926
- MaxSeries limit fixes #1929
- Fix Server.findSeries() limit application #1932, #1934
- Index op timeout #1944
- Various dashboard fixes #1883, #1910
tools
- mt-index-cat: bigtable support #1909
- mt-index-cat: add orgID filter #1942
- add mt-write-delay-schema-explain tool #1895
- add mt-indexdump-rules-analyzer tool #1840
other
- leverage in-flight requests of other replicas to possibly overcome shard query errors #1869
- priority index locking and logging of long operations #1847, #1887
- slice pooling improvements resulting in memory/GC improvements #1858, #1921, #1922, #1923, #1924
- enable index write queue by default #1891
- Improve performance of meta tag "doesn't exist" expressions #1920
- SASL support for kafka input and cluster plugins #1956
v1.0
breaking changes
- As of v0.13.1-38-gb88c3b84, by default we reject data points with a timestamp far in the future.
By default the cutoff is at 10% of the raw retention's TTL, so for example with the default
storage schema `1s:35d:10min:7` the cutoff is at `35d * 0.1 = 3.5d`.
The limit can be configured via the parameter `retention.future-tolerance-ratio`, or it can
be completely disabled via the parameter `retention.enforce-future-tolerance`.
To predict whether Metrictank would drop incoming data points once the enforcement is turned on,
the metric `metrictank.sample-too-far-ahead` can be used: it counts the data points which
would be dropped if the enforcement were turned on while it is off. #1572
- Prometheus integration removal. As of v0.13.1-97-gd77c5a31, it is no longer possible to use Metrictank
to scrape Prometheus data or to query data via PromQL. There was not enough usage (or customer interest)
to keep maintaining this functionality. #1613
- As of v0.13.1-110-g6b6f475a, tag support is enabled by default, though it can still be disabled.
If metrics with tags were previously ingested while tag support was disabled,
those tags were treated as a normal part of the metric name. Now that tag support
is enabled due to this change, the tags are treated as tags and are no longer
part of the metric name. As a result, there is a very unlikely scenario in which some
queries don't return the same results as before, namely if they query for tags as part of the
metric name. (Note: meta tags are still disabled by default.) #1619
- As of v0.13.1-186-gc75005d, the `/tags/delSeries` endpoint no longer accepts a `propagate` parameter.
It is no longer possible to send the request to only a single node; it now always propagates to all nodes, bringing this method in line with `/metrics/delete`.
- As of v0.13.1-250-g21d1dcd1 (#951), Metrictank no longer excessively aligns all data to the same
lowest common multiple resolution, but rather keeps data at its native resolution when possible.
  - When queries request mixed resolution data, this will now typically result in larger response datasets,
with more points, and thus slower responses.
The max-points-per-req-soft and max-points-per-req-hard settings will still help curb this problem.
Note that the hard limit was previously not always applied correctly;
queries may run into this limit (and error) when they did not before.
  - This version introduces 2 new optimizations (see the pre-normalization and mdp-optimization settings).
The latter is experimental and disabled by default, but the former is recommended and enabled by default.
It helps alleviate the extra cost of queries in certain cases.
(See https://github.com/grafana/metrictank/blob/master/docs/render-path.md#pre-normalization for more details.)
When upgrading a cluster in which you want to enable pre-normalization (recommended),
you must apply caution: pre-normalization requires a PNGroup property to be
communicated in intra-cluster data requests, which older peers don't have.
The peer receiving the client request, which fans out the query across the cluster, only sets
the flag if the optimization is enabled (and applicable). If the flag is set for the requests,
it needs the same flag set in the responses it receives from its peers in order to tie the data back to the initiating requests.
Otherwise, the data won't be included in the response, which may result in missing series, incorrect aggregates, etc.
Peers responding to a getdata request include the field in the response, whether they have the
optimization enabled or not.
Thus, to upgrade an existing cluster, you have 2 options:
A) Disable pre-normalization and do an in-place upgrade; then enable it and do another in-place upgrade.
This works regardless of whether you have separate query peers, and regardless of whether you first
upgrade query or shard nodes.
B) Do a colored deployment: create a new gossip cluster that has the optimization enabled from the get-go,
then delete the older deployment.
- As of v0.13.1-384-g82dedf95, the meta record index configuration parameters have been moved out
of the section `cassandra-idx`; they now have their own section `cassandra-meta-record-idx`.
- As of v0.13.1-433-g4c801819, Metrictank proxies bad requests to Graphite,
though as of v0.13.1-577-g07eed80f this is configurable via the `http.proxy-bad-requests` flag.
Leave it enabled if your queries are in the grey zone (rejected by Metrictank, tolerated by Graphite);
disable it if you don't like the additional latency.
The aspiration is to remove this entire feature once we work out any more kinks in Metrictank's request validation.
- As of v0.13.1-788-g79e4709 (see #1831), the option `reject-invalid-tags` was removed. Another option
named `reject-invalid-input` was added to take its place, and its default value is `true`. This new option rejects invalid tags as well as invalid UTF-8 data found in either the metric name or the tag key or tag value. The exported stat `input.xx.metricdata.discarded.invalid_tag`
was also changed to `input.xx.metricdata.discarded.invalid_input`, so dashboards will need to be updated accordingly.
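As an arithmetic sketch of the future-tolerance cutoff described in the first item above (illustrative only, not Metrictank's actual code; all names here are made up for the example):

```go
package main

import "fmt"

func main() {
	// With the default storage schema 1s:35d:10min:7, the raw retention TTL is 35 days.
	rawTTL := 35 * 24 * 3600 // raw retention TTL in seconds
	ratio := 0.1             // default retention.future-tolerance-ratio

	// Points with a timestamp further than this many seconds into the
	// future would be rejected once enforcement is turned on.
	cutoff := int(float64(rawTTL) * ratio)

	fmt.Println(cutoff)                      // 302400, i.e. 3.5 days
	fmt.Println(cutoff == 3*24*3600+12*3600) // true: 3.5d expressed in seconds
}
```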
index
- performance improvement meta tags #1541, #1542
- Meta tag support bigtable. #1646
- bugfix: return correct counts when deleting multiple tagged series. #1641
- fix: auto complete should not ignore meta tags if they are also metric tags. #1649
- fix: update cass/bt index when deleting tagged metrics. #1657
- fix various index bugs. #1664, #1667, #1748, #1766, #1833
- bigtable index fix: only load current metricdefs. #1564
- Fix deadlock when write queue full. #1569
fakemetrics
- filters. first filter is an "offset filter". #1762
- import 'schemasbackfill' mode. #1666
- carbon tag support. #1691
- add values policy. #1773
- configurable builders + "daily-sine" value policy. #1815
- add a "Containers" mode to fakemetrics with configurable churn #1859
other tools
- mt-gateway: new tool to receive data over http and save into kafka, for MT to consume. #1608, #1627, #1645
- mt-parrot: continuous validation by sending dummy stats and querying them back. #1680
- mt-whisper-importer-reader: print message when everything done with the final stats. #1617
new native processing functions
- aggregate() #1751
- aliasByMetric() #1755
- constantLine() #1734 (note: due to a yet undiagnosed bug, was disabled in #1783 )
- groupByNode, groupByNodes. #1753, #1774
- invert(). #1791
- minMax(). #1792
- offset() #1621
- removeEmptySeries() #1754
- round() #1719
- unique() #1745
other
- dashboard tweaks. #1557, #1618
- docs improvements #1559 , #1620, #1594, #1796
- tags/findSeries - add lastts-json format. #1580
- add catastrophe recovery for cassandra (re-resolve when all IP's have changed). #1579
- Add `/tags/terms` query to get counts of tag values #1582
- expr: be more lenient: allow quoted ints and floats #1622
- Replaced Macaron logger with custom logger enabling query statistics. #1634
- Logger middleware: support gzipped responses from graphite-web. #1693
- Fix error status codes. #1684
- Kafka ssl support. #1701
- Aggmetrics: track how long a GC() run takes and how many series deleted #1746
- only connect to peers that have non-null ip address. #1758
- Added a panic recovery middleware after the logger so that we know what the query was that triggered a panic. #1784
- asPercent: don't panic on empty input. #1788
- Return 499 http code instead of 502 when client disconnect during a render query with graphite proxying. #1821
- Deduplicate resolve series requests. #1794
- Deduplicate duplicate fetches #1855
- set points-return more accurately. #1835
Meta tag and http api improvements, lineage metadata, per partition metrics and more
meta tags
- correctly clear enrichment cache on upsert #1472
- meta tag records must be optional in meta tag upsert requests #1473
- Persist meta records to Cassandra index #1471
- Remove hashing in enricher #1512
- skip meta tag enrichment when we can #1515
- Optimize autocomplete queries #1514
- Performance optimizations for meta tag queries #1517
- Rewrite enricher to use map lookup #1523
reorder buffer
- version v0.13.0-188-g6cd12d6 introduces storage-schemas.conf option 'reorderBufferAllowUpdate' to allow for some data to arrive out of order. #1531
http api
- Added "ArgQuotelessString" argument for interpreting strings in the request that don't have quotes around them (e.g. keepLastValue INF)
- Fix /find empty response resulting in "null" #1464
- patch and update macaron/binding middleware to support content-length header for GET requests #1466
- Fix removeAboveBelowPercentile panic #1518
- rollup indicator (lineage information in response metadata) #1481, #1508
- return proper errors upon bad request from user #1520
- correct for delayed lastUpdate updates #1532
monitoring
- report the priority and readiness per partition as a metric #1504, #1507
- dashboard fix: maxSeries not a valid groupByNodes callback. #1491
- MemoryReporter: make call to runtime.ReadMemStats time bound to avoid lost metrics #1494
tools
- remove deprecated cli argument ttls from mt-whisper-importer-writer
- add tool to calculate the id of metrics: mt-keygen #1526
misc
- Don't validate MaxChunkSpan if BigTable store is disabled #1470
- lower default max chunk cache size to 512MB #1476
- add initial hosted metrics graphite documentation #1501, #1502
- Add 'benchmark' target to Makefile that runs all benchmarks #1498
- in cluster calls, set user agent #1469
docker stack
Meta tags beta, sharding by tags, new importer (bigtable!), response stats, memory-idx write queue and many fixes
breaking changes
- As of v0.12.0-404-gc7715cb2 we clean up poorly formatted graphite metrics better. To the extent that they previously worked, queries may need some adjusting. #1435
- Version v0.12.0-96-g998933c3 introduces config options for the cassandra/scylladb index table names.
The default settings and schemas match the previous behavior, but people who have customized the schema-idx template files
should know that we no longer only expand the keyspace (and assume a hardcoded table name).
Now both the `schema_table` and `schema_archive_table` sections in the template file should have
2 `%s` sections, which will be expanded to the `keyspace` and `table`, or `keyspace` and
`archive-table` settings respectively, configured under `cassandra-idx` in the metrictank config file.
- Version v0.12.0-81-g4ee87166 and later reject metrics with invalid tags on ingest by default; this can be disabled via the
`input.reject-invalid-tags` flag.
If you're unsure whether you're currently sending invalid tags, it's a good idea to first disable the invalid tag rejection and watch the
new counter called `input.<input name>.metricdata.discarded.invalid_tag`: if invalid tags get ingested, this counter will increase without
rejecting them. Once you're sure that you don't ingest invalid tags, you can enable rejection to enforce the validation.
More information in #1348.
- Version v0.12.0-54-g6af26a3d and later have a refactored jaeger configuration + many more options. #1341
The following config options have been renamed:
  - `tracing-enabled` -> `jaeger.enabled`
  - `tracing-addr` -> `jaeger.agent-addr`
  - `tracing-add-tags` -> `jaeger.add-tags` (now also key=value instead of key:value)
- As of v0.12.0-43-g47bd3cb7, mt-whisper-importer-writer defaults to the new importer path, "/metrics/import" instead of "/chunks", and
uses an "http-endpoint" flag instead of "listen-address" and "listen-port".
importer
- bigtable importer #1291
- Make the importer utilities rely on TSDB-GW for authentication and org-association #1335
- fix TTL bug: calculate TTL relative to now when inserting into cassandra. #1448
other
- meta tags (beta feature):
- fix kafka backlog processing to not get stuck/timeout if no messages #1315, #1328, #1350, #1352, #1360
- memleak fix: Prevent blocked go routines to hang forever #1333, #1337
- update jaeger client v2.15.0 -> v2.16.0, jaeger-lib v1.2.1 -> v2.0.0 #1339
- Update Shopify/sarama from v1.19.0 to v1.23.0
- add orgid as jaeger tag, to ease searching by orgid #1366
- Fix active series stats decrement #1336
- render response metadata: stats #1334
- fix prometheus input plugin resetting its config at startup #1346
- make index/archive tables configurable #1348
- add writeQueue buffer to memoryIdx #1365
- remove tilde from name values when indexing tags #1371
- Jaeger cleanup: much fewer spans, but with more stats - and more stats for meta section #1380, #1410
- increase connection pool usage #1412
- New flag 'ingest-from' #1382
- flush aggregates more eagerly when we can #1425
- Peer query speculative fixes and improvements #1430
- support sharding by tags #1427, #1436, #1444
- Fix uneven length panics #1452
new query api functions
Query nodes, find cache and various performance tweaks
Important changes that require your attention:
- This release includes the "query layer" functionality.
Versions prior to v0.11.0-184-g293b55b9 cannot handle query nodes joining the cluster and will crash.
To deploy the new query nodes and introduce them into the cluster, you must first
upgrade all other nodes to this version (or later). Also, regarding cluster.mode:
- Since v0.11.0-169-g59ebb227, kafka-version now defaults to 2.0.0 instead of 0.10.0.0. Make sure to set
this to a proper version if your brokers are not at least at version 2.0.0. See #1221
- Since v0.11.0-233-gcf24c43a, if queries need rollup data but asked for a consolidateBy() without a matching rollup aggregation,
we pick the most appropriate rollup from what is available.
- Since v0.11.0-252-ga1e41192, the log-min-dur flag has been removed; it was no longer used. #1275
- Since v0.11.0-285-g4c862d8c, duplicate points are now always rejected, even with the reorder buffer enabled.
Note from the future: this was undone in v0.13.0-188-g6cd12d6; see the later notes about reorderBufferAllowUpdate.
index
- cassandra index: load partitions in parallel. #1270
- Add partitioned index (experimental and not recommended) #1232
- add the mt-index-prune utility #1231, #1235
- fix index corruption: make sure defBy{Id,TagSet} use same pointers #1303
api
- Improve performance of SetTags #1158
- speed up cross series aggregators by being more cache friendly #1164
- Fix summarize crash #1170
- groupByTags Performance improvements + fix setting consolidator per group + fix alias functions name tag setting #1165
- Meta tags part 1: meta record data structures and corresponding CRUD api calls (experimental) #1301
- Add absolute function #1300
- Add find cache to speed up render requests #1233, #1236, #1263, #1265, #1266, #1285
- Added 8 functions related to filterSeries #1308
- Added cumulative function #1309
docs
monitoring
- fix render latency queries + remove old dashboards #1192
- Dashboard: mem.to_iter fix and use UTC #1219
- refactor ingestion related metrics, in particular wrt drops. add prometheus stats. #1278, #1288
- fix decrement tank.total_points upon GC. fix #1239
Misc
- Dynamic GOGC based on ready state #1194
- improve kafka-mdm stats/priority tracking #1200
- tweak cluster priority calculation to be resilient against GC pauses #1022, #1218
- update messagepack to v1.1 #1214
- mt-index-cat should use NameWithTags() when listing series #1267
- improvement for reorder buffer. #1211
- debug builds and images. #1187
Bigtable, chunk formats, fixes and breaking changes
Important changes that require your attention:
- With our previous chunk format, when both:
  - using chunks of >4 hours
  - the time delta between start of chunk and first point is >4.5 hours
the encoded delta became corrupted, and reading the chunk results in incorrect data.
This release brings a remediation to recover the data at read time, as well
as a new chunk format that does not suffer from the issue.
The new chunks are also about 9 bytes shorter in the typical case.
While metrictank now writes to the store exclusively using the new format, it can read from the store in any of the formats.
This means readers should be upgraded before writers,
to avoid the situation where an old reader cannot parse a chunk written by a newer
writer during an upgrade. See #1126, #1129
- We now use logrus for logging. #1056, #1083
Log levels are now strings, not integers.
See the updated config file.
- Index pruning is now configurable via index-rules.conf. #924, #1120
We no longer use a `max-stale` setting in the `cassandra-idx` section,
and instead gained an `index-rules-conf` setting.
- The NSQ cluster notifier has been removed. NSQ is a delight to work with, but we could
only use it for a small portion of our clustering needs, requiring Kafka anyway for data ingestion
and distribution. We've been using Kafka for years and neglected the NSQ notifier code, so it was time to rip it out.
See #1161
- The offset manager for the kafka input / notifier plugin has been removed since there was no need for it.
`offset=last` is thus no longer valid. See #1110
index and store
- support for bigtable index and storage #1082, #1114, #1121
- index pruning rate limiting #1065 , #1088
- clusterByFind: limit series and streaming processing #1021
- idx: better log msg formatting, include more info #1119
clustering
- fix nodes sometimes not becoming ready by dropping node updates that are old or about thisNode. #948
operations
- disable tracing for healthchecks #1054
- Expose AdvertiseAddr from the clustering configuration #1063 , #1097
- set sarama client KafkaVersion via config #1103
- Add cache overhead accounting #1090, #1184
- document cache delete #1122
- support per-org `metrics_active` for scraping by prometheus #1160
- fix idx active metrics setting #1169
- dashboard: give rows proper names #1184
tank
- cleanup GC related code #1166
- aggregated chunk GC fix (for sparse data, aggregated chunks were GC'd too late, which may result in data loss when doing cluster restarts); also lower the default `metric-max-stale` #1175, #1176
- allow specifying timestamps to mark archives as ready more granularly #1178
tools
- mt-index-cat: add partition support #1068 , #1085
- mt-index-cat: add `min-stale` option, rename `max-age` to `max-stale` #1064
- mt-index-cat: support custom patterns and improve bench-realistic-workload-single-tenant.sh #1042
- mt-index-cat: make `NameWithTags()` callable from template format #1157
- mt-store-cat: print t0 of chunks #1142
- mt-store-cat: improvements: glob filter, chunk-csv output #1147
- mt-update-ttl: tweak default concurrency, stats fix, properly use logrus #1167
- mt-update-ttl: use standard store, specify TTL's not tables, auto-create tables + misc #1173
- add mt-kafka-persist-sniff tool #1161
- fixes #1124
misc
- better benchmark scripts #1015
- better documentation for our input formats #1071
- input: prevent integer values overflowing our index datatypes, which fixes index saves blocking #1143
- fix ccache memory leak #1078
- update jaeger-client to 2.15.0 #1093
- upgrade Sarama to v1.19 #1127
- fix panic caused by multiple closes of pluginFatal channel #1107
- correctly return error from NewCassandraStore() #1111
- clean way of skipping expensive and integration tests. #1155, #1156
- fix duration vars processing and error handling in cass idx #1141
- update release process, tagging, repo layout and version formatting. update to go1.11.4 #1177, #1180, #1181
- update docs for bigtable, storage-schemas.conf and tank GC #1182
performance fix: pruning effect on latency, go 1.11, etc
- when pruning the index, use more granular locking (prune individual metrics separately rather than all at once). This can significantly reduce request latencies: previously, some render requests could be blocked for a long time by long-running index prunes (which was especially bad with high series churn). Now there is practically no latency overhead (prunes run longer, but that's not a problem). #1062
- emit the current MT version as a metric #1041
- k8s: allow passing `GO*` variables as `MT_GO*` #1044
- better docs for running MT via docker #1040, #1047
- make fragile duration integer config values parseable as string durations #1017
- go 1.11 #1045
- no need for `$(go list ./... | grep -v /vendor/)` #1050
Clustering important bugfix + faster ingest, speculative query execution, more graphite functions and lots more
There was a bug in 0.9 which caused instances to incorrectly encode IDs for tracking saved rollup chunks, which in some cases could cause data loss when write nodes restarted and overwrote rollup chunks with partial chunks. Because of this, we strongly recommend upgrading to this version.
index
- use atomics in index ingestion path, yielding about a ~30% ingestion speed improvement. dbd7440, #945
- Fix multi-interval series issue, handle regressing timestamps #897
- fix race condition in Findtags #946
store and chunk cache
- support cosmosdb, cassandra connectTimeout #922
- refactor cassandrastore read queries (reduces cassandra CPU usage), support disabling chunkcache #931
- chunk cache: Block to submit accounting events, add some metrics #1010
- workaround chunk cache corruption bug 10a745c
- chunk cache perf fixes: AddRange + batched accounting #943 #1006
- chunk cache corruption testing #1009
core
- cleanup from-to range error handling in dataprocessor, aggmetrics, cache, store + fixes #919
- read from first chunks #994
- Speculative query execution #956 #979 #1000
- fix clustering chunk ID referencing. #972
- fix logger #991
- set clustering (notifier) partitions more directly #952
API server
- new functions: isNonNull #959, scaleToSeconds #970, countSeries #974, filterSeries #980, removeAboveBelowValue 34febb0, highest/lowest a958b51, asPercent #966, derivative and nonNegativeDerivative #996, sortBy, sortByMaxima, and sortByTotal #997, removeAbovePercentile and removeBelowPercentile #992, keepLastValue #995
- Fix summarize function #928
- Add show plan endpoint #961
- update gziper, saving memory allocations #964
- workaround invalid HTTP status codes #987
- endpoint for analyzing priority calculation #932
stats
- handle write errors/timeouts and remote connection close better #918
- Fix points return measurement overflow #953
- fix pointsReturn to consider MaxDataPoints based consolidation #957
- Monitor rss, vsz, cpu usage and page faults #1028
- expose a few key metrics as rates as well #920
build, docker environments & CI
- move metrictank.go to cmd directory #935, #939
- vendoring #934
- fix dep, don't use gopkg.in for schema #1004
- go 1.10.3, qa tests #990
- remove bogus test flags #1005
- grafana/grafana:latest in all docker envs #1029
- better qa scripts #1034
- docker benchmark updates #1037
- Docker grafana automatic provisioning #1039
Docs
- document graphite functions better #998
- cassandra, devdocs #1003
- update docker quickstart guide #1038
- dev docs #1035
- Doc updates #927
Tools
major kafka ingestion format changes + some other stuff
kafka format upgrade
support for the new MetricPoint optimized data format in the kafka mdm topic, resulting in less kafka IO, disk usage, GC workload, metrictank and kafka CPU usage, and faster backfills. #876, #885, #890, #891, #894, #911
this also comes with:
- new dashboard
- updates of internal representations of keys, archives, index structures, intra-cluster requests, etc. (so you must do a colored upgrade or new cluster deployment)
- metrics changes: metrics_received and metrics_invalid changed, see 01dabf9, a772c10
- removal of mt-replicator-via-tsdb tool
- deprecation of .Metric field in MetricData and MetricDefinition. it is now ignored in incoming MetricData and in the cassandra index tables.
- mt-schemas-explain: also show built-in default
- various updates to some of the docker stacks (use latest graphite, control x-org-id authentication, use prometheus docker monitoring, update for latest fakemetrics and tsdb-gw, etc.)
other
- upgrade to latest sarama (kafka) library. #905
- remove pressure.idx and pressure.tank metrics, too much overhead and too little use. #905
- sarama reduce default kafka-mdm channel-buffer-size #886
- refactor chaos testing and carbon end2end test into a unified testing library and test functions. #830
- migrate "public org id" functionality from magic org=-1 to a >0 orgid to be specified via the `public-org` config setting. #880, #883
- fix cluster load balancing #884
- initialize cassandra schemas via template files, support scylladb schemas. #898
- support adding arbitrary extra tags in jaeger traces #904
- Accept bool strings for ArgBool #895