参见官方文档:https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
聚合用于统计ES中的文档。
聚合语句的格式:
"aggregations" : {
"<aggregation_name>" : {
"<aggregation_type>" : {
<aggregation_body>
}
[,"meta" : { [<meta_data_body>] } ]?
[,"aggregations" : { [<sub_aggregation>]+ } ]?
}
[,"<aggregation_name_2>" : { ... } ]*
}
桶(Bucket)聚合
指标(Metric)聚合
矩阵(Matrix)聚合 实验特性
管道(Pipeline)聚合 Pipeline Aggregations 是一组工作在其他聚合计算结果而不是文档集合的聚合。
其中,只有矩阵聚合不能使用脚本。
POST /follower/_search
{
"aggs": {
"avg_rename_days": {
"avg": {
"field": "gender"
}
}
},
"size": 0
}
这里把 size 设置为0的意思是只显示聚合数据。返回:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 10000,
"relation" : "gte"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"avg_rename_days" : {
"value" : 0.3915204490434261
}
}
}
对结果进行操作:
POST /follower/_search
{
"aggs": {
"avg_rename_days": {
"avg": {
"field": "gender",
"script" : {
"lang": "painless",
"source": "_value + params.correction",
"params" : {
"correction" : 1
}
}
}
}
},
"size": 0
}
结果:
{
"took" : 6,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 10000,
"relation" : "gte"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"avg_rename_days" : {
"value" : 1.391520449043426
}
}
}
就是统计某个字段有多少种类型。在非数字类型字段进行统计前要执行:
PUT /follower/_mapping
{
"properties": {
"v": {
"type": "text",
"fielddata": true
}
}
}
然后就可以统计了:
POST /follower/_search
{
"aggs": {
"type_count": {
"cardinality": {
"field": "v",
"precision_threshold": 1
}
}
},
"size": 0
}
结果:
{
"took" : 503,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 10000,
"relation" : "gte"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"type_count" : {
"value" : 1362
}
}
}
可以看到,v 字段的值有 1362 种。
一口气把 avg ,sum,max,min 等等全部求出来:
POST /follower/_search
{
"aggs": {
"stats": {
"extended_stats": {
"field": "gender"
}
}
},
"size": 0
}
结果:
{
"took" : 16,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 10000,
"relation" : "gte"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"stats" : {
"count" : 46677,
"min" : -1.0,
"max" : 1.0,
"avg" : 0.3915204490434261,
"sum" : 18275.0,
"sum_of_squares" : 36441.0,
"variance" : 0.6274174388613534,
"std_deviation" : 0.7920968620448848,
"std_deviation_bounds" : {
"upper" : 1.9757141731331958,
"lower" : -1.1926732750463436
}
}
}
}
max 很简单,就是找最大数。
POST /sales/_search?size=0
{
"aggs" : {
"max_price" : { "max" : { "field" : "price" } }
}
}
min 就是找最小的数:
POST /sales/_search?size=0
{
"aggs" : {
"min_price" : { "min" : { "field" : "price" } }
}
}
percentile 百分位统计是用于评估当前数值分布情况的一种统计方法,比如50 percentile 是 98 , 是指 50%的数值都在98以内,那么通过这里你就可以了解你数据的一个分布特征。常见的一个应用是我们制定 SLA 的时候常说 99% 的请求延迟都在100ms 以内,这个时候你就可以用 99 percentile 来查一下,看一下 99 percenttile 的值如果在 100ms 以内,那么恭喜你 SLA 达标了,否则就说明有问题。
GET latency/_search
{
"size": 0,
"aggs" : {
"load_time_outlier" : {
"percentiles" : {
"field" : "load_time"
}
}
}
}
默认是 [ 1, 5, 25, 50, 75, 95, 99 ]:
{
...
"aggregations": {
"load_time_outlier": {
"values" : {
"1.0": 5.0,
"5.0": 25.0,
"25.0": 165.0,
"50.0": 445.0,
"75.0": 725.0,
"95.0": 945.0,
"99.0": 985.0
}
}
}
}
可以指定百分数:
GET latency/_search
{
"size": 0,
"aggs" : {
"load_time_outlier" : {
"percentiles" : {
"field" : "load_time",
"percents" : [95, 99, 99.9]
}
}
}
}
理解了 percentile,那么 percentile rank 其实就是反向查询,比如我想看一下 100 在当前数值中处于哪一个范围内,你查一下它的 rank,发现是95,那么说明有95%的数值都在100以内。
GET latency/_search
{
"size": 0,
"aggs" : {
"load_time_ranks" : {
"percentile_ranks" : {
"field" : "load_time",
"values" : [500, 600]
}
}
}
}
结果:
{
...
"aggregations": {
"load_time_ranks": {
"values" : {
"500.0": 90.01,
"600.0": 100.0
}
}
}
}
写脚本来聚合:
POST ledger/_search?size=0
{
"query" : {
"match_all" : {}
},
"aggs": {
"profit": {
"scripted_metric": {
"init_script" : "state.transactions = []",
"map_script" : "state.transactions.add(doc.type.value == 'sale' ? doc.amount.value : -1 * doc.amount.value)",
"combine_script" : "double profit = 0; for (t in state.transactions) { profit += t } return profit",
"reduce_script" : "double profit = 0; for (a in states) { profit += a } return profit"
}
}
}
}
min
, max
, sum
, count
and avg
的统称:
POST /exams/_search?size=0
{
"aggs" : {
"grades_stats" : { "stats" : { "field" : "grade" } }
}
}
结果:
{
...
"aggregations": {
"grades_stats": {
"count": 2,
"min": 50.0,
"max": 100.0,
"avg": 75.0,
"sum": 150.0
}
}
}
计算和
POST /sales/_search?size=0
{
"query" : {
"constant_score" : {
"filter" : {
"match" : { "type" : "hat" }
}
}
},
"aggs" : {
"hat_prices" : { "sum" : { "field" : "price" } }
}
}
结果:
{
...
"aggregations": {
"hat_prices": {
"value": 450.0
}
}
}
有多少种值:
POST /sales/_search?size=0
{
"aggs" : {
"types_count" : { "value_count" : { "field" : "type" } }
}
}
结果:
{
...
"aggregations": {
"types_count": {
"value": 7
}
}
}
绝对中位差是统计学中的一个术语。
绝对中位差是一种统计离差的测量。而且,MAD是一种鲁棒统计量,比标准差更能适应数据集中的异常值。对于标准差,使用的是数据到均值的距离平方,所以大的偏差权重更大,异常值对结果也会产生重要影响。对于MAD,少量的异常值不会影响最终的结果。
由于MAD是一个比样本方差或者标准差更鲁棒的度量,它对于不存在均值或者方差的分布效果更好,比如柯西分布。
例子:
GET reviews/_search
{
"size": 0,
"aggs": {
"review_average": {
"avg": {
"field": "rating"
}
},
"review_variability": {
"median_absolute_deviation": {
"field": "rating"
}
}
}
}
结果:
{
...
"aggregations": {
"review_average": {
"value": 3.0
},
"review_variability": {
"value": 2.0
}
}
}
分组统计文档数量:
PUT /emails/_bulk?refresh
{ "index" : { "_id" : 1 } }
{ "accounts" : ["hillary", "sidney"]}
{ "index" : { "_id" : 2 } }
{ "accounts" : ["hillary", "donald"]}
{ "index" : { "_id" : 3 } }
{ "accounts" : ["vladimir", "donald"]}
GET emails/_search
{
"size": 0,
"aggs" : {
"interactions" : {
"adjacency_matrix" : {
"filters" : {
"grpA" : { "terms" : { "accounts" : ["hillary", "sidney"] }},
"grpB" : { "terms" : { "accounts" : ["donald", "mitt"] }},
"grpC" : { "terms" : { "accounts" : ["vladimir", "nigel"] }}
}
}
}
}
}
结果:
{
"took": 9,
"timed_out": false,
"_shards": ...,
"hits": ...,
"aggregations": {
"interactions": {
"buckets": [
{
"key":"grpA",
"doc_count": 2
},
{
"key":"grpA&grpB",
"doc_count": 1
},
{
"key":"grpB",
"doc_count": 2
},
{
"key":"grpB&grpC",
"doc_count": 1
},
{
"key":"grpC",
"doc_count": 1
}
]
}
}
}