两个核心概念:bucket和metric
bucket:分组,就是我们的bucket
metric:对一个数据分组执行的统计
家电卖场案例以及统计哪种颜色电视销量最高
以一个家电卖场中的电视销售数据为背景,来对各种品牌,各种颜色的电视的销量和销售额,进行各种各样角度的分析
POST /tvs/sales/_bulk { "index": {}} { "price" : 1000, "color" : "红色", "brand" : "长虹", "sold_date" : "2016-10-28" } { "index": {}} { "price" : 2000, "color" : "红色", "brand" : "长虹", "sold_date" : "2016-11-05" } { "index": {}} { "price" : 3000, "color" : "绿色", "brand" : "小米", "sold_date" : "2016-05-18" } { "index": {}} { "price" : 1500, "color" : "蓝色", "brand" : "TCL", "sold_date" : "2016-07-02" } { "index": {}} { "price" : 1200, "color" : "绿色", "brand" : "TCL", "sold_date" : "2016-08-19" } { "index": {}} { "price" : 2000, "color" : "红色", "brand" : "长虹", "sold_date" : "2016-11-05" } { "index": {}} { "price" : 8000, "color" : "红色", "brand" : "三星", "sold_date" : "2017-01-01" } { "index": {}} { "price" : 2500, "color" : "蓝色", "brand" : "小米", "sold_date" : "2017-02-12" }
统计哪种颜色的电视销量最高
GET /tvs/sales/_search { "size" : 0, "aggs" : { "popular_colors" : { "terms" : { "field" : "color" } } } }
size:只获取聚合结果,而不要执行聚合的原始数据
aggs:固定语法,要对一份数据执行分组聚合操作
popular_colors:就是对每个aggs,都要起一个名字,这个名字是随机的,你随便取什么都ok
terms:根据字段的值进行分组
field:根据指定的字段的值进行分组
结果
{ "took": 61, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 8, "max_score": 0, "hits": [] }, "aggregations": { "popular_color": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "红色", "doc_count": 4 }, { "key": "绿色", "doc_count": 2 }, { "key": "蓝色", "doc_count": 2 } ] } } }
hits.hits:我们指定了size是0,所以hits.hits就是空的,否则会把执行聚合的那些原始数据给你返回回来
aggregations:聚合结果
popular_color:我们指定的某个聚合的名称
buckets:根据我们指定的field划分出的buckets
key:每个bucket对应的那个值
doc_count:这个bucket分组内,有多少个数据
数量,其实就是这种颜色的销量
每种颜色对应的bucket中的数据的
默认的排序规则:按照doc_count降序排序
统计每种颜色电视平均价格
GET /tvs/sales/_search { "size" : 0, "aggs": { "colors": { "terms": { "field": "color" }, "aggs": { "avg_price": { "avg": { "field": "price" } } } } } }
在一个aggs执行的bucket操作(terms),平级的json结构下,再加一个aggs,这个第二个aggs内部,同样取个名字,执行一个metric操作,avg,对之前的每个bucket中的数据的指定的field,price field,求一个平均值
"aggs": { "avg_price": { "avg": { "field": "price" } } }
就是一个metric,就是一个对一个bucket分组操作之后,对每个bucket都要执行的一个metric
第一个metric,avg,求指定字段的平均值
{ "took": 28, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 8, "max_score": 0, "hits": [] }, "aggregations": { "group_by_color": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "红色", "doc_count": 4, "avg_price": { "value": 3250 } }, { "key": "绿色", "doc_count": 2, "avg_price": { "value": 2100 } }, { "key": "蓝色", "doc_count": 2, "avg_price": { "value": 2000 } } ] } } }
buckets,除了key和doc_count
avg_price:我们自己取的metric aggs的名字
value:我们的metric计算的结果,每个bucket中的数据的price字段求平均值后的结果
select avg(price)
from tvs.sales
group by color
bucket嵌套实现颜色+品牌的多层下钻分析
从颜色到品牌进行下钻分析,每种颜色的平均价格,以及找到每种颜色每个品牌的平均价格
我们可以进行多层次的下钻
比如说,现在红色的电视有4台,同时这4台电视中,有3台是属于长虹的,1台是属于小米的
红色电视中的3台长虹的平均价格是多少?
红色电视中的1台小米的平均价格是多少?
下钻的意思是,已经分了一个组了,比如说颜色的分组,然后还要继续对这个分组内的数据,再分组,比如一个颜色内,还可以分成多个不同的品牌的组,最后对每个最小粒度的分组执行聚合分析操作,这就叫做下钻分析
es,下钻分析,就要对bucket进行多层嵌套,多次分组
按照多个维度(颜色+品牌)多层下钻分析,而且学会了每个下钻维度(颜色,颜色+品牌),都可以对每个维度分别执行一次metric聚合操作
GET /tvs/sales/_search { "size": 0, "aggs": { "group_by_color": { "terms": { "field": "color" }, "aggs": { "color_avg_price": { "avg": { "field": "price" } }, "group_by_brand": { "terms": { "field": "brand" }, "aggs": { "brand_avg_price": { "avg": { "field": "price" } } } } } } } }
结果
{ "took": 8, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 8, "max_score": 0, "hits": [] }, "aggregations": { "group_by_color": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "红色", "doc_count": 4, "color_avg_price": { "value": 3250 }, "group_by_brand": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "长虹", "doc_count": 3, "brand_avg_price": { "value": 1666.6666666666667 } }, { "key": "三星", "doc_count": 1, "brand_avg_price": { "value": 8000 } } ] } }, { "key": "绿色", "doc_count": 2, "color_avg_price": { "value": 2100 }, "group_by_brand": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "TCL", "doc_count": 1, "brand_avg_price": { "value": 1200 } }, { "key": "小米", "doc_count": 1, "brand_avg_price": { "value": 3000 } } ] } }, { "key": "蓝色", "doc_count": 2, "color_avg_price": { "value": 2000 }, "group_by_brand": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "TCL", "doc_count": 1, "brand_avg_price": { "value": 1500 } }, { "key": "小米", "doc_count": 1, "brand_avg_price": { "value": 2500 } } ] } } ] } } }
统计每种颜色电视最大最小价格
count:bucket,terms,自动就会有一个doc_count,就相当于是count
avg:avg aggs,求平均值
max:求一个bucket内,指定field值最大的那个数据
min:求一个bucket内,指定field值最小的那个数据
sum:求一个bucket内,指定field值的总和
一般来说,90%的常见的数据分析的操作,metric,无非就是count,avg,max,min,sum
GET /tvs/sales/_search { "size" : 0, "aggs": { "colors": { "terms": { "field": "color" }, "aggs": { "avg_price": { "avg": { "field": "price" } }, "min_price" : { "min": { "field": "price"} }, "max_price" : { "max": { "field": "price"} }, "sum_price" : { "sum": { "field": "price" } } } } } }
求总和,就可以拿到一个颜色下的所有电视的销售总额
{ "took": 16, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 8, "max_score": 0, "hits": [] }, "aggregations": { "group_by_color": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "红色", "doc_count": 4, "max_price": { "value": 8000 }, "min_price": { "value": 1000 }, "avg_price": { "value": 3250 }, "sum_price": { "value": 13000 } }, { "key": "绿色", "doc_count": 2, "max_price": { "value": 3000 }, "min_price": { "value": }, 1200 "avg_price": { "value": 2100 }, "sum_price": { "value": 4200 } }, { "key": "蓝色", "doc_count": 2, "max_price": { "value": 2500 }, "min_price": { "value": 1500 }, "avg_price": { "value": 2000 }, "sum_price": { "value": 4000 } } ] } } }
hitogram按价格区间统计电视销量和销售额
histogram:类似于terms,也是进行bucket分组操作,接收一个field,按照这个field的值的各个范围区间,进行bucket分组操作
"histogram":{ "field": "price", "interval": 2000 },
interval:2000,划分范围,0~2000,2000~4000,4000~6000,6000~8000,8000~10000,buckets
去根据price的值,比如2500,看落在哪个区间内,比如2000~4000,此时就会将这条数据放入2000~4000对应的那个bucket中
bucket划分的方法,terms,将field值相同的数据划分到一个bucket中
bucket有了之后,一样的,去对每个bucket执行avg,count,sum,max,min,等各种metric操作,聚合分析
GET /tvs/sales/_search { "size" : 0, "aggs":{ "price":{ "histogram":{ "field": "price", "interval": 2000 }, "aggs":{ "revenue": { "sum": { "field" : "price" } } } } } }
{ "took": 13, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 8, "max_score": 0, "hits": [] }, "aggregations": { "group_by_price": { "buckets": [ { "key": 0, "doc_count": 3, "sum_price": { "value": 3700 } }, { "key": 2000, "doc_count": 4, "sum_price": { "value": 9500 } }, { "key": 4000, "doc_count": 0, "sum_price": { "value": 0 } }, { "key": 6000, "doc_count: { "value":": 0, "sum_price" 0 } }, { "key": 8000, "doc_count": 1, "sum_price": { "value": 8000 } } ] } } }