HashAggregate VS sort+GroupAgg-阿里云开发者社区

什么是聚合

聚合就是把元素按照一定的规则分为不同的组，然后对各组元素进行计算。

如图：下面是通过颜色分组，然后计算sum。
_2018_10_23_4_09_08

postgresql中聚合算法

GroupAggregate: get the data sorted and apply aggregation function to groups one by one。
HashAggregate: store state for each key in a hash table

对比两种算法

_2018_10_23_4_17_40

GroupAggregate

GroupAggregatede 特点是在进行聚合之前先要将数据进行排序，然后进行聚合操作，而且出来的结果是有序的。（postgresql使用enable_sort控制）

HashAggregate

特点是不需要进行排序，在组数值比较小的情况下是比GroupAggregate要快很多，但是需求的内存会比较多。只能进行一些简单的聚合，像count（distinct ...）这类聚合是做不了的。

对比两种算法速率

创建9个测试表
create table t_1  (branch_id bigint, amount numeric);
create table t_10   (branch_id bigint, amount numeric);
create table t_100   (branch_id bigint, amount numeric);
create table t_1000   (branch_id bigint, amount numeric);
create table t_10000  (branch_id bigint, amount numeric);
create table t_100000 (branch_id bigint, amount numeric);
create table t_1000000 (branch_id bigint, amount numeric);
create table t_10000000 (branch_id bigint, amount numeric);
create table t_100000000 (branch_id bigint, amount numeric);

每个表的大小为1亿数据量，控制每张表组个数（其实就是distinct值）。
insert into t_1 select mod(i, 1), random()     from generate_series(1,100000000) s(i);
insert into t_10 select mod(i, 10), random()     from generate_series(1,100000000) s(i);
insert into t_100 select mod(i, 100), random()     from generate_series(1,100000000) s(i);
...
vacuum analyze t_1;
vacuum analyze t_10;
vacuum analyze t_100;
...

;

_2018_10_23_4_37_58

可以看到在组个数在10^7以下时，hashagg都要比groupagg优。其中hashagg的耗时也是跟着组个数逐渐升高的。当值达到10^8 时，执行计划会强制走groupagg，即使关闭了enable_sort.

大部分情况下hashagg的效率都会比groupagg要好，主要是sort这个操作比较耗时,本质上hashagg是在用空间（内存）换时间，内存充足的情况下这种做ok，内存不足是容易oom（pg在内存不足的情况下是会强制走groupagg）

原理解析

_2018_10_23_4_48_13
_2018_10_23_4_49_21

应用

应该尽量将groupagg转化为hashagg执行，大部分情况下执行器会自行选择最优算法，但在一些特定情况下，需要手动改写。

表t1

create table t1   (branch_id bigint, amount numeric);
insert into t1 select mod(i, 1000), random()     from generate_series(1,10000000) s(i);

SQL

select branch_id,count(distinct amount) from t1 group by branch_id;

postgres=# explain analyze select branch_id,count(distinct amount) from t1 group by branch_id;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=1326385.18..1401396.05 rows=1000 width=16) (actual time=2524.329..8132.612 rows=1000 loops=1)
   Group Key: branch_id
   ->  Sort  (cost=1326385.18..1351385.47 rows=10000115 width=19) (actual time=2517.526..4227.713 rows=10000000 loops=1)
         Sort Key: branch_id
         Sort Method: quicksort  Memory: 1174466kB
         ->  Seq Scan on t1  (cost=0.00..163696.15 rows=10000115 width=19) (actual time=0.015..824.249 rows=10000000 loops=1)
 Planning time: 0.105 ms
 Execution time: 8234.834 ms
(8 rows)

现在需要转化为hashagg语句，强制关闭sort,发现执行计划还是没有改变，事实是hashagg无法处理count(distinct ...)这类聚合

postgres=# explain analyze select branch_id,count(distinct amount) from t1 group by branch_id;
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=10001326385.18..10001401396.05 rows=1000 width=16) (actual time=2581.177..7953.331 rows=1000 loops=1)
   Group Key: branch_id
   ->  Sort  (cost=10001326385.18..10001351385.47 rows=10000115 width=19) (actual time=2574.178..4126.988 rows=10000000 loops=1)
         Sort Key: branch_id
         Sort Method: quicksort  Memory: 1174466kB
         ->  Seq Scan on t1  (cost=0.00..163696.15 rows=10000115 width=19) (actual time=0.015..813.226 rows=10000000 loops=1)
 Planning time: 0.082 ms
 Execution time: 8034.338 ms

修改语句

 select branch_id,count(*) from (select branch_id,amount from t1 group by branch_id,amount)_ group by branch_id;

postgres=# explain analyze select branch_id,count(*) from (select branch_id,amount from t1 group by branch_id,amount)_ group by branch_id;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=458272.50..458274.50 rows=200 width=16) (actual time=9544.638..9544.763 rows=1000 loops=1)
   Group Key: t1.branch_id
   ->  HashAggregate  (cost=213696.73..311527.04 rows=9783031 width=19) (actual time=5200.136..7961.864 rows=9999971 loops=1)
         Group Key: t1.branch_id, t1.amount
         ->  Seq Scan on t1  (cost=0.00..163696.15 rows=10000115 width=19) (actual time=0.022..890.532 rows=10000000 loops=1)
 Planning time: 0.197 ms
 Execution time: 9810.855 ms

走了hashagg之后时间并没有减少，why？根据上面的理论，时间应该减少才对，仔细看执行计划，时间上改写之后的语句进行了两次的hashagg，每次时间在4s左右。而且第一次hashagg之后，由于amount这个值的选择度太高，导致后面一次的hashagg数据量还是和前面一样，所以时间也用了4s左右。所以这种改写方法适合在amount选择度较低的时候，减少第二次的hashagg的时间。如果值选择性很高，原语句和耗时数一个sort+groupagg,改写之后的语句是2*hashagg。下面我们把amount的选择度降低，再来看一下效果。

update t1 set amount = branch_id;

postgres=# explain analyze select branch_id,count(distinct amount) from t1 group by branch_id;
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=10002533944.49..10002672603.94 rows=1000 width=16) (actual time=3143.167..7858.782 rows=1000 loops=1)
   Group Key: branch_id
   ->  Sort  (cost=10002533944.49..10002580160.97 rows=18486593 width=19) (actual time=3137.152..4561.355 rows=10000000 loops=1)
         Sort Key: branch_id
         Sort Method: quicksort  Memory: 861967kB
         ->  Seq Scan on t1  (cost=0.00..302614.93 rows=18486593 width=19) (actual time=613.417..1537.574 rows=10000000 loops=1)
 Planning time: 0.134 ms
 Execution time: 7925.181 ms
(8 rows)

postgres=# explain analyze select branch_id,count(*) from (select branch_id,amount from t1 group by branch_id,amount)_ group by branch_id;
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=847179.99..847181.99 rows=200 width=16) (actual time=3987.460..3987.611 rows=1000 loops=1)
   Group Key: t1.branch_id
   ->  HashAggregate  (cost=395047.90..575900.73 rows=18085284 width=19) (actual time=3821.612..3986.812 rows=1000 loops=1)
         Group Key: t1.branch_id, t1.amount
         ->  Seq Scan on t1  (cost=0.00..302614.93 rows=18486593 width=19) (actual time=73.574..919.854 rows=10000000 loops=1)
 Planning time: 0.155 ms
 Execution time: 4438.999 ms
(7 rows)

可以看到amount值选择度变低之后，第一次hashagg的时间还是4s左右，但是第二次hashagg的时间明显减少了，所有改写之后的语句耗时要比原语句少

利用pg10并行，原语句是无法使用并行的，修改后的语句可以走并行，这里我们把并行度设置为4，开了并行之后默认是走sort+groupagg的执行计划

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=408450.41..937838.27 rows=200 width=16) (actual time=831.472..1365.904 rows=1000 loops=1)
   Group Key: t1.branch_id
   ->  Group  (cost=408450.41..922858.98 rows=998486 width=12) (actual time=830.731..1365.377 rows=1000 loops=1)
         Group Key: t1.branch_id, t1.amount
         ->  Gather Merge  (cost=408450.41..902889.26 rows=3993944 width=12) (actual time=830.730..1364.782 rows=5000 loops=1)
               Workers Planned: 4
               Workers Launched: 4
               ->  Group  (cost=407950.35..426671.97 rows=998486 width=12) (actual time=794.142..1317.058 rows=1000 loops=5)
                     Group Key: t1.branch_id, t1.amount
                     ->  Sort  (cost=407950.35..414190.89 rows=2496215 width=12) (actual time=794.139..1002.323 rows=2000000 loops=5)
                           Sort Key: t1.branch_id, t1.amount
                           Sort Method: quicksort  Memory: 197526kB
                           ->  Parallel Seq Scan on t1  (cost=0.00..142711.15 rows=2496215 width=12) (actual time=35.558..190.117 rows=2000000 loops=5)
 Planning time: 0.119 ms
 Execution time: 1379.119 ms

从现在的现象看来，hashagg和groupagg的效率还是和许多因素相关的，一般情况下hashagg的效率要更高。在改写count（distinct ...）这类语句是要看值的选择性是不是很高。

参考链接1
参看链接2

HashAggregate VS sort+GroupAgg

什么是聚合

聚合就是把元素按照一定的规则分为不同的组，然后对各组元素进行计算。

postgresql中聚合算法

对比两种算法

GroupAggregate

HashAggregate

对比两种算法速率

原理解析

应用

热门文章

最新文章

相关电子书

探索云世界

热门

云计算

大数据

云原生

人工智能

数据库

开发与运维

活动广场

任务中心

训练营

直播

乘风者计划

下载

镜像站

技术资料

HashAggregate VS sort+GroupAgg

什么是聚合

聚合就是把元素按照一定的规则分为不同的组，然后对各组元素进行计算。

postgresql中聚合算法

对比两种算法

GroupAggregate

HashAggregate

对比两种算法速率

原理解析

应用

热门文章

最新文章

相关电子书