PostgreSQL 11 preview - 分区表智能并行聚合、分组计算(已类似MPP架构,性能暴增)

本文涉及的产品
RDS MySQL DuckDB 分析主实例,基础系列 4核8GB
RDS AI 助手,专业版
RDS MySQL DuckDB 分析主实例,集群系列 4核8GB
简介:

标签

PostgreSQL , 分区 , 智能聚合 , 智能分组计算 , enable_partition_wise_agg


背景

PostgreSQL 并行计算开始在细节方面进行打磨,例如11已添加了JOIN的分区并行,当两个分区表的分区定义一致时,在分区字段上JOIN就可以用到分区与分区之间直接并行JOIN,而不需要将数据APPEND后在JOIN。

《PostgreSQL 11 preview - 分区表智能并行JOIN (已类似MPP架构,性能暴增)》

现在又一个PATCH要提交了,还是和分区表有关,这次是以分区为单位的聚合、分组计算。

实际上也蛮好理解的,分两种情况,

1、一种是多阶段聚合,原理如下:(常用在MPP数据库,目前PG的单表多阶段并行聚合也是这么做的)

《PostgreSQL 10 自定义并行计算聚合函数的原理与实践 - (含array_agg合并多个数组为单个一元数组的例子)》

2、第二种是智能分区并行聚合,这种情况与第一种不矛盾,但是在非多阶段聚合时也可能能起到优化作用,比如说根据分区字段进行分组聚合,那么分区与分区之间实际上是没有计算集合的重叠的,各自算各自的。

Hi all,  
  
Declarative partitioning is supported in PostgreSQL 10 and work is already  
in  
progress to support partition-wise joins. Here is a proposal for  
partition-wise  
aggregation/grouping.  Our initial performance measurement has shown 7 times  
performance when partitions are on foreign servers and approximately 15%  
when  
partitions are local.  
  
Partition-wise aggregation/grouping computes aggregates for each partition  
separately.  If the group clause contains the partition key, all the rows  
belonging to a given group come from one partition, thus allowing aggregates  
to be computed completely for each partition.  Otherwise, partial aggregates  
computed for each partition are combined across the partitions to produce  
the  
final aggregates. This technique improves performance because:  
i. When partitions are located on foreign server, we can push down the  
aggregate to the foreign server.  
ii. If hash table for each partition fits in memory, but that for the whole  
relation does not, each partition-wise aggregate can use an in-memory hash  
table.  
iii. Aggregation at the level of partitions can exploit properties of  
partitions like indexes, their storage etc.  
  
Attached an experimental patch for the same based on the partition-wise join  
patches posted in [1].  
  
This patch currently implements partition-wise aggregation when group clause  
contains the partitioning key.  A query below, involving a partitioned table  
with 3 partitions containing 1M rows each, producing total 30 groups showed  
15% improvement over non-partition-wise aggregation. Same query showed 7  
times  
improvement when the partitions were located on the foreign servers.  
Patch needs a lot of improvement including:  
1. Support for partial partition-wise aggregation  
2. Estimating number of groups for every partition  
3. Estimating cost of partition-wise aggregation based on sample partitions  
similar to partition-wise join  
and much more.  
  
In order to support partial aggregation on foreign partitions, we need  
support  
to fetch partially aggregated results from the foreign server. That can be  
handled as a separate follow-on patch.  
  
Though is lot of work to be done, I would like to get suggestions/opinions  
from  
hackers.  
  
I would like to thank Ashutosh Bapat for providing a draft patch and helping  
me off-list on this feature while he is busy working on partition-wise join  
feature.  
  
[1]  
https://www.postgresql.org/message-id/CAFjFpRcbY2QN3cfeMTzVEoyF5Lfku-ijyNR%3DPbXj1e%3D9a%3DqMoQ%40mail.gmail.com  
  
Thanks  
  
--   
Jeevan Chalke  
Principal Software Engineer, Product Development  
EnterpriseDB Corporation  
The Enterprise PostgreSQL Company  

这里有两个例子:

开启分区智能聚合,各个分区聚合后再APPEND结果。

Here is the sample plan:  
  
postgres=# set enable_partition_wise_agg to true;  
SET  
postgres=# EXPLAIN ANALYZE SELECT a, count(*) FROM plt1 GROUP BY a;  
                                                  QUERY  
PLAN  
--------------------------------------------------------------------------------------------------------------  
 Append  (cost=5100.00..61518.90 rows=30 width=12) (actual  
time=324.837..944.804 rows=30 loops=1)  
   ->  Foreign Scan  (cost=5100.00..20506.30 rows=10 width=12) (actual  
time=324.837..324.838 rows=10 loops=1)  
         Relations: Aggregate on (public.fplt1_p1 plt1)  
   ->  Foreign Scan  (cost=5100.00..20506.30 rows=10 width=12) (actual  
time=309.954..309.956 rows=10 loops=1)  
         Relations: Aggregate on (public.fplt1_p2 plt1)  
   ->  Foreign Scan  (cost=5100.00..20506.30 rows=10 width=12) (actual  
time=310.002..310.004 rows=10 loops=1)  
         Relations: Aggregate on (public.fplt1_p3 plt1)  
 Planning time: 0.370 ms  
 Execution time: 945.384 ms  
(9 rows)  

关闭分区智能聚合,需要将数据汇聚后,再进行聚合操作。

postgres=# set enable_partition_wise_agg to false;  
SET  
postgres=# EXPLAIN ANALYZE SELECT a, count(*) FROM plt1 GROUP BY a;  
                                                              QUERY  
PLAN  
---------------------------------------------------------------------------------------------------------------------------------------  
 HashAggregate  (cost=121518.01..121518.31 rows=30 width=12) (actual  
time=6498.452..6498.459 rows=30 loops=1)  
   Group Key: plt1.a  
   ->  Append  (cost=0.00..106518.00 rows=3000001 width=4) (actual  
time=0.595..5769.592 rows=3000000 loops=1)  
         ->  Seq Scan on plt1  (cost=0.00..0.00 rows=1 width=4) (actual  
time=0.007..0.007 rows=0 loops=1)  
         ->  Foreign Scan on fplt1_p1  (cost=100.00..35506.00 rows=1000000  
width=4) (actual time=0.587..1844.506 rows=1000000 loops=1)  
         ->  Foreign Scan on fplt1_p2  (cost=100.00..35506.00 rows=1000000  
width=4) (actual time=0.384..1839.633 rows=1000000 loops=1)  
         ->  Foreign Scan on fplt1_p3  (cost=100.00..35506.00 rows=1000000  
width=4) (actual time=0.402..1876.505 rows=1000000 loops=1)  
 Planning time: 0.251 ms  
 Execution time: 6499.018 ms  
(9 rows)  

场景

1、枚举、哈希分区表:

因为在不同分区中,分区字段值非常鲜明分离。可以按分区字段聚合,按分区字段GROUP BY。

2、范围分区表:

某些按分区字段GROUP BY(看GROUP BY的表达式是否有优化空间)

3、继承表,根据子表约束,选择是否可以使用分区智能聚合、GROUP。

参考

https://www.postgresql.org/message-id/flat/CAM2+6=V64_xhstVHie0Rz=KPEQnLJMZt_e314P0jaT_oJ9MR8A@mail.gmail.com#CAM2+6=V64_xhstVHie0Rz=KPEQnLJMZt_e314P0jaT_oJ9MR8A@mail.gmail.com

https://commitfest.postgresql.org/17/1250/

相关实践学习
AnalyticDB PostgreSQL 企业智能数据中台:一站式管理数据服务资产
企业在数据仓库之上可构建丰富的数据服务用以支持数据应用及业务场景;ADB PG推出全新企业智能数据平台,用以帮助用户一站式的管理企业数据服务资产,包括创建, 管理,探索, 监控等; 助力企业在现有平台之上快速构建起数据服务资产体系
目录
相关文章
|
10月前
|
SQL 关系型数据库 PostgreSQL
CTE vs 子查询:深入拆解PostgreSQL复杂SQL的隐藏性能差异
本文深入探讨了PostgreSQL中CTE(公共表表达式)与子查询的选择对SQL性能的影响。通过分析两者底层机制,揭示CTE的物化特性及子查询的优化融合优势,并结合多场景案例对比执行效率。最终给出决策指南,帮助开发者根据数据量、引用次数和复杂度选择最优方案,同时提供高级优化技巧和版本演进建议,助力SQL性能调优。
1118 1
|
SQL 关系型数据库 OLAP
云原生数据仓库AnalyticDB PostgreSQL同一个SQL可以实现向量索引、全文索引GIN、普通索引BTREE混合查询,简化业务实现逻辑、提升查询性能
本文档介绍了如何在AnalyticDB for PostgreSQL中创建表、向量索引及混合检索的实现步骤。主要内容包括:创建`articles`表并设置向量存储格式,创建ANN向量索引,为表增加`username`和`time`列,建立BTREE索引和GIN全文检索索引,并展示了查询结果。参考文档提供了详细的SQL语句和配置说明。
523 2
|
存储 关系型数据库 MySQL
四种数据库对比MySQL、PostgreSQL、ClickHouse、MongoDB——特点、性能、扩展性、安全性、适用场景
四种数据库对比 MySQL、PostgreSQL、ClickHouse、MongoDB——特点、性能、扩展性、安全性、适用场景
|
缓存 关系型数据库 数据库
如何优化 PostgreSQL 数据库性能?
如何优化 PostgreSQL 数据库性能?
766 2
|
缓存 关系型数据库 数据库
PostgreSQL的性能
PostgreSQL的性能
786 2
|
10月前
|
存储 关系型数据库 测试技术
拯救海量数据:PostgreSQL分区表性能优化实战手册(附压测对比)
本文深入解析PostgreSQL分区表的核心原理与优化策略,涵盖性能痛点、实战案例及压测对比。首先阐述分区表作为继承表+路由规则的逻辑封装,分析分区裁剪失效、全局索引膨胀和VACUUM堆积三大性能杀手,并通过电商订单表崩溃事件说明旧分区维护的重要性。接着提出四维设计法优化分区策略,包括时间范围分区黄金法则与自动化维护体系。同时对比局部索引与全局索引性能,展示后者在特定场景下的优势。进一步探讨并行查询优化、冷热数据分层存储及故障复盘,解决分区锁竞争问题。
1348 2
|
关系型数据库 分布式数据库 数据库
|
关系型数据库 分布式数据库 PolarDB
《阿里云产品手册2022-2023 版》——PolarDB for PostgreSQL
《阿里云产品手册2022-2023 版》——PolarDB for PostgreSQL
615 0
|
存储 缓存 关系型数据库
|
存储 SQL 并行计算
PolarDB for PostgreSQL 开源必读手册-开源PolarDB for PostgreSQL架构介绍(中)
PolarDB for PostgreSQL 开源必读手册-开源PolarDB for PostgreSQL架构介绍
786 0

相关产品

  • 云原生数据库 PolarDB
  • 云数据库 RDS PostgreSQL 版
  • 推荐镜像

    更多