PostgreSQL 10.0 preview feature enhancement - WAL support for hash indexes (crash/disaster recovery)


Tags

PostgreSQL , 10.0 , hash index , wal , disaster recovery


Background

PostgreSQL 10.0 will support WAL logging for hash indexes. With this change, you no longer have to worry about a hash index being corrupted by a database crash or being unusable on a standby.
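For example (a minimal sketch; the table and index names are made up for illustration), once hash indexes are WAL-logged they survive a crash and can be used on a hot standby:

```sql
-- On the primary: hash indexes are now WAL-logged, so they are crash-safe
-- and are replicated to standbys like any other index.
create table t_hash (id int, info text);
insert into t_hash select generate_series(1, 100000), md5(random()::text);
create index idx_t_hash_id on t_hash using hash (id);

-- On a hot standby (or after crash recovery) the index can be used;
-- before this change a hash index could not be trusted after a crash
-- or on a standby and had to be rebuilt with REINDEX.
explain (costs off) select * from t_hash where id = 12345;
```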

$SUBJECT will make hash indexes reliable and usable on standby.  
AFAIU, currently hash indexes are not recommended to be used in  
production mainly because they are not crash-safe and with this patch,  
I hope we can address that limitation and recommend them for use in  
production.  

This patch is built on my earlier patch [1] that makes hash indexes
concurrent.  The main reason for doing so is that the earlier patch
allows the split operation to be completed and uses light-weight locking,
due to which operations can be logged at a granular level.

WAL for different operations:  

This has been explained in the README as well, but I am writing it
again here for convenience.
=================================  

Multiple WAL records are written for the create index operation:
first one for initializing the metapage, followed by one for each new
bucket created during the operation, followed by one for initializing the
bitmap page.  If the system crashes partway through, the whole
operation is rolled back.  I considered writing a single WAL
record for the whole operation, but for that we would need to limit the
number of initial buckets that can be created during the operation.  As
we can log only a fixed number of pages, XLR_MAX_BLOCK_ID (32), with the
current XLog machinery, it is better to write multiple WAL records for
this operation.  The downside of restricting the number of buckets is
that we would need to perform split operations if the number of tuples is
more than what can be accommodated in the initial set of buckets, and it is
not unusual to have a large number of tuples during a create index
operation.
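A rough way to see the WAL produced by a hash-index build (per the description above: one record for the metapage, one per initial bucket, one for the bitmap page) is to measure the WAL generated across the command. This is a sketch using psql variables; the function names are those introduced in PostgreSQL 10:

```sql
-- Measure how much WAL a hash-index build generates (run in psql).
select pg_current_wal_lsn() as lsn_before \gset
create index idx_t_hash_info on t_hash using hash (info);
select pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), :'lsn_before')) as wal_generated;
```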

Ordinary item insertions (those that don't force a page split or need a new
overflow page) are single WAL entries.  They touch a single bucket
page and the metapage; the metapage is updated during replay just as it is
updated during the original operation.
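In other words, an insert that fits in the existing bucket page is cheap from a WAL point of view, for example:

```sql
-- A single-row insert that causes no split and no new overflow page is
-- logged as a single hash WAL entry covering the bucket page and the metapage.
insert into t_hash values (100001, 'single insert, single WAL entry');
```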

An insertion that causes the addition of an overflow page is logged as
a single WAL entry, preceded by a WAL entry for the new overflow page
required to insert the tuple.  There is a corner case where, by the time
we try to use the newly allocated overflow page, it has already been used by
concurrent insertions; in that case another overflow page is
allocated and a separate WAL entry is made for it.
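Overflow pages are needed when a single bucket cannot hold all the tuples that hash to it, which is easy to trigger with heavily duplicated keys (a sketch for illustration only, reusing the table created above):

```sql
-- Many tuples with the same key all land in the same bucket, forcing the
-- allocation of overflow pages; each allocation is WAL-logged as described above.
insert into t_hash select 42, md5(random()::text) from generate_series(1, 50000);
```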

An insertion that causes a bucket split is logged as a single WAL
entry, followed by a WAL entry for allocating a new bucket, followed
by a WAL entry for each overflow bucket page in the new bucket to
which tuples are moved from the old bucket, followed by a WAL entry to
indicate that the split is complete for both the old and new buckets.
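Bucket splits happen automatically as the index grows. With the pgstattuple extension's pgstathashindex() function (added in the same release cycle; shown here purely as an illustration) you can watch the bucket count increase:

```sql
create extension if not exists pgstattuple;

-- Bucket and overflow page counts before and after loading more rows.
select bucket_pages, overflow_pages from pgstathashindex('idx_t_hash_id');
insert into t_hash select generate_series(200001, 1000000), md5(random()::text);
select bucket_pages, overflow_pages from pgstathashindex('idx_t_hash_id');
```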

A split operation that requires overflow pages to complete will need to
write a WAL record for each new allocation of an overflow page.  As
splitting involves multiple atomic actions, it is possible that the system
crashes while moving tuples from the old bucket's pages to the new bucket.
After recovery, both the old and new buckets will be marked with the
incomplete-split flag.  The reader algorithm works correctly, as it will
scan both the old and new buckets.

We finish the split at the next insert or split operation on the old bucket.
It could be done during searches too, but it seems best not to put
any extra updates in what would otherwise be a read-only operation
(updating is not possible in hot standby mode anyway).  It would seem
natural to complete the split in VACUUM, but since splitting a bucket
might require allocating a new page, it might fail if you run out of
disk space.  That would be bad during VACUUM - the reason for running
VACUUM in the first place might be that you ran out of disk space, and
now VACUUM won't finish because you're out of disk space.  In
contrast, an insertion can require enlarging the physical file anyway.

Deletion of tuples from a bucket is performed for two reasons: one is
removing dead tuples, and the other is removing tuples that were
moved by a split.  A WAL entry is made for each bucket page from which
tuples are removed, followed by a WAL entry to clear the garbage flag
if the tuples moved by a split are being removed.  Another separate WAL
entry is made for updating the metapage if the deletion is performed to
remove dead tuples during VACUUM.
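For example, a plain VACUUM drives the dead-tuple removal path described above (sketch, continuing the example table):

```sql
-- Delete some rows, then VACUUM; dead index entries are removed bucket page
-- by bucket page, with a WAL entry per cleaned page and a final WAL entry
-- for the metapage update.
delete from t_hash where id between 1 and 10000;
vacuum t_hash;
```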

As deletion involves multiple atomic operations, it is quite possible
that the system crashes (a) after removing tuples from some of the bucket
pages, (b) before clearing the garbage flag, or (c) before updating the
metapage.  If the system crashes before completing (b), after recovery it
will try to clean the bucket again during the next vacuum or insert,
which can have some performance impact, but it will work fine.  If the
system crashes before completing (c), after recovery there could be
some additional splits until the next vacuum updates the metapage, but
the other operations like insert, delete and scan will work correctly.
We could fix this problem by actually updating the metapage based on the
delete operation during replay, but I am not sure if it is worth the
complication.

The squeeze operation moves tuples from bucket pages later in the
chain to bucket pages earlier in the chain, and writes a WAL record when
either the page to which it is writing tuples becomes full or the page
from which it is removing tuples becomes empty.

As the squeeze operation involves multiple atomic operations, it
is quite possible that the system crashes before completing the operation
on the entire bucket.  After recovery, the operations will work correctly,
but the index will remain bloated and can impact the performance of read
and insert operations until the next vacuum squeezes the bucket
completely.
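The bloat left behind by an interrupted squeeze can be observed, and then cleaned up by the next VACUUM, using pgstathashindex() from the pgstattuple extension, as in the earlier sketch:

```sql
-- overflow_pages and free_percent give a rough picture of hash-index bloat;
-- a subsequent VACUUM squeezes the buckets again and should improve them.
select overflow_pages, free_percent from pgstathashindex('idx_t_hash_id');
vacuum t_hash;
select overflow_pages, free_percent from pgstathashindex('idx_t_hash_id');
```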

=====================================  


One of the challenges in writing this patch was that the current code
was not written with the mindset that we would need to write WAL for different
operations.  A typical example is _hash_addovflpage(), where pages are
modified across different function calls and all modifications need
to be done atomically, so I had to refactor some code so that the
operations can be logged sensibly.

Thanks to Ashutosh Sharma, who has helped me complete the patch by
writing the WAL code for the create index and delete operations and by doing
detailed testing of the patch using the pg_filedump tool.  I think it is
better if he himself explains the testing he has done to ensure the
correctness of the patch.


Thoughts?  

Note - To use this patch, first apply latest version of concurrent  
hash index patch [2].  

[1] - https://commitfest.postgresql.org/10/647/  
[2] - https://www.postgresql.org/message-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com  

--   
With Regards,  
Amit Kapila.  
EnterpriseDB: http://www.enterprisedb.com  

For the discussion of this patch, see the mailing list thread; the URLs are listed at the end of this article.

The PostgreSQL community is very rigorous: a patch may be discussed on the mailing list for months or even years and is revised repeatedly based on feedback, so by the time it is merged into master it is already very mature. This is one reason PostgreSQL is renowned for its stability.

References

https://commitfest.postgresql.org/13/740/

https://www.postgresql.org/message-id/flat/CAA4eK1JOBX=YU33631Qh-XivYXtPSALh514+jR8XeD7v+K3r_Q@mail.gmail.com/
