这里介绍如何快速做数据批量更新。这里说的更新,又称为“合并”(Merge),就是把数据新版本更新到HybridDB中。如果数据已经存在,则将它们替换为新版本;如果不存在,将它们插入数据库中。一般是离线的做这种数据合并,例如每天一次批量把数据更新到HybridDB中。也有用户需要实时的更新,即做到分钟级甚至秒级延迟。这里我们介绍一下HybridDB中数据合并的方法和背后原理。
简单更新过程
无论怎么做数据合并,都是对数据的修改,即Update、Delete、Insert、Copy等操作。我们先要了解一下HybridDB中的数据更新过程。我们以用户发起一次Update操作为例(对列存表单行记录的更新),整个流程如下图所示。
其中的步骤说明如下:
create table target_table(c1 int, c2 int, primary key (c1));
insert into target_table select generate_series(1, 10000000);
create table source_table(c1 int, c2 int);
insert into source_table select generate_series(1, 100), generate_series(1,100);
set optimizer=on;
update target_table set c2 = source_table.c2 from source_table where target_table.c1= source_table.c1;
=> explain update target_table set c2 = source_table.c2 from source_table where target_table.c1= source_table.c1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Update (cost=0.00..586.10 rows=25 width=1)
-> Result (cost=0.00..581.02 rows=50 width=26)
-> Redistribute Motion 4:4 (slice1; segments: 4) (cost=0.00..581.02 rows=50 width=22)
Hash Key: public.target_table.c1
-> Assert (cost=0.00..581.01 rows=50 width=22)
Assert Cond: NOT public.target_table.c1 IS NULL
-> Split (cost=0.00..581.01 rows=50 width=22)
-> Nested Loop (cost=0.00..581.01 rows=25 width=18)
Join Filter: true
-> Table Scan on source_table (cost=0.00..431.00 rows=25 width=8)
-> Index Scan using target_table_pkey on target_table (cost=0.00..150.01 rows=1 width=14)
Index Cond: public.target_table.c1 = source_table.c1
postgres=> insert into source_table select generate_series(1, 1000), generate_series(1,1000);
INSERT 0 1000
postgres=> analyze source_table;
ANALYZE
postgres=> explain update target_table set c2 = source_table.c2 from source_table where target_table.c1= source_table.c1;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Update (cost=0.00..1485.82 rows=275 width=1)
-> Result (cost=0.00..1429.96 rows=550 width=26)
-> Assert (cost=0.00..1429.94 rows=550 width=22)
Assert Cond: NOT public.target_table.c1 IS NULL
-> Split (cost=0.00..1429.93 rows=550 width=22)
-> Hash Join (cost=0.00..1429.92 rows=275 width=18)
Hash Cond: public.target_table.c1 = source_table.c1
-> Table Scan on target_table (cost=0.00..477.76 rows=2500659 width=14)
-> Hash (cost=431.01..431.01 rows=275 width=8)
-> Table Scan on source_table (cost=0.00..431.01 rows=275 width=8)
delete from target_table using source_table where target_table.c1 = source_table.c1;
explain delete from target_table using source_table where target_table.c1 = source_table.c1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Delete (slice0; segments: 4) (rows=50 width=10)
-> Nested Loop (cost=0.00..41124.40 rows=50 width=10)
-> Seq Scan on source_table (cost=0.00..6.00 rows=50 width=4)
-> Index Scan using target_table_pkey on target_table (cost=0.00..205.58 rows=1 width=14)
Index Cond: target_table.c1 = source_table.c1
set optimizer=on;
delete from target_table using source_table where target_table.c1 = source_table.c1;
insert into target_table select * from source_table;
update target_table set c2 = t.c2 from (values(1,1),(2,2),(3,3),…(2000,2000)) as t(c1,c2) where target_table.c1=t.c1
delete from target_table using (values(1,1),(2,2),(3,3),…(2000,2000)) as t(c1,c2) where target_table.c1 = t.c1
版权声明:本文内容由阿里云实名注册用户自发贡献,版权归原作者所有,阿里云开发者社区不拥有其著作权,亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容,填写侵权投诉表单进行举报,一经查实,本社区将立刻删除涉嫌侵权内容。