PostgreSQL 模糊查询与正则匹配性能差异与SQL优化建议-阿里云开发者社区

PostgreSQL 模糊查询与正则匹配性能差异与SQL优化建议

2018-01-27 5789

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

本文涉及的产品

RDS AI 助手，专业版

RDS Agent（兼容OpenClaw），2核4GB

PolarDB Agent Express，2核4GB

简介：

背景

PostgreSQL通过 pg_trgm插件，可以支持正则表达式、LIKE 前后模糊查询。

(要支持中文的话，必须确保lc_collate和lc_ctype <> C)

《中文模糊查询性能优化 by PostgreSQL trgm》

从语义上来讲，以下两个查询的语义是一样的。

select * from test where col like '%xxxxxx%';   
   
select * from test where col ~ 'xxxxxx';

但是在数据库内部的处理上，使用了不同的处理逻辑，分别对应如下代码：

src/backend/utils/adt/like.c

src/backend/utils/adt/regexp.c

导致性能上有一定的差异。LIKE的性能会好很多。

模糊查询、正则查询优化建议

由于RE的处理逻辑更加复杂，所以建议不需要正则表达式的时候，请使用LIKE，否则才使用正则表达式。

性能对比：

create or replace function gen_hanzi(int) returns text as $$     
declare     
  res text;     
begin     
  if $1 >=1 then     
    select string_agg(chr(19968+(random()*20901)::int), '') into res from generate_series(1,$1);     
    return res;     
  end if;     
  return null;     
end;     
$$ language plpgsql strict;     
   
   
postgres=# create table test(id int, info text);   
CREATE TABLE   
postgres=# insert into test select generate_series(1,100000), gen_hanzi(100);   
INSERT 0 100000   
postgres=# create index idx_test_1 on test using gin (info gin_trgm_ops);   
CREATE INDEX

正则查询写法，虽然用了索引，但是目前对wchar字符处理不够好，扫描了整个gin树。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~ '婐绷乂畳';   
                                                             QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------   
 Bitmap Heap Scan on public.test  (cost=45261409.28..45261421.30 rows=10 width=36) (actual time=583.810..816.503 rows=1 loops=1)   
   Output: id, info   
   Recheck Cond: (test.info ~ '婐绷乂畳'::text)   
   Rows Removed by Index Recheck: 99999   
   Heap Blocks: exact=4167   
   Buffers: shared hit=59783   
   ->  Bitmap Index Scan on idx_test_1  (cost=0.00..45261409.28 rows=10 width=0) (actual time=583.237..583.237 rows=100000 loops=1)   
         Index Cond: (test.info ~ '婐绷乂畳'::text)   
         Buffers: shared hit=55616   
 Planning time: 0.150 ms   
 Execution time: 816.545 ms   
(11 rows)

正则查询写法，对ascii字符效果是杠杠的。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~ '123';   
                                                      QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------   
 Bitmap Heap Scan on public.test  (cost=39.40..2897.60 rows=4000 width=36) (actual time=0.046..0.046 rows=0 loops=1)   
   Output: id, info   
   Recheck Cond: (test.info ~ '123'::text)   
   Buffers: shared hit=4   
   ->  Bitmap Index Scan on idx_test_1  (cost=0.00..38.40 rows=4000 width=0) (actual time=0.043..0.043 rows=0 loops=1)   
         Index Cond: (test.info ~ '123'::text)   
         Buffers: shared hit=4   
 Planning time: 0.146 ms   
 Execution time: 0.072 ms   
(9 rows)

LIKE 写法，不管是ascii字符还是wchar，效果都是杠杠的。

-- wchar   
   
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like '%婐绷乂畳%';   
                                                     QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------   
 Bitmap Heap Scan on public.test  (cost=13.28..25.30 rows=10 width=36) (actual time=0.042..0.042 rows=1 loops=1)   
   Output: id, info   
   Recheck Cond: (test.info ~~ '%婐绷乂畳%'::text)   
   Heap Blocks: exact=1   
   Buffers: shared hit=8   
   ->  Bitmap Index Scan on idx_test_1  (cost=0.00..13.27 rows=10 width=0) (actual time=0.027..0.027 rows=1 loops=1)   
         Index Cond: (test.info ~~ '%婐绷乂畳%'::text)   
         Buffers: shared hit=7   
 Planning time: 0.110 ms   
 Execution time: 0.108 ms   
(10 rows)   
   
-- ascii   
   
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~~ '%123%';   
                                                      QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------   
 Bitmap Heap Scan on public.test  (cost=39.40..2897.60 rows=4000 width=36) (actual time=0.018..0.018 rows=0 loops=1)   
   Output: id, info   
   Recheck Cond: (test.info ~~ '%123%'::text)   
   Buffers: shared hit=4   
   ->  Bitmap Index Scan on idx_test_1  (cost=0.00..38.40 rows=4000 width=0) (actual time=0.015..0.015 rows=0 loops=1)   
         Index Cond: (test.info ~~ '%123%'::text)   
         Buffers: shared hit=4   
 Planning time: 0.091 ms   
 Execution time: 0.046 ms   
(9 rows)

从上面两个测试来看，like和正则表达使用的操作符是不一样的：

                                            List of operators   
   Schema   | Name | Left arg type | Right arg type | Result type |  Function   |       Description          
------------+------+---------------+----------------+-------------+-------------+-------------------------   
 pg_catalog | ~~   | text          | text           | boolean     | textlike    | matches LIKE expression   
 pg_catalog | ~    | text          | text           | boolean     | textregexeq | matches regular expression, case-sensitive

对应的textlike, textregexeq。代码在：

src/backend/utils/adt/like.c

src/backend/utils/adt/regexp.c

目前，建议对前后模糊查询，使用LIKE表达式，或者~~表达式，可以达到最优的查询效果。不要使用正则表达式的写法。

参考

src/backend/utils/adt/like.c

src/backend/utils/adt/regexp.c

PostgreSQL 模糊查询与正则匹配性能差异与SQL优化建议

标签

背景

模糊查询、正则查询优化建议

参考

关系型数据库

热门文章

最新文章

相关产品

相关课程

相关电子书

推荐镜像

PostgreSQL 模糊查询 与 正则匹配 性能差异与SQL优化建议

标签

背景

模糊查询、正则查询 优化建议

参考

关系型数据库

热门文章

最新文章

相关产品

相关课程

相关电子书

推荐镜像

PostgreSQL 模糊查询与正则匹配性能差异与SQL优化建议

模糊查询、正则查询优化建议