Skip site navigation (1) Skip section navigation (2)

WIP: index support for regexp search

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: WIP: index support for regexp search
Date: 2011-11-22 19:38:37
Message-ID: CAPpHfduD6EGNise5codBz0KcdDahp7--MhFz_JDD_FRPC7-i=A@mail.gmail.com (view raw or flat)
Thread:
Lists: pgsql-hackers
Hackers,

WIP patch with index support for regexp search for pg_trgm contrib is
attached.
In spite of techniques which extracts continuous text parts from regexp,
this patch presents technique of automatum transformation. That allows more
comprehensive trigrams extraction.

A little example of possible perfomance benefit.

test=# explain analyze select * from words where s ~ 'a[bc]+[de]';
                                              QUERY PLAN


----------------------------------------------------------------------------------------------------
---
 Seq Scan on words  (cost=0.00..1703.11 rows=10 width=9) (actual
time=3.092..242.303 rows=662 loops=1)
   Filter: (s ~ 'a[bc]+[de]'::text)
   Rows Removed by Filter: 97907
 Total runtime: 243.213 ms
(4 rows)

test=# explain analyze select * from words where s ~ 'a[bc]+[de]';
                                                         QUERY PLAN


----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on words  (cost=260.08..295.83 rows=10 width=9) (actual
time=4.166..7.506 rows=662 loops=1)
   Recheck Cond: (s ~ 'a[bc]+[de]'::text)
   Rows Removed by Index Recheck: 18
   ->  Bitmap Index Scan on words_trgm_idx  (cost=0.00..260.07 rows=10
width=0) (actual time=4.076..4.076 rows=680 loops=1)
         Index Cond: (s ~ 'a[bc]+[de]'::text)
 Total runtime: 8.424 ms
(6 rows)

Current version of patch have some limitations:
1) Algorithm of logical expression extraction on trigrams have high
computational complexity. So, it can become really slow on regexp with many
branches. Probably, improvements of this algorithm is possible.
2) Surely, no perfomance benefit if no trigrams can be extracted from
regexp. It's inevitably.
3) Currently, only GIN index is supported. There are no serious problems,
GiST code for it just not written yet.
4) It appear to be some kind of problem to extract multibyte encoded
character from pg_wchar. I've posted question about it here:
http://archives.postgresql.org/pgsql-hackers/2011-11/msg01222.php
While I've hardcoded some dirty solution. So
PG_EUC_JP, PG_EUC_CN, PG_EUC_KR, PG_EUC_TW, PG_EUC_JIS_2004 are not
supported yet.

------
With best regards,
Alexander Korotkov.

Attachment: trgm-regexp-0.1.patch.gz
Description: application/x-gzip (6.0 KB)

Responses

pgsql-hackers by date

Next:From: Alvaro HerreraDate: 2011-11-22 19:40:36
Subject: Re: vpath builds and verbose error messages
Previous:From: Peter EisentrautDate: 2011-11-22 19:22:15
Subject: Re: vpath builds and verbose error messages

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group