Re: WIP: index support for regexp search

From: "Erik Rijkers" <er(at)xs4all(dot)nl>
To: "Alexander Korotkov" <aekorotkov(at)gmail(dot)com>
Cc: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: index support for regexp search
Date: 2012-01-19 23:33:55
Message-ID: 1ce6dc4baa5e8293bd44ee1458b98d2c.squirrel@webmail.xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, January 19, 2012 21:30, Heikki Linnakangas wrote:
> On 22.11.2011 21:38, Alexander Korotkov wrote:
>> WIP patch with index support for regexp search for pg_trgm contrib is
>> attached.
>> In spite of techniques which extracts continuous text parts from regexp,
>> this patch presents technique of automatum transformation. That allows more
>> comprehensive trigrams extraction.
>
> Nice!
>

Yes, wonderful stuff; I tested quite a bit with this patch; FWIW, here's what I found.

The patch yields spectacular speedups with small, simple-enough regexen. But it does not do a
good enough job when guessing where to use the index and where fall back to Seq Scan. This can
lead to (also spectacular) slow-downs, compared to Seq Scan.

I used the following to generate 3 test tables with lines of 80 random chars (just in case it's
handy for others to use):

$ cat create_data.sh
#!/bin/sh

for power in 4 5 6
do
table=azjunk${power}
index=${table}_trgmrgx_txt_01_idx
echo "-- generating table $table with index $index";
time perl -E'
sub ss{ join"",@_[ map{rand @_} 1 .. shift ] };
say(ss(80,"a".."g"," ","h".."m"," ","n".."s"," ","t".."z"))
for 1 .. 1e'"${power};" \
| psql -aqXc "drop table if exists $table;
create table $table(txt text); copy $table from stdin;
-- set session maintenance_work_mem='20GB';
create index $index on $table using gin(txt gin_trgm_ops);
analyze $table;";
done

\dt+ public.azjunk*
List of relations
Schema | Name | Type | Owner | Size | Description
--------+---------+-------+-----------+---------+-------------
public | azjunk4 | table | breinbaas | 1152 kB |
public | azjunk5 | table | breinbaas | 11 MB |
public | azjunk6 | table | breinbaas | 112 MB |
(3 rows)

I guessed that MAX_COLOR_CHARS limits the character class size (to 4, in your patch), is that
true? I can understand you want that value to be low to limit the above risk, but now it reduces
the usability of the feature a bit: one has to split up larger char-classes into several smaller
ones to make a statement use the index: i.e.:
txt ~ 'f[aeio]n' OR txt ~ 'f[uy]n'
instead of
txt ~ 'f[aeiouy]n'

I made compiled instances with larger values for MAX_COLOR_CHARS (6 and 9), and sure enough they
used the index for larger classes such as the above, but of course also got into problems easier
when quantifiers are added (*, +, {n,m}).

A better calculation to decide index-use seems necessary, and ideally one that allows for a larger
MAX_COLOR_CHARS than 4. Especially quantifiers could perhaps be inspected wrt that decision.
IMHO, the functionality would still be very useful when only very simple regexen were considered.

Btw, it seems impossible to Ctrl-C out of a search once it is submitted; I suppose this is
normally necessary for perfomance reasons, but it would be useful te be able to compile a test
version that allows it. I don't know how hard that would be.

There is also a minor bug, I think, when running with 'set enable_seqscan=off' in combination
with a too-large regex:

$ cat fail.sql:

set enable_bitmapscan=on;
set enable_seqscan=on;
explain analyze select txt from azjunk4 where txt ~ 'f[aeio]n'; -- OK
explain analyze select txt from azjunk4 where txt ~ 'f[aeiou]n'; -- OK

set enable_bitmapscan=on;
set enable_seqscan=off;
explain analyze select txt from azjunk4 where txt ~ 'f[aeio]n'; -- OK
explain analyze select txt from azjunk4 where txt ~ 'f[aeiou]n'; -- crashes

$ psql -f fail.sql
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on azjunk4 (cost=52.01..56.02 rows=1 width=81) (actual time=1.011..5.291
rows=131 loops=1)
Recheck Cond: (txt ~ 'f[aeio]n'::text)
-> Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx (cost=0.00..52.01 rows=1 width=0) (actual
time=0.880..0.880 rows=131 loops=1)
Index Cond: (txt ~ 'f[aeio]n'::text)
Total runtime: 5.700 ms
(5 rows)

QUERY PLAN
-------------------------------------------------------------------------------------------------------
Seq Scan on azjunk4 (cost=0.00..268.00 rows=1 width=81) (actual time=1.491..36.049 rows=164
loops=1)
Filter: (txt ~ 'f[aeiou]n'::text)
Rows Removed by Filter: 9836
Total runtime: 36.112 ms
(4 rows)

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on azjunk4 (cost=52.01..56.02 rows=1 width=81) (actual time=0.346..0.927
rows=131 loops=1)
Recheck Cond: (txt ~ 'f[aeio]n'::text)
-> Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx (cost=0.00..52.01 rows=1 width=0) (actual
time=0.316..0.316 rows=131 loops=1)
Index Cond: (txt ~ 'f[aeio]n'::text)
Total runtime: 0.996 ms
(5 rows)

psql:fail.sql:24: connection to server was lost

Sorry: I found the code, esp. the regex engine hard to understand sofar, so all the above are
somewhat inconclusive remarks; I hope nonetheless they are useful.

Thanks,

Erik Rijkers

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-01-19 23:53:42 Re: WIP -- renaming implicit sequences
Previous Message Alvaro Herrera 2012-01-19 23:30:05 Re: WIP -- renaming implicit sequences