Re: looking for a faster way to do that

From: hamann(dot)w(at)t-online(dot)de
To: pgsql-general(at)postgresql(dot)org
Subject: Re: looking for a faster way to do that
Date: 2011-09-25 06:04:43
Message-ID: wolfgang-1110925080443.A0225425@amadeus3.local
Lists: pgsql-general

Eduardo Morras wrote:

>> >
>> >Hi,
>> >
>> >if I understand this right, it does not mean "check if the string
>> >appears at position 0"
>> >which could translate into an index query, but rather "check if the
>> >string appears anywhere
>> >and then check if that is position 0", so the entire table is checked.
>>
>> The second one, yes: since it checks all the patterns you want only
>> once per row, it needs only one table scan. The first one removes the
>> substring 'ABC' from the string; if the lengths of both strings are
>> equal, the substring 'ABC' wasn't in it. If they differ, the trimmed
>> string will be shorter.
>>
>> >explain analyze select items.num, wantcode from items, n where
>> >strpos(code, wantcode) = 0;
>> > Nested Loop (cost=167.14..196066.54 rows=39178 width=36) (actual
>> > time=0.074..36639.312 rows=7832539 loops=1)
>> > Join Filter: (strpos(("inner".code)::text, "outer".wantcode) = 0)
>> > -> Seq Scan on n (cost=0.00..14.15 rows=815 width=32) (actual
>> > time=0.005..2.212 rows=815 loops=1)
>> > -> Materialize (cost=167.14..263.28 rows=9614 width=42)
>> > (actual time=0.007..13.970 rows=9614 loops=815)
>> > -> Seq Scan on items (cost=0.00..167.14 rows=9614
>> > width=42) (actual time=0.044..14.855 rows=9614 loops=1)
>> > Total runtime: 46229.836 ms
>> >
>> >
>> >The query ran much faster than the pattern query, however. This
>> >seems to be the performance
>> >of just searching for a plain string vs. initializing the regex
>> >engine every time (for 815
>> >queries in a test set)
>>
>> It will do only one table scan, while your original code does one
>> for each substring you want to test, so you can add more and more
>> substrings without much extra cost. If you want to use the regex
>> engine instead of the PostgreSQL string functions, check
>> regexp_matches(); it should be faster if you have 3000 substrings.
>>
>> select * from items where regexp_matches(items.code,
>> '(ABC)|(DE1)|(any_substring)') <> '{}';
>>

Hi Eduardo,

it is clear that scanning the table once with a list of matches will outperform
rescanning the table for every string wanted. Now, my problem is that the patterns
are dynamic as well. So if I could translate a table with one column and a few
thousand rows into something like
regexp_matches(code,'string1|string2|.....string2781')
that would ideally give a performant query. Unfortunately I have no idea how I
could achieve this transformation inside the database. Doing it externally fails,
because a single query is limited to only so many characters.
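One possible sketch of that transformation done entirely inside the database,
assuming PostgreSQL 9.0+ for string_agg() (on older versions,
array_to_string(array_agg(wantcode), '|') is an alternative), and assuming the
pattern table is n(wantcode) as in the plan above; note that any regex
metacharacters in the stored substrings would still need escaping first:

```sql
-- Sketch: build the alternation pattern from the pattern table in one pass.
-- Assumes table n(wantcode) holds the substrings to search for; entries
-- containing regex metacharacters (., |, (, etc.) would need escaping.
SELECT items.num, items.code
FROM items
WHERE items.code ~ (SELECT string_agg(wantcode, '|') FROM n);
```

The subselect is evaluated once, so the table of patterns is collapsed into a
single regex and items is scanned only once.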

Regards
Wolfgang Hamann
