Re: looking for a faster way to do that

From: hamann(dot)w(at)t-online(dot)de
To: pgsql-general(at)postgresql(dot)org
Subject: Re: looking for a faster way to do that
Date: 2011-09-23 07:45:51
Message-ID: 4E7C392F.mailGNP11KAXQ@amadeus3.local
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Alban Hertroys <haramrae(at)gmail(dot)com> wrote:

>> What is the output of explain?
>>
>> You say 'the other table', so presumably we're dealing with a foreign key
>> here. Is there an index on that column?

Albe Laurenz wrote:

>> Is the index used for "where code ~ '^ABC3563'"?
>>
>> If not, then the result is fast only because the table is scanned only once,
>> and it's just the factor of 3000 that's killing you.
>>
>> The second query (where code ~ wantcode) can never use an index because
>> the pattern "wantcode" is unknown at query planning time.
>>
>> Yours,
>> Laurenz Albe

Here I created a subset (just number and code matching a certain prefix)

\d items
Table "pg_temp_1.items"
Column | Type | Modifiers
--------+-----------------------+-----------
num | integer |
code | character varying(40) |
create index itemsc on items (code);

select count(*) from items;
count
-------
9614

A single anchored query
select * from items where code ~ '^ABC';
does indeed use the index to retrieve data.

Next I copied a file of wanted codes

create temp table n (wantcode text);
\copy n from /tmp/rmartin.tmp

the file contains plain names, i.e. unanchored matches

explain analyze select num, n.wantcode from items, n where items.code ~ n.wantcode;
Nested Loop (cost=20.00..216502.14 rows=48070 width=36) (actual time=148.479..336280.488 rows=2871 loops=1)
Join Filter: (("outer".code)::text ~ "inner".wantcode)
-> Seq Scan on items (cost=0.00..167.14 rows=9614 width=42) (actual time=0.048..38.666 rows=9614 loops=1)
-> Materialize (cost=20.00..30.00 rows=1000 width=32) (actual time=0.001..1.049 rows=815 loops=9614)
-> Seq Scan on n (cost=0.00..20.00 rows=1000 width=32) (actual time=0.003..1.839 rows=815 loops=1)
Total runtime: 336286.692 ms

An exact match "where items.code = n.wantcode" on the same data completes in 40 ms

BTW: indexing the second table does not affect the query plan or the runtime, it just shows
actual row count rather than estimate.

This is, of course, bad; an anchored match could be faster and also is more appropriate
to the scenario. So I change the contents of the second table

update n set wantcode = textcat('^', wantcode);

and try again, with similar results
Nested Loop (cost=14.15..176478.01 rows=39178 width=36) (actual time=125.114..308831.697 rows=2871 loops=1)
Join Filter: (("outer".code)::text ~ "inner".wantcode)
-> Seq Scan on items (cost=0.00..167.14 rows=9614 width=42) (actual time=0.061..2034.572 rows=9614 loops=1)
-> Materialize (cost=14.15..22.30 rows=815 width=32) (actual time=0.001..1.095 rows=815 loops=9614)
-> Seq Scan on n (cost=0.00..14.15 rows=815 width=32) (actual time=0.114..1.893 rows=815 loops=1)
Total runtime: 308837.746 ms

I am aware that this is unlikely to work fast (the planner would perhaps need a hint in the query
rather than in the data column to choose an anchored match algorithm (in case there is
such an algo, of course)

So I wonder whether there might be a different approach to this problem rather than
pattern matching.
I recall I had a similar problem before with a "contacts" column possibly containing one or more
email addresses. Here searches would also be number of people times number of requests
performance. I finally ended up with a @@ match (contrib/tsquery) and a supporting GIST index,
but that only supports exact match, not prefix

Regards
Wolfgang Hamann

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Andrew Rose 2011-09-23 09:47:15 Relative performance of prefix and suffix string matching
Previous Message Mike Christensen 2011-09-23 07:03:20 Re: Materialized views in Oracle