Re: looking for a faster way to do that

From: Alban Hertroys <haramrae(at)gmail(dot)com>
To: hamann(dot)w(at)t-online(dot)de
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: looking for a faster way to do that
Date: 2011-09-23 10:39:33
Message-ID: CAF-3MvNz1CSwnPh14grK80pQV=bNhTFya_zt0BnQ5s0um38zag@mail.gmail.com
Lists: pgsql-general

On 23 September 2011 09:45, <hamann(dot)w(at)t-online(dot)de> wrote:

> Alban Hertroys <haramrae(at)gmail(dot)com> wrote:
>
> >> What is the output of explain?
> >>
> >> You say 'the other table', so presumably we're dealing with a foreign
> key
> >> here. Is there an index on that column?
>
> Albe Laurenz wrote:
>
> >> Is the index used for "where code ~ '^ABC3563'"?
> >>
> >> If not, then the result is fast only because the table is scanned only
> once,
> >> and it's just the factor of 3000 that's killing you.
> >>
> >> The second query (where code ~ wantcode) can never use an index because
> >> the pattern "wantcode" is unknown at query planning time.
> >>
> >> Yours,
> >> Laurenz Albe
>
>
> Here I created a subset (just number and code matching a certain prefix)
>
> \d items
> Table "pg_temp_1.items"
> Column | Type | Modifiers
> --------+-----------------------+-----------
> num | integer |
> code | character varying(40) |
> create index itemsc on items (code);
>
> select count(*) from items;
> count
> -------
> 9614
>
> A single anchored query
> select * from items where code ~ '^ABC';
> does indeed use the index to retrieve data.
>
> Next I copied a file of wanted codes
>
> create temp table n (wantcode text);
> \copy n from /tmp/rmartin.tmp
>
> the file contains plain names, i.e. unanchored matches
>
> explain analyze select num, n.wantcode from items, n where items.code ~ n.wantcode;
>
>  Nested Loop  (cost=20.00..216502.14 rows=48070 width=36) (actual time=148.479..336280.488 rows=2871 loops=1)
>    Join Filter: (("outer".code)::text ~ "inner".wantcode)
>    ->  Seq Scan on items  (cost=0.00..167.14 rows=9614 width=42) (actual time=0.048..38.666 rows=9614 loops=1)
>    ->  Materialize  (cost=20.00..30.00 rows=1000 width=32) (actual time=0.001..1.049 rows=815 loops=9614)
>          ->  Seq Scan on n  (cost=0.00..20.00 rows=1000 width=32) (actual time=0.003..1.839 rows=815 loops=1)
>  Total runtime: 336286.692 ms
>

So you're comparing a variable field value to a variable pattern - yeah,
that's going to hurt. There's no way you could index exactly that.

Perhaps there's some way you can transform the problem so that you end up
with something indexable?
For example, if your match patterns follow a predictable structure
themselves, you could add a column to items that holds the longest pattern
that matches each code. Then the query becomes a plain lookup for the
records whose new column equals the pattern you're after, and that column
can be indexed - voila!

Whether something like that is possible depends strongly on what kind of
match patterns you're using, of course.
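
Just to sketch the idea (the column and index names here are made up, and
I'm assuming purely for illustration that your wanted codes are plain
leading-letter prefixes of the item codes):

-- hypothetical: store the prefix part of each code once, and index it
alter table items add column code_prefix text;
update items set code_prefix = substring(code from '^[A-Z]+');
create index items_code_prefix on items (code_prefix);

-- the regex join then becomes an equality join that can use the index
select i.num, n.wantcode
from items i
join n on i.code_prefix = n.wantcode;

Whether an expression like that substring() actually captures your
patterns depends entirely on how the codes are structured, of course.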

> An exact match "where items.code = n.wantcode" on the same data completes
> in 40 ms
>

That's an exact string match; of course that will be fast ;)

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.
