Re: Optimizing query?

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: hamann(dot)w(at)t-online(dot)de
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Optimizing query?
Date: 2013-02-01 05:50:23
Message-ID: CAFj8pRCNHg9-4KKyhkF_X3Z3=svSvzd88RW1MB2a4405QS4vWA@mail.gmail.com
Lists: pgsql-general

Hello

2013/1/31 <hamann(dot)w(at)t-online(dot)de>:
>
> Pavel Stehule wrote:
>
>>> >> Hi,
>>> >>
>>> >> I am trying to match items from 2 tables based on a common string.
>>> >> One is a big table which has one column with entries like XY123, ABC44, etc.
>>> >> The table has an index on that column.
>>> >> The second table is, typically, much smaller
>>> >>
>>> >> select .... from tab1, tab2 where tab1.code = tab2.code;
>>> >>
>>> >> This works fine and fast.
>>> >> Now, as a variant, I have some entries like XY423A, XY423B, GF55A, GF55D in the
>>> >> big table and want them to match XY423, GF55 in the second table
>>> >>
>>> >> Variants I have tried
>>> >>
>>> >> select .... from tab1, tab2 where tab1.code ~ (tab2.code||'($|[A-Z])');
>>> >> select .... from tab1, tab2 where tab1.code ~ ('^'||tab2.code||'($|[A-Z])');
>>> >>
>>> >
>>> > Have you tried the substring function?
>>> >
>>> > select .... from tab1, tab2 where substring(tab1.code from 1 for 5) =
>>> > tab2.code
>>> >
>
> Hi Pavel, it was just by chance that a fixed-size substring would match the
> data at hand. It is more common to have a digit/letter (or vice versa) boundary
> or a hyphen there
>
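For a digit/letter boundary like that, a regexp-based substring could replace the fixed-length one. A sketch, assuming the codes in tab2 are always uppercase letters followed by digits (adjust the pattern if hyphens or other shapes occur):

```sql
-- Extract the leading letters+digits prefix of tab1.code and compare it
-- to tab2.code, instead of cutting at a fixed position.
select t1.*
  from tab1 t1
  join tab2 t2
    on substring(t1.code from '^[A-Z]+[0-9]+') = t2.code;
```

The same expression could back a functional index on tab1, so the join key is not recomputed per row pair.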
>>> >
>>> >> both take an enormous time. In the better case that I can subset (e.g. all candidates in table 2
>>> >> share initial "AX") I get back to manageable times by adding
>>> >>   and tab1.code ~ '^AX'
>>> >> into the recipe. Actual runtime with about a million entries in tab1 and 800 entries in tab2
>>> >> is about 40 seconds.
>>>
>>> any join where result is related to some function result can be very
>>> slow, because estimation will be out and any repeated function
>>> evaluation is just expensive.
>>>
> I see the problem: obviously the ~ operator with a non-constant
> pattern is constantly recompiling the pattern.
>
> I wonder whether it would be possible to invent a prefix-match operator that approaches
> the performance of string equality. I noted in the past (not sure whether anything
> has changed in regex matching) that a constant leading part of a regex would improve
> performance, i.e. allow an index scan to select possible candidates.
>
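The built-in prefix-match operator is LIKE with a trailing '%', which can use a btree index when the index is declared with text_pattern_ops (or the database runs in the C locale). A sketch:

```sql
-- text_pattern_ops makes the btree usable for left-anchored LIKE/regex
-- matches regardless of collation.
create index tab1_code_tpo_idx on tab1 (code text_pattern_ops);

select t1.*
  from tab1 t1
  join tab2 t2
    on t1.code like t2.code || '%';
```

Caveat: the planner can only turn a LIKE into an index range scan when it knows the pattern's prefix, so in a join this tends to help only if it chooses a nested loop with tab2 on the outer side; it is a sketch to try, not a guaranteed win.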
>>> You can try use a functional index.
>>>
>>> create index on tab2 ((substring(tab1.code from 1 for 5))
>>>
>
> What kind of trick is that - mixing two tables into a functional index?

it is not possible - but you can create an auxiliary table and build
indexes over that table
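A sketch of the auxiliary-table idea (table and index names are illustrative): materialize the extracted prefix once, index it, then join against it.

```sql
-- Compute the join key once per tab1 row instead of per row pair.
create table tab1_prefix as
  select code, substring(code from 1 for 5) as prefix
    from tab1;

create index tab1_prefix_idx on tab1_prefix (prefix);

select p.code, t2.*
  from tab1_prefix p
  join tab2 t2 on p.prefix = t2.code;
```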

but maybe https://github.com/dimitri/prefix can help
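The prefix extension provides a prefix_range type with a @> ("contains") operator and GiST index support, which fits this case (matching long codes against a table of prefixes). An untested sketch, following the extension's documented API:

```sql
-- Store the short codes from tab2 as prefix_range and let a GiST index
-- drive the prefix match against the long codes in tab1.
create extension prefix;

alter table tab2 add column code_pr prefix_range;
update tab2 set code_pr = code::prefix_range;
create index tab2_code_pr_idx on tab2 using gist (code_pr);

select t1.*
  from tab1 t1
  join tab2 t2 on t2.code_pr @> t1.code;
```

Note this matches any continuation (XY423B but also XY4239), so the digit/letter boundary rule mentioned above would still need a separate check.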

Regards

Pavel

> What would the exact syntax be for that?
>
> Regards
> Wolfgang Hamann
>
>
>
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
