Re: Collation-aware comparisons in GIN opclasses

From: Oleg Bartunov <obartunov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Collation-aware comparisons in GIN opclasses
Date: 2014-09-29 08:08:35
Message-ID: CAF4Au4zYsJUmqH=zSiBrUuN2HTBr8Yhq8a8T+T707a0E0CPCmw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Sep 29, 2014 at 11:48 AM, Heikki Linnakangas <
hlinnakangas(at)vmware(dot)com> wrote:

> On 09/15/2014 06:28 PM, Alexander Korotkov wrote:
>
>> Hackers,
>>
>> some GIN opclasses uses collation-aware comparisons while they don't need
>> to do especially collation-aware comparison. Examples are text[] and
>> hstore
>> opclasses.
>>
>
> Hmm. It would be nice to use the index for inequality searches, at least
> on text[]. We don't support that currently, but it would require
> collation-awareness.
>
> Depending on collation this may make them a much slower.
>>
>> See example.
>>
>> # show lc_collate ;
>> lc_collate
>> ─────────────
>> ru_RU.UTF-8
>> (1 row)
>>
>> # create table test as (select array_agg(i::text) from
>> generate_series(1,1000000) i group by (i-1)/10);
>> SELECT 100000
>>
>> # create index test_idx on test using gin(array_agg);
>> CREATE INDEX
>> Time: *26930,423 ms*
>>
>> # create index test_idx2 on test using gin(array_agg collate "C");
>> CREATE INDEX
>> Time: *5143,682 ms*
>>
>> Index creation with collation "ru_RU.UTF-8" is 5 times slower while
>> collation has absolutely no effect on index functionality.
>>
>
> It occurs to me that practically all of those comparisons happen when we
> populate the red-black Tree, during the index build. The purpose of the
> red-black tree is to collect identical keys together, but there is actually
> no requirement that the order of the red-black tree matches the order of
> the index. It also isn't strictly required that it recognizes equal keys as
> equal. The only requirement is that it doesn't incorrectly put two keys
> that are equal according to the compare-function, into two different nodes.
>
>
Good point, Heikki. I experienced several times this problem, fixed it with
C-locale and forgot again. Now, it's time to fix !

> We could therefore use plain memcmp() to compare the Datums while building
> the red-black tree. Keys that are bit-wise equal are surely considered as
> equal by the compare-function. That makes the index build a lot faster.
> With the attached quick patch:
>
> postgres=# create index test_idx on test using gin(array_agg );
> CREATE INDEX
> Time: 880.620 ms
>
> This is on my laptop. Without the patch, that takes about 4.7 seconds with
> the C locale, so this is much faster than even using the C locale.
>

Hmm, on my MBA I got
17277.734 (patch) vs 39151.562 for ru_RU.UTF-8 and
6131.929 (patch) vs 6131.929 for C

Not much :(

>
> - Heikki
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2014-09-29 08:10:26 Re: INSERT ... ON CONFLICT {UPDATE | IGNORE}
Previous Message Heikki Linnakangas 2014-09-29 07:48:43 Re: Collation-aware comparisons in GIN opclasses