Re: Extremely slow intarray index creation and inserts.

From: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, pgsql-performance(at)postgreSQL(dot)org
Subject: Re: Extremely slow intarray index creation and inserts.
Date: 2009-03-18 16:12:29
Message-ID: 49C11D6D.8020604@cheapcomplexdevices.com
Lists: pgsql-performance

Tom Lane wrote:
> Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com> writes:
>> vm=# create index "gist70000" on tmp_intarray_test using GIST (my_int_array gist__int_ops);
>> CREATE INDEX
>> Time: 2069836.856 ms
>
>> Is that expected, or does it sound like a bug to take over
>> half an hour to index 70000 rows of mostly 5 and 6-element
>> integer arrays?
>
> I poked at this example with oprofile. It's entirely CPU-bound AFAICT,

Oleg pointed out to me (off-list, I now see) that this is not totally
unexpected behavior and that I should have been using gist__intbig_ops:
the "big" refers to the cardinality of the entire set (which was large
in my case), not to the length of the individual arrays.

Oleg Bartunov wrote:
OB:> it's not about short or long arrays, it's about small or big
OB:> cardinality of the whole set (the number of unique elements)
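
For anyone else hitting this, here is a minimal sketch of how one might
check the cardinality Oleg describes and then switch operator classes.
It reuses the table and column from my example above; the index name
gist70000_big is made up for illustration, and unnest() assumes 8.4 or
later (on older releases the distinct count would need another approach).

```sql
-- Count the distinct integers across all rows: this is the
-- "cardinality of the whole set", not the length of any one array.
SELECT count(DISTINCT elem) AS distinct_elements
  FROM (SELECT unnest(my_int_array) AS elem
          FROM tmp_intarray_test) AS s;

-- If that number is large, build the index with the signature-based
-- operator class instead of the default gist__int_ops.
-- (gist70000_big is a hypothetical index name.)
CREATE INDEX gist70000_big ON tmp_intarray_test
  USING gist (my_int_array gist__intbig_ops);
```

I don't know of a documented threshold for "large", so treat the count
as a rough guide and time both operator classes on a sample.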

I re-read the docs and this still wasn't obvious to me. A
potential docs patch is included below.

> and the CPU utilization is approximately
>
> 55% g_int_compress
> 35% memmove/memcpy (difficult to distinguish these)
> 1% pg_qsort
> <1% anything else
>
> Probably need to look at reducing the number of calls to g_int_compress
> ... it must be getting called a whole lot more than once per new index
> entry, and I wonder why that should need to be.

That may be a separate issue; for the time being,
gist__intbig_ops is working fine for us.

Here's a proposed docs patch that makes this more obvious.

*** a/doc/src/sgml/intarray.sgml
--- b/doc/src/sgml/intarray.sgml
***************
*** 239,245 ****
  <literal>gist__int_ops</> (used by default) is suitable for
  small and medium-size arrays, while
  <literal>gist__intbig_ops</> uses a larger signature and is more
! suitable for indexing large arrays.
  </para>
  
  <para>
--- 239,247 ----
  <literal>gist__int_ops</> (used by default) is suitable for
  small and medium-size arrays, while
  <literal>gist__intbig_ops</> uses a larger signature and is more
! suitable for indexing high-cardinality data sets - where there
! are a large number of unique elements across all rows being
! indexed.
  </para>
  
  <para>
