Quick Links

Re: IN() statement values order makes 2x performance hit

From:	Alexey Kupershtokh <alexey(dot)kupershtokh(at)gmail(dot)com>
To:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: IN() statement values order makes 2x performance hit
Date:	2008-05-29 09:52:59
Message-ID:	483E7CFB.7010800@gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

Thanks for the response.
I've taken a look at this feature. But it seems unapplicable to my case:
this table is not a many2many relation which seems the most common case
of the intarray usage.
The table just stores an information about items (rss posts): what feeds
(rss) are they from, and their publication date. Id in the table is PK.

Oleg Bartunov wrote:
> You may try contrib/intarray, which we developed specially for
> denormalization.
>
> Oleg
> On Thu, 29 May 2008, Alexey Kupershtokh wrote:
>
>> Hello everybody!
>>
>> I have found a performance issue with 2 equivalent queries stably taking
>> different (~x2) time to finish. In just a few words it can be described
>> like this: if you have a lot of values in an IN() statement, you should
>> put most heavy (specifying most number of rows) ids first.
>> This is mostly just a bug submit, than looking for help.
>>
>> So this is what I have:
>> * RHEL
>> * PostgreSQL 8.3.1
>> * A table ext_feeder_item with ~4.6M records.
>> kia=# \d+ ext_feeder_item;
>> Table "public.ext_feeder_item"
>> Column | Type | Modifiers | Description
>> ----------+--------------------------+------------------------------------------
>>
>> --------------------+-------------
>> id | bigint | not null default
>> nextval('ext_feeder_item_id_seq'::regclass) |
>> feed_id | bigint | not null |
>> pub_date | timestamp with time zone | |
>> Indexes:
>> "ext_feeder_item_pkey" PRIMARY KEY, btree (id)
>> "ext_feeder_item_feed_id_pub_date_idx" btree (feed_id, pub_date)
>> "ext_feeder_item_idx" btree (feed_id)
>> Triggers:
>> ....
>> Has OIDs: no
>> * Statistics for the fields feed_id and pub_date are set to 1000;
>> * The table have just been vacuumed and analyzed.
>> * A simple query to the table:
>> SELECT
>> id
>> FROM
>> ext_feeder_item AS i
>> WHERE
>> i.feed_id IN (...)
>> ORDER BY pub_date DESC, id DESC
>> LIMIT 11 OFFSET 0;
>>
>> with many (~1200) ids in the IN() statement.
>> * The count of rows distribution for these ids (may be thought of as
>> foreign keys in this table) is the following:
>> id = 54461: ~180000 - actually the most heavy id in the whole table.
>> other ids: a single id at most specifies 2032 rows; 6036 rows total.
>> * If I perform a query with
>> IN(54461, ...)
>> it stably (5 attempts) takes 4.5..4.7 secs. to perform.
>> QUERY PLAN
>> Limit (cost=1463104.22..1463104.25 rows=11 width=16) (actual
>> time=4585.420..4585.452 rows=11 loops=1)
>> -> Sort (cost=1463104.22..1464647.29 rows=617228 width=16)
>> (actual time=4585.415..4585.425 rows=11 loops=1)
>> Sort Key: pub_date, id
>> Sort Method: top-N heapsort Memory: 17kB
>> -> Bitmap Heap Scan on ext_feeder_item i
>> (cost=13832.40..1449341.79 rows=617228 width=16) (actual
>> time=894.622..4260.441 rows=185625 loops=1)
>> Recheck Cond: (feed_id = ANY ('{54461,
>> ...}'::bigint[]))
>> -> Bitmap Index Scan on ext_feeder_item_idx
>> (cost=0.00..13678.10 rows=617228 width=0) (actual
>> time=884.686..884.686 rows=185625 loops=1)
>> Index Cond: (feed_id = ANY ('{54461,
>> ...}'::bigint[]))
>> Total runtime: 4585.852 ms
>> * If I perform a query with
>> IN(..., 54461)
>> it stably (5 attempts) takes 9.3..9.5 secs. to perform.
>> QUERY PLAN
>> Limit (cost=1463104.22..1463104.25 rows=11 width=16) (actual
>> time=9330.267..9330.298 rows=11 loops=1)
>> -> Sort (cost=1463104.22..1464647.29 rows=617228 width=16)
>> (actual time=9330.263..9330.273 rows=11 loops=1)
>> Sort Key: pub_date, id
>> Sort Method: top-N heapsort Memory: 17kB
>> -> Bitmap Heap Scan on ext_feeder_item i
>> (cost=13832.40..1449341.79 rows=617228 width=16) (actual
>> time=1018.401..8971.029 rows=185625 loops=1)
>> Recheck Cond: (feed_id = ANY ('{...
>> ,54461}'::bigint[]))
>> -> Bitmap Index Scan on ext_feeder_item_idx
>> (cost=0.00..13678.10 rows=617228 width=0) (actual
>> time=1008.791..1008.791 rows=185625 loops=1)
>> Index Cond: (feed_id = ANY ('{...
>> ,54461}'::bigint[]))
>> Total runtime: 9330.729 ms
>> I don't know what are the roots of the problem, but I think that some
>> symptomatic healing could be applied: the PostgreSQL could sort the IDs
>> due to the statistics.
>> So currently I tend to select the IDs from another table ordering them
>> due to their weights: it's easy for me thanks to denormalization.
>>
>> Also I would expect from PostgreSQL that it sorted the values to make
>> index scan more sequential, but this expectation already conflicts with
>> the bug described above :)
>>
>>
>
> Regards,
> Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru)
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83

In response to

Re: IN() statement values order makes 2x performance hit at 2008-05-29 09:39:57 from Oleg Bartunov

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Joshua D. Drake	2008-05-29 15:45:14	Re: 2GB or not 2GB
Previous Message	Oleg Bartunov	2008-05-29 09:39:57	Re: IN() statement values order makes 2x performance hit