Re: IN() statement values order makes 2x performance hit

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Alexey Kupershtokh <alexey(dot)kupershtokh(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: IN() statement values order makes 2x performance hit
Date: 2008-05-29 09:39:57
Message-ID: Pine.LNX.4.64.0805291339120.21547@sn.sai.msu.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

You may try contrib/intarray, which we developed specially for
denormalization.

Oleg
On Thu, 29 May 2008, Alexey Kupershtokh wrote:

> Hello everybody!
>
> I have found a performance issue with 2 equivalent queries stably taking
> different (~x2) time to finish. In just a few words it can be described
> like this: if you have a lot of values in an IN() statement, you should
> put most heavy (specifying most number of rows) ids first.
> This is mostly just a bug submit, than looking for help.
>
> So this is what I have:
> * RHEL
> * PostgreSQL 8.3.1
> * A table ext_feeder_item with ~4.6M records.
> kia=# \d+ ext_feeder_item;
> Table "public.ext_feeder_item"
> Column | Type | Modifiers | Description
> ----------+--------------------------+------------------------------------------
> --------------------+-------------
> id | bigint | not null default
> nextval('ext_feeder_item_id_seq'::regclass) |
> feed_id | bigint | not null |
> pub_date | timestamp with time zone | |
> Indexes:
> "ext_feeder_item_pkey" PRIMARY KEY, btree (id)
> "ext_feeder_item_feed_id_pub_date_idx" btree (feed_id, pub_date)
> "ext_feeder_item_idx" btree (feed_id)
> Triggers:
> ....
> Has OIDs: no
> * Statistics for the fields feed_id and pub_date are set to 1000;
> * The table have just been vacuumed and analyzed.
> * A simple query to the table:
> SELECT
> id
> FROM
> ext_feeder_item AS i
> WHERE
> i.feed_id IN (...)
> ORDER BY pub_date DESC, id DESC
> LIMIT 11 OFFSET 0;
>
> with many (~1200) ids in the IN() statement.
> * The count of rows distribution for these ids (may be thought of as
> foreign keys in this table) is the following:
> id = 54461: ~180000 - actually the most heavy id in the whole table.
> other ids: a single id at most specifies 2032 rows; 6036 rows total.
> * If I perform a query with
> IN(54461, ...)
> it stably (5 attempts) takes 4.5..4.7 secs. to perform.
> QUERY PLAN
> Limit  (cost=1463104.22..1463104.25 rows=11 width=16) (actual
> time=4585.420..4585.452 rows=11 loops=1)
>   ->  Sort  (cost=1463104.22..1464647.29 rows=617228 width=16)
> (actual time=4585.415..4585.425 rows=11 loops=1)
>         Sort Key: pub_date, id
>         Sort Method:  top-N heapsort  Memory: 17kB
>         ->  Bitmap Heap Scan on ext_feeder_item i
> (cost=13832.40..1449341.79 rows=617228 width=16) (actual
> time=894.622..4260.441 rows=185625 loops=1)
>               Recheck Cond: (feed_id = ANY ('{54461,
> ...}'::bigint[]))
>               ->  Bitmap Index Scan on ext_feeder_item_idx
> (cost=0.00..13678.10 rows=617228 width=0) (actual
> time=884.686..884.686 rows=185625 loops=1)
>                     Index Cond: (feed_id = ANY ('{54461,
> ...}'::bigint[]))
> Total runtime: 4585.852 ms
> * If I perform a query with
> IN(..., 54461)
> it stably (5 attempts) takes 9.3..9.5 secs. to perform.
> QUERY PLAN
> Limit  (cost=1463104.22..1463104.25 rows=11 width=16) (actual
> time=9330.267..9330.298 rows=11 loops=1)
>   ->  Sort  (cost=1463104.22..1464647.29 rows=617228 width=16)
> (actual time=9330.263..9330.273 rows=11 loops=1)
>         Sort Key: pub_date, id
>         Sort Method:  top-N heapsort  Memory: 17kB
>         ->  Bitmap Heap Scan on ext_feeder_item i
> (cost=13832.40..1449341.79 rows=617228 width=16) (actual
> time=1018.401..8971.029 rows=185625 loops=1)
>               Recheck Cond: (feed_id = ANY ('{...
> ,54461}'::bigint[]))
>               ->  Bitmap Index Scan on ext_feeder_item_idx
> (cost=0.00..13678.10 rows=617228 width=0) (actual
> time=1008.791..1008.791 rows=185625 loops=1)
>                     Index Cond: (feed_id = ANY ('{...
> ,54461}'::bigint[]))
> Total runtime: 9330.729 ms
> I don't know what are the roots of the problem, but I think that some
> symptomatic healing could be applied: the PostgreSQL could sort the IDs
> due to the statistics.
> So currently I tend to select the IDs from another table ordering them
> due to their weights: it's easy for me thanks to denormalization.
>
> Also I would expect from PostgreSQL that it sorted the values to make
> index scan more sequential, but this expectation already conflicts with
> the bug described above :)
>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Alexey Kupershtokh 2008-05-29 09:52:59 Re: IN() statement values order makes 2x performance hit
Previous Message Alexey Kupershtokh 2008-05-29 09:38:38 IN() statement values order makes 2x performance hit