IN() statement values order makes 2x performance hit

From: Alexey Kupershtokh <alexey(dot)kupershtokh(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: IN() statement values order makes 2x performance hit
Date: 2008-05-29 09:38:38
Message-ID: 483E799E.1070404@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
</head>
<body bgcolor="#ffffff" text="#000000">
Hello everybody!<br>
<br>
I have found a performance issue with 2 equivalent queries stably
taking different (~x2) time to finish. In just a few words it can be
described like this: if you have a lot of values in an IN() statement,
you should put most heavy (specifying most number of rows) ids first.<br>
This is mostly just a bug submit, than looking for help.<br>
<br>
So this is what I have:
<ul>
<li> RHEL</li>
<li> PostgreSQL 8.3.1</li>
<li> A table ext_feeder_item with ~4.6M records.<br>
<tt>kia=# \d+ ext_feeder_item;<br>
Table
"public.ext_feeder_item"<br>
Column | Type | Modifiers | Description<br>
----------+--------------------------+--------------------------------------------------------------+-------------<br>
id | bigint | not null default
nextval('ext_feeder_item_id_seq'::regclass) |<br>
feed_id | bigint | not
null |<br>
pub_date | timestamp with time zone
| |<br>
Indexes:<br>
"ext_feeder_item_pkey" PRIMARY KEY, btree (id)<br>
"ext_feeder_item_feed_id_pub_date_idx" btree (feed_id, pub_date)<br>
"ext_feeder_item_idx" btree (feed_id)<br>
Triggers:<br>
....<br>
Has OIDs: no</tt><tt><br>
</tt></li>
<li>Statistics for the fields feed_id and pub_date are set to 1000;</li>
<li>The table have just been vacuumed and analyzed.</li>
<li>A simple query to the table:<br>
<tt> SELECT<br>
id<br>
FROM<br>
ext_feeder_item AS i<br>
WHERE<br>
i.feed_id IN (...)<br>
ORDER BY pub_date DESC, id DESC<br>
LIMIT 11 OFFSET 0;<br>
<br>
</tt>with many (~1200) ids in the IN() statement.</li>
<li>The count of rows distribution for these ids (may be thought of
as foreign keys in this table) is the following:<br>
id = 54461: ~180000 - actually the most heavy id in the whole table.<br>
other ids: a single id at most specifies 2032 rows; 6036 rows total.</li>
<li>If I perform a query with<br>
<tt>IN(54461, ...)</tt><br>
it stably (5 attempts) takes 4.5..4.7 secs. to perform.<br>
<tt>QUERY PLAN<br>
Limit  (cost=1463104.22..1463104.25 rows=11 width=16) (actual
time=4585.420..4585.452 rows=11 loops=1)<br>
  -&gt;  Sort  (cost=1463104.22..1464647.29 rows=617228 width=16)
(actual time=4585.415..4585.425 rows=11 loops=1)<br>
        Sort Key: pub_date, id<br>
        Sort Method:  top-N heapsort  Memory: 17kB<br>
        -&gt;  Bitmap Heap Scan on ext_feeder_item i
(cost=13832.40..1449341.79 rows=617228 width=16) (actual
time=894.622..4260.441 rows=185625 loops=1)<br>
              Recheck Cond: (feed_id = ANY ('{54461, ...}'::bigint[]))<br>
              -&gt;  Bitmap Index Scan on ext_feeder_item_idx
(cost=0.00..13678.10 rows=617228 width=0) (actual time=884.686..884.686
rows=185625 loops=1)<br>
                    Index Cond: (feed_id = ANY ('{54461,
...}'::bigint[]))<br>
Total runtime: 4585.852 ms</tt><tt><br>
</tt></li>
<li>If I perform a query with<br>
<tt>IN(..., 54461)<br>
</tt>it stably (5 attempts) takes 9.3..9.5 secs. to perform.<br>
<tt>QUERY PLAN<br>
Limit  (cost=1463104.22..1463104.25 rows=11 width=16) (actual
time=9330.267..9330.298 rows=11 loops=1)<br>
  -&gt;  Sort  (cost=1463104.22..1464647.29 rows=617228 width=16)
(actual time=9330.263..9330.273 rows=11 loops=1)<br>
        Sort Key: pub_date, id<br>
        Sort Method:  top-N heapsort  Memory: 17kB<br>
        -&gt;  Bitmap Heap Scan on ext_feeder_item i
(cost=13832.40..1449341.79 rows=617228 width=16) (actual
time=1018.401..8971.029 rows=185625 loops=1)<br>
              Recheck Cond: (feed_id = ANY ('{... ,54461}'::bigint[]))<br>
              -&gt;  Bitmap Index Scan on ext_feeder_item_idx
(cost=0.00..13678.10 rows=617228 width=0) (actual
time=1008.791..1008.791 rows=185625 loops=1)<br>
                    Index Cond: (feed_id = ANY ('{...
,54461}'::bigint[]))<br>
Total runtime: 9330.729 ms<br>
</tt></li>
</ul>
I don't know what are the roots of the problem, but I think that some
symptomatic healing could be applied: the PostgreSQL could sort the IDs
due to the statistics.<br>
So currently I tend to select the IDs from another table ordering them
due to their weights: it's easy for me thanks to denormalization.<br>
<br>
Also I would expect from PostgreSQL that it sorted the values to make
index scan more sequential, but this expectation already conflicts with
the bug described above :)<br>
</body>
</html>

Attachment Content-Type Size
unknown_filename text/html 4.5 KB

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Oleg Bartunov 2008-05-29 09:39:57 Re: IN() statement values order makes 2x performance hit
Previous Message Jignesh K. Shah 2008-05-29 03:01:37 Re: 2GB or not 2GB