Bad pg_stats.correlation with repeated keys.

From: Ralph Loader <ralph(dot)loader(at)emulex(dot)com>
To: <pgsql-bugs(at)postgresql(dot)org>
Subject: Bad pg_stats.correlation with repeated keys.
Date: 2013-08-15 23:10:41
Message-ID: 520D5FF1.1010307@emulex.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi,

I'm experiencing a performance problem with sql queries on a table with
large numbers of repeated keys.

I've tracked it down to the pg_stats.correlation being overly
optimistic. The structure is:

* Tables of ~ 100 million 100 byte rows, with a per-second timestamp.
* A single btree index on the timestamp.
* We're inserting ~ 10000 rows / second, almost (but not entirely) in
timestamp order.
* The tables are insert-only : deletes are done exclusively by dropping
entire tables.

I found:

* Disk I/O bound queries on the tables are much faster with a
bitmap-scan than an index-scan [it's an order of magnitude difference].

* No matter how high I set random_page_cost (and with
cursor_tuple_fraction=1), I could not get the query planner to use
bitmap-scans. It would insist on using index-scans until I set
random_page_cost to around 10000, and then flip to a full table seq-scan.

* Running strace on the postgres process with an index-scan, the data
reads are highly non-sequential, lseek()ing on nearly every read.

* The pg_stats.correlation value on the index is 0.999993 (varies
slightly, nearly always 5 nines). The strace output indicates that this
is completely wrong.

* Tweaking the postgres code to limit the correlation used to the range
-0.99 to 0.99 immediately got the query planner using bitmap-scan.

I suspect that the cause is related to this comment in nbtinsert.c:

"If the new key is equal to one or more existing keys, we can
legitimately place it anywhere in the series of equal keys"

This is true, but arbitrarily ordering repeat keys will break the
correlation between the index order and the table data order.

I think that either pg_stats.correlation should reflect that fact, or
even better, the btree insert should attempt to actually match the on
disk order. [For my use case, repeated keys insert after existing would
be right, I am not sure if that is always true.]

Cheers,
Ralph Loader.

PS. A tangentally related comment : bitmap scans achieve good ordering
for the data reads, but not necessarily for the index look-up. A
breadth-first btree lookup that takes disk-order into account would give
a significant performance gain for large un-cached indexes.

________________________________
Ralph Loader

Senior Software Engineer
Level 2, Building A, The Millennium Building Phase 2, 600 Great South Road, Ellerslie
Auckland City,
Auckland,
1051

+64 9 926 2902 Office
www.emulex.com
[http://www.emulex.com/files/signature/emulexendacelogo.jpg] <http://www.emulex.com/emulex-connects/>
This message contains Emulex confidential information intended only for specific recipients and is not to be forwarded to anyone else. If you have received this message in error, please delete it immediately. Thank you.

Browse pgsql-bugs by date

  From Date Subject
Next Message Ralph Loader 2013-08-15 23:36:48 Bad pg_stats.correlation with repeated keys.
Previous Message Cal Martin 2013-08-15 20:16:29