estimate correlation of index separately from table (Re: index fragmentation on insert-only table with non-unique column)

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Ralph Loader <suckfish(at)ihug(dot)co(dot)nz>
Subject: estimate correlation of index separately from table (Re: index fragmentation on insert-only table with non-unique column)
Date: 2017-07-07 23:41:19
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-performance

Months ago I reported an issue with very slow index scan of tables with high
"correlation" of its indexed column, due to (we concluded at the time)
duplicate/repeated values of that column causing many lseek()s. References:

My interim workarounds were an reindex job and daily granularity partitions
(despite having an excessive number of child tables) to query execution time]]
to encourage seq scans for daily analysis jobs rather than idx scan. I've now
cobbled together a patch to throw around and see if we can improve on that. I
tried several changes, hoping to discourage index scan. The logic that seems
to most accurately reflect costs has two changes:

Postgres' existing behavior stores table correlation (heap value vs. position)
but doesn't look at index correlation, so can't distinguish between a just-built
index, and a highly fragmented index, or one with highly-nonsequential TIDs.
My patch causes ANALYZE to do a TID scan (sampled across the MCVs) to determine
correlation of heap TID vs index tuple logical location (as opposed to the
table correlation, computed as: heap TID vs. heap value).

The second change averages separate correlation values of small lists of 300
consecutive TIDs, rather than the course-granularity/aggregate correlation of a
small sample of pages across the entire index. Postgres' existing sampling is
designed to give an even sample across all rows. An issue with this
course-granularity correlation is that the index can have a broad correlation
(to physical heap location), but with many small-scale deviations, which don't
show up due to sampling a relatively small fraction of a large table; and/or
the effect of the deviations is insignificant/noise and correlation is still
computed near 1.

I believe the "large scale" correlation that postgres computes from block
sample fails to well represent small-scale uncorrelated reads which contribute
large number of random reads, but not included in planner cost.

Not only are the index reads highly random (which the planner already assumes),
but the CTIDs referenced within a given btree leaf page are also substantially
non-sequential. It seems to affect INSERTs which, over a short interval of
time have a small-moderate range of column values, but which still have a
strong overall trend in that column WRT time (and heap location). Think of
inserting a "normal" distribution of timestamps centered around now().

My original report shows lseek() for nearly every read on the *heap*. The
original problem was on a table of telecommunications CDRs, indexed by "record
opening time" (start time) and with high correlation value in pg_stats. We
import data records from a file, which is probably more or less in order of
"end time".

That still displays broad correlation on "start time", but individual TIDs are
nowhere near sequential. Two phone calls which end in the same 1 second
interval are not unlikely to have started many minutes apart... but two calls
which end within the same second are very likely to have started within an hour
of each other.. since typical duration is <1h. But, insert volume is high, and
there are substantial repeated keys, so the portion of an index scan returning
CTIDs for some certain key value (timestamp with second resolution in our case)
ends up reading heap tuples for a non-negligible fraction of the table: maybe
only 1%, but that's 300 seeks across 1% of a table which is 10s of GB ...
which is still 300 seeks, and takes long enough that cache is inadequate to
substantially mitigate the cost.

It's not clear to me that there's a good way to evenly *sample* a fraction of
the index blocks in a manner which is agnostic to different AMs. Scanning the
entirety of all indices on relations during (auto) analyze may be too
expensive. So I implemented index scan of the MCV list. I'm guessing this
might cause the correlation to be under-estimated, and prefer bitmap scans
somewhat more than justified, due to btree insertion logic for repeated keys,
to reduce O(n^2) behavior. MCV list isn't perfect since that can happen with
eg. normally distributed floating point values (or timestamp with fractional

I ran pageinspect on a recent child of the table that triggered the original

ts=# SELECT itemoffset, ctid FROM bt_page_items('cdrs_huawei_pgwrecord_2017_07_07_recordopeningtime_idx',6) LIMIT 22 OFFSET 1;
itemoffset | ctid
2 | (81,4)
3 | (5,6)
4 | (6,5)
5 | (9,1)
6 | (10,1)
7 | (14,1)
8 | (21,8)
9 | (25,1)
10 | (30,5)
11 | (41,1)
12 | (48,7)
13 | (49,3)
14 | (50,2)
15 | (51,1)
16 | (54,1)
17 | (57,7)
18 | (58,6)
19 | (59,6)
20 | (63,5)
21 | (71,6)
22 | (72,5)
23 | (78,7)
(22 rows)

=> 22 adjacent tuples referencing 22 different heap pages, only 6 of which are sequential => 16 lseek() aka random page cost.

To generate data demonstrating this problem you can do things like:
| CREATE TABLE t(i int, j int);
\! time for a in `seq 99`; do psql -qc 'INSERT INTO t SELECT * FROM generate_series(1,99)'; done -- or,
| INSERT INTO t SELECT (0.001*a+9*(random()-0.5))::int FROM generate_series(1,99999) a; -- or,

or this one, perhaps closest to our problem case:
| INSERT INTO t SELECT a FROM generate_series(1,999) a, generate_series(1,999) b ORDER BY a+b/9.9;

Note, I was able to create a case using floats without repeated keys:
| INSERT INTO w SELECT i/99999.0+pow(2,(-random())) FROM generate_series(1,999999) i ORDER BY i

-- note: the correlation is even higher for larger tables with same number of
-- repeated keys, which is bad since the cost should go up linearly with the
-- size and associated number of lseek(). That's one component of the problem
-- and I think a necessarily component of any solution.
postgres=# SELECT tablename, attname, correlation, array_length(most_common_vals,1) FROM pg_stats WHERE tablename LIKE 't%';
tablename | attname | correlation | array_length
t | i | 0.995212 | 87
t | j | |
t_i_idx | i | -0.892874 | 87
... this last line is added by my patch.

That's a bad combination, high table correlation means the planner thinks only
a fraction of the heap will be read, and sequentially, so isn't discouraged
from doing index scan. But index TIDs are much less well correlated (0.89 vs

Note: the negative correlation at tuple-level seems to be related to this comment:
src/backend/access/nbtree/nbtinsert.c- * Once we have chosen the page to put the key on, we'll insert it before
src/backend/access/nbtree/nbtinsert.c: * any existing equal keys because of the way _bt_binsrch() works.

Note also:
|postgres=# SELECT leaf_fragmentation FROM pgstatindex('t_i_idx');
|leaf_fragmentation | 67.63
.. but keep in mind that leaf_fragmentation only checks leaf page order, and
not CTID order within index pages. The planner already assumes index pages are
random reads. Maybe it's a good indicator (?), but doesn't lend itself to an
accurate cost estimation.

For testing purposes, with:
| shared_buffers | 128kB
| public | t | table | pryzbyj | 35 MB |
| SET enable_bitmapscan=off;
| SET enable_seqscan=off;
| SELECT pg_backend_pid();
| SELECT sum(j) FROM t WHERE i<99;
| Time: 3974.205 ms

% sudo strace -p 530 2>/tmp/strace-pg
% sed -r '/^\(read|lseek/!d; s/ .*//' /tmp/strace-pg |sort |uniq -c |sort -nr |head
39474 lseek(41,
359 lseek(44,
8 lseek(18,
=> 40k seeks on an 35MB table, that's 10 seeks per heap page!

open("base/12411/16634", O_RDWR) = 41
open("base/12411/16636", O_RDWR) = 44

postgres=# SELECT relfilenode, relname FROM pg_class WHERE relfilenode>16630;
16634 | t
16636 | t_i_idx

2017-07-07 17:45:54.075 CDT [6360] WARNING: HERE 1222: csquared=0.797225 minIO/R-P-C=109.750000 maxIO/R-P-C=4416.000000 costsize.c cost_index 702

With my patch, index scan estimated to take ~4400 seeks, rather than actual
40k, probably because max_IO_cost assumes that a given heap page will be
visited only once. But in the worst case each of its tuples would require a
separate lseek().... I'm not suggesting to change that, since making max_IO
100x higher will probably change query plans more dramatically than desired..
But also note that, unpatched, with table correlation >0.99, postgres would've
under-estimated min_IO_cost not by a factor of 10x but by 400x.

| postgres=# REINDEX INDEX t_i_idx;
| postgres=# ANALYZE t;
| postgres=# SELECT tablename, attname, correlation, array_length(most_common_vals,1) FROM pg_stats WHERE tablename LIKE 't%';
| tablename | attname | correlation | array_length
| -----------+---------+-------------+--------------
| t | i | 0.995235 | 67
| t_i_idx | i | 1 | 67

=> Correctly distinguishes freshly reindexed table.

% sudo strace -p 6428 2>/tmp/strace-pg8
% sed -r '/^\(read|lseek/!d; s/ .*//' /tmp/strace-pg8 |sort |uniq -c |sort -nr
99 lseek(37,

2017-07-07 17:49:47.745 CDT [6428] WARNING: HERE 1222: csquared=1.000000 minIO/R-P-C=108.500000 maxIO/R-P-C=4416.000000 costsize.c cost_index 702
=> csquared=1, so min_IO cost is used instead of something close to max_IO
(this patch doesn't change the computation of those values at all, just changes
correlation, which squared value is used to interpolate between correlated/min
cost and uncorrelated/max cost).

correlated estimate: 108 seeks vs 99 actual, is essentially what unpatched
postgres would've computed by using the table correlation of 0.9952, implicitly
assuming the index to be similarly correlated.

I hope that demonstrates the need to distinguish index correlation from table
correlation. I'm hoping for comments on the existing patch, specifically if
there's a better way to sample the index than "MCVs or partial index scan".
I've left some fragments in place of earlier implementation involving btree
page level fragmentation (like statindex). Also: does it make sense to keeping
MCV/histogram for non-expression index which duplicates table column ? The
stats lookup in selfuncs.c btcostestimate() would have to check for correlation
from index, and rest of stats from its table.

A bitmap layer adds overhead, which should be avoided if not needed. But it
shouldn't be a huge impact, and I think its relative effect is only high when
returning a small number of rows. I'm thinking of a few cases.

- unique / low cardinality index scans or without duplicate keys: should have
low index_pages_fetched(), so max_IO_cost should not be very different from
min_io_cost, and new index-based correlation shouldn't have much effect
different from table correlation.
- unclustered/uncorrelated tables: tables whose heap have low correlation
already discouraged from index scan; this includes tables whose column is
UPDATEd and not just INSERTed;
- table with correlated heap AND index: csquared should still be ~0.99 and not
change much;
- correlated table with uncorrelated index: this is the target case with
intended behavior change

My apologies: I think that's a bit of a brain dump.


Attachment Content-Type Size
pg-index-stats.patch2 text/plain 25.2 KB

In response to


Browse pgsql-performance by date

  From Date Subject
Next Message Peter Geoghegan 2017-07-08 03:27:02 Re: estimate correlation of index separately from table (Re: index fragmentation on insert-only table with non-unique column)
Previous Message Shaun Thomas 2017-07-07 13:12:58 Re: partitioning materialized views