Re: Bad pg_stats.correlation with repeated keys.

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Ralph Loader <ralph(dot)loader(at)emulex(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Bad pg_stats.correlation with repeated keys.
Date: 2013-11-05 23:24:33
Message-ID: 3530.1383693873@sss.pgh.pa.us
Lists: pgsql-bugs

[ catching up on old email ]

Ralph Loader <ralph(dot)loader(at)emulex(dot)com> writes:
> I'm experiencing a performance problem with sql queries on a table with
> large numbers of repeated keys.
> I've tracked it down to the pg_stats.correlation being overly
> optimistic ...
> [ for an index on a low-resolution time-of-insertion column ]

> * The pg_stats.correlation value on the index is 0.999993 (varies
> slightly, nearly always 5 nines). The strace output indicates that this
> is completely wrong.

> * Tweaking the postgres code to limit the correlation used to the range
> -0.99 to 0.99 immediately got the query planner using bitmap-scan.

> I suspect that the cause is related to this comment in nbtinsert.c:

> "If the new key is equal to one or more existing keys, we can
> legitimately place it anywhere in the series of equal keys"

> This is true, but arbitrarily ordering repeat keys will break the
> correlation between the index order and the table data order.

> I think that either pg_stats.correlation should reflect that fact, or
> even better, the btree insert should attempt to actually match the on
> disk data order. [For my use case, repeated keys insert after existing
> would be right, I am not sure if that is always true.]

I don't like the idea of monkeying with the correlation estimate here.
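
For reference, the statistic in question is essentially a Pearson
correlation between a sampled value's rank in sort order and its physical
position in the heap; ANALYZE computes it in compute_scalar_stats() in
analyze.c. A simplified standalone sketch (not the actual backend code)
shows why a duplicate-heavy, insert-ordered column scores close to 1 even
though index order within each run of equal keys is arbitrary:

#include <stdio.h>
#include <math.h>

/*
 * What pg_stats.correlation measures, boiled down: the Pearson
 * correlation between heap position and value rank over a sample.
 * The statistic only sees the *values*, which are monotone in heap
 * order for a time-of-insertion column, so it cannot tell that index
 * order within each run of equal keys does not match heap order.
 */
static double
order_correlation(const double *rank, int n)
{
	double		sx = 0, sy = 0, sxy = 0, sxx = 0, syy = 0;

	for (int i = 0; i < n; i++)
	{
		double		x = i;			/* physical position in the heap */
		double		y = rank[i];	/* rank of the value in sort order */

		sx += x;
		sy += y;
		sxy += x * y;
		sxx += x * x;
		syy += y * y;
	}
	return (n * sxy - sx * sy) /
		(sqrt(n * sxx - sx * sx) * sqrt(n * syy - sy * sy));
}

int
main(void)
{
	/* e.g. a coarse timestamp: 1,1,1,2,2,2,3,3 in heap order, tied ranks */
	double		rank[] = {1, 1, 1, 4, 4, 4, 6.5, 6.5};

	printf("correlation = %f\n", order_correlation(rank, 8));
	return 0;
}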

It seems like the ideal fix is to get rid of the randomized choice in
btree insertion in favor of breaking ties using the TID to be inserted;
that is, for insertion attempts, we'd view the tuple TID as an additional
low-order index column, and thus there would never be any true duplicates
and every incoming index entry would have a unique place where it should
go. This would guarantee that reading a long sequence of duplicate keys
from the index would visit the heap in TID order.
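
Roughly, the comparison rule would look like this (a standalone sketch
with made-up types, not the backend's ItemPointerData or _bt_compare):

#include <stdio.h>
#include <stdint.h>

/* Simplified stand-in for the backend's ItemPointerData (block, offset). */
typedef struct
{
	uint32_t	block;
	uint16_t	offset;
} tid_t;

/*
 * Compare two index entries, treating the heap TID as a trailing
 * low-order key column.  Under this rule no two entries ever compare
 * equal, so every insertion has exactly one legal position, and a run
 * of duplicate keys is always stored in heap (TID) order.
 */
static int
entry_compare(int64_t key_a, tid_t tid_a, int64_t key_b, tid_t tid_b)
{
	if (key_a != key_b)
		return (key_a < key_b) ? -1 : 1;
	if (tid_a.block != tid_b.block)
		return (tid_a.block < tid_b.block) ? -1 : 1;
	if (tid_a.offset != tid_b.offset)
		return (tid_a.offset < tid_b.offset) ? -1 : 1;
	return 0;					/* same key, same TID: the same entry */
}

int
main(void)
{
	tid_t		a = {42, 1};
	tid_t		b = {42, 2};

	/* equal keys, so heap order (block, then offset) breaks the tie */
	printf("%d\n", entry_compare(7, a, 7, b));	/* prints -1 */
	return 0;
}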

However, I'm unsure what such a rule would do to insertion costs. The
point of the existing code is to avoid unnecessary index page splits.
Before we had that logic, a bunch of equal-keyed insertions would all
land at the same spot in the index, and thus result in a series of
page splits, even though there might be plenty of free space in nearby
pages. I'm worried that if we're inserting tuples in TID order (which'd
usually be the case for an insert-mostly table like yours), we'd end up
with the same problem only at the right end instead of the left end of
the run of duplicates.
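
A toy simulation (page capacity and the 50/50 split rule are made up for
illustration; real nbtree split logic is more involved) shows the shape of
that worry: always appending at one end of the run forces a split every
half page of inserts and strands the abandoned halves at 50% full:

#include <stdio.h>

/*
 * Toy model: duplicate-keyed entries inserted in TID order always land
 * at the right end of the run.  Each time the page holding the run's
 * right end fills, it splits 50/50 and insertion continues in the right
 * half; the left halves are never revisited, so they stay half full.
 */
#define PAGE_CAP 100

int
main(void)
{
	int			inserts = 100000;
	int			pages = 1;		/* pages holding the run of duplicates */
	int			splits = 0;
	int			right_fill = 0; /* entries on the run's rightmost page */

	for (int i = 0; i < inserts; i++)
	{
		if (right_fill == PAGE_CAP)
		{
			splits++;
			pages++;
			right_fill = PAGE_CAP / 2;	/* right half after a 50/50 split */
		}
		right_fill++;
	}
	printf("%d inserts: %d splits, %d pages, avg fill %.0f%%\n",
		   inserts, splits, pages,
		   100.0 * inserts / ((double) pages * PAGE_CAP));
	return 0;
}

The randomized placement we have today spreads those insertions across
the free space that such half-full pages would otherwise waste.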

Another problem is backward compatibility with existing btree indexes in
which equal-keyed entries fail to appear in TID order. The tree descent
algorithms would see that as an index with corrupt ordering, and possibly
misbehave. It might be that no serious harm would ensue, and you'd still
get the new index entry inserted at some legal place within the run of
matching key values. Or maybe not; it'd take some investigation to be
sure.

Anyway, I'm not terribly excited about this scenario myself, and am not
volunteering to do any of the above. Just throwing out some thoughts
in case anybody else wants to work on it.

regards, tom lane
