Re: index problems (again)

From: Geoff Winkless <pgsqladmin(at)geoff(dot)dj>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: index problems (again)
Date: 2016-03-07 16:37:37
Message-ID: CAEzk6fdjPxaVNXXH0TiqPacJDOvt7z-Fma=iXtKOrpYY5CivZg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 7 March 2016 at 16:02, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> In English, what that plan is trying to do is scan the index
> in sc_id order until it hits a row with scdate in the target range.
> The first such row, by definition, has the correct min(sc_id) value.
> The problem is that we're guessing at how soon we'll hit such a row.
> If the columns are independent, then the planner can guess based on how
> many rows in the whole table have scdate in the target range, and it
> will probably be about right. But that estimate can fall down very
> badly if sc_id and scdate increase together, because then the target
> rows aren't randomly distributed in the index sequence but could all be
> all the way at the far end of the index.

I'm sorry, I'm obviously not being clear. I already accepted this
argument when Victor gave it, although I believe that in part it falls
down because sc_id is also (potentially) randomly distributed so it's
not like you're doing a sequential table scan (it might work better on
a clustered table, but we don't have those :) )

So you still have an extra layer of indirection into a large table
with lots of random accesses.

> If we had cross-column correlation stats we could detect this pitfall,
> but without that it's hard to do.

But as far as I can see, apart from the absolute extremes, the
index-only scan is _always_ going to be quicker than the index+table
scan. It doesn't matter whether or not the distribution is random or
skewed, the index-only scan is going to be better (or approximately
equally as good). We can see that by the massive speedup I get by
using index(scid,scdate), which in all other respects is going to
suffer from exactly the same problem from that the scid-only index
suffers.

And the real advantage: at the extremes, the index-only worst-case is
minimally worse than the best case. Whereas the worst-case of the
index-scan-plus-table-compare method is horrific.

I don't believe you need any further statistics than what is currently
available to be able to make that judgement, and that's why I believe
it's suboptimal.

Geoff

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2016-03-07 16:44:38 Re: index problems (again)
Previous Message Adrian Klaver 2016-03-07 16:35:39 Re: Inserting JSON via Java PreparedStatment