Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location

From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location
Date: 2016-11-22 01:15:39
Message-ID: CAGTBQpY6a3qfMuT9MrQk2VFxrUe4dJS3j2CXOh+7kgQ8yKcOiw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Nov 21, 2016 at 5:42 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>>> Or, you might want to make
>>> sure that B-Tree tuple insertions still find that "first page" to
>>> check, despite still generally treating heap item pointer as part of
>>> the key proper (in unique indexes, it can still behave like NULL, and
>>> not participate in uniqueness checking, while still essentially being
>>> part of the key/sort order).
>>
>> It behaves like that (even though it's not coded like that)
>
> Why not? What do you mean?
>
> What I'm interested in evaluating is an implementation where an index
> on (foo) has a sort order/key space that is precisely the same as an
> index on (foo, ctid), with zero exceptions.

Well, the patch does the same sorting, but without explicitly adding
the ctid to the scankey.

When inserting into a unique index, the lookup for the insertion point
proceeds as it would if the key was (foo, ctid), finding a page in the
middle of the range that contain item pointers for foo.

When that happens on a regular index, the insertion is done at that
point and nothing else needs to be done. But when it happens on a
unique index, the code first checks to see if the first item pointer
for foo is on the same page (allegedly a frequent case). If so, the
uniqueness check is done in a very straightforward and efficient
manner. If not, however, the code retraces its steps up the tree to
the first time it followed a key other than foo (that's the only way
it could land at a middle page), and then follows the downlinks
looking at a truncated key (just foo, ignoring ctid).

Kind of what you say, but without the explicit null.

>
>>> It also occurs to me that our performance for updates in the event of
>>> many NULL values in indexes is probably really bad, and could be
>>> helped by this. You'll want to test all of this with a workload that
>>> isn't very sympathetic to HOT, of course -- most of these benefits are
>>> seen when HOT doesn't help.
>>
>> I haven't really tested with an overabundance of NULLs, I'll add that
>> to the tests. I have tested low-cardinality indexes though, but only
>> for correctness.
>
> I think we'll have to model data distribution to evaluate the idea
> well. After all, if there is a uniform distribution of values anyway,
> you're going to end up with many dirty leaf pages. Whereas, in the
> more realistic case where updates are localized, we can hope to better
> amortize the cost of inserting index tuples, and recycling index
> tuples.

Quite possibly

>> What do you mean with "first class part"?
>>
>> It's not included in the scankey for a variety of reasons, not the
>> least of which is not breaking the interface for current uses, and
>> because the tid type is quite limited in its capabilities ATM. Also,
>> the heap TID is usually there on most AM calls that care about it (ie:
>> insertions have it, of course), so adding it to the scankey also
>> seemed redundant.
>
> I mean that insertions into indexes that are low cardinality should be
> largely guided by TID comparisons.

...

>> If not adding to the scankey, what do you mean by "first class" then?
>
> Just what I said about the sort order of an index on "(foo)" now
> precisely matching what we'd get for "(foo, ctid)".

It is the case in both versions of the patch

> There are a couple
> of tricky issues with that that you'd have to look out for, like
> making sure that the high key continues to hold a real TID, which at a
> glance looks like something that just happens already. I'm not sure
> that we preserve the item pointer TID in all cases, since the current
> assumption is that a high key's TID is just filler -- maybe we can
> lose that at some point.

I thought long about that, and inner pages don't need a real TID in
their keys, as those keys only specify key space cutoff points and not
real tids, and high tids in leaf pages are always real.

I haven't had any issue with that, aside from the headaches resulting
from thinking about that for protracted periods of time.

> You should use amcheck to specifically verify that that happens
> reliably in all cases. Presumably, its use of an insertion scankey
> would automatically see the use of TID as a tie-breaker with patched
> Postgres amcheck verification, and so amcheck will work for this
> purpose unmodified.

It's totally on my roadmap. I'll have to adapt it for the new index
format though, it doesn't work without modification.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2016-11-22 01:17:52 Re: Re: [COMMITTERS] pgsql: autovacuum: Drop orphan temp tables more quickly but with more c
Previous Message Tom Lane 2016-11-22 00:57:14 Re: Re: [COMMITTERS] pgsql: autovacuum: Drop orphan temp tables more quickly but with more c