Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location

From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location
Date: 2016-11-21 17:42:13
Message-ID: CAGTBQpaL-qru-+q0caipBE8XmsaiaZpy+FzEczh0sg2Th3pfAQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Nov 18, 2016 at 11:09 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> On Fri, Nov 18, 2016 at 5:27 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
>> I've been changing the on-disk format considerably, to one that makes
>> more sense.
>
> I can see how that would be necessary. I'm going to suggest some more
> things that need a new btree version number now, too.
>
> I realized today, quite suddenly, why I disliked your original patch:
> it didn't go far enough with embracing the idea of
> heap-item-pointer-as-key. In other words, I didn't like that you
> didn't change anything in the internal pages.

But it did. In fact it only changed internal pages.

> Maybe you should put
> heap TIDs someplace in new internal pages, so that even there we treat
> them as part of the key.

That is indeed what's happening (both in the old and new version).

The new version also opens up to the possibility of omitting the heap
TID in inner pages, if they're redundant (ie: if two consecutive keys
are different already, the heap TID part of the key is redundant,
since it's not necessary to make tree descent decisions).

> This may significantly benefit UPDATE-heavy
> workloads where HOT doesn't work out so well. Particularly for
> non-unique indexes, we currently have to wade through a lot of bloat
> from the leaf level, rather than jumping straight to the correct leaf
> page when updating or inserting.

That is improved in the patch as well. The lookup for an insertion
point for a heavily bloated (unique or not) index is done efficiently,
instead of the O(N^2) way it used before.

> You might not want to do the same with unique indexes, where we must
> initially buffer lock "the first leaf page that a duplicate could be
> on" while we potentially look in one or more additional sibling pages
> (but bloat won't be so bad there, perhaps).

It's done even for unique indexes. Locking in that case becomes
complex, true, but I believe it's not an insurmountable problem.

> Or, you might want to make
> sure that B-Tree tuple insertions still find that "first page" to
> check, despite still generally treating heap item pointer as part of
> the key proper (in unique indexes, it can still behave like NULL, and
> not participate in uniqueness checking, while still essentially being
> part of the key/sort order).

It behaves like that (even though it's not coded like that)

> It also occurs to me that our performance for updates in the event of
> many NULL values in indexes is probably really bad, and could be
> helped by this. You'll want to test all of this with a workload that
> isn't very sympathetic to HOT, of course -- most of these benefits are
> seen when HOT doesn't help.

I haven't really tested with an overabundance of NULLs, I'll add that
to the tests. I have tested low-cardinality indexes though, but only
for correctness.

>> However, I haven't tested the current state of the patch thoroughly.
>>
>> If a WIP is fine, I can post the what I have, rebased.
>
> I'm mostly curious about the effects on bloat. I now feel like you
> haven't sufficiently examined the potential benefits there, since you
> never made heap item pointer a first class part of the key space.
> Maybe you'd like to look into that yourself first?

What do you mean with "first class part"?

It's not included in the scankey for a variety of reasons, not the
least of which is not breaking the interface for current uses, and
because the tid type is quite limited in its capabilities ATM. Also,
the heap TID is usually there on most AM calls that care about it (ie:
insertions have it, of course), so adding it to the scankey also
seemed redundant.

If not adding to the scankey, what do you mean by "first class" then?

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Claudio Freire 2016-11-21 17:53:59 Re: Vacuum: allow usage of more than 1GB of work mem
Previous Message Alvaro Herrera 2016-11-21 17:29:41 Re: delta relations in AFTER triggers