Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location
Date: 2016-11-19 02:09:53
Message-ID: CAM3SWZQKde8E0Kf6FeDedSMU54RfqtW2kCVwqBspBe0mEunZKA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Nov 18, 2016 at 5:27 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> I've been changing the on-disk format considerably, to one that makes
> more sense.

I can see how that would be necessary. I'm going to suggest some more
things that need a new btree version number now, too.

I realized today, quite suddenly, why I disliked your original patch:
it didn't go far enough with embracing the idea of
heap-item-pointer-as-key. In other words, I didn't like that you
didn't change anything in the internal pages. Maybe you should put
heap TIDs someplace in new internal pages, so that even there we treat
them as part of the key. This may significantly benefit UPDATE-heavy
workloads where HOT doesn't work out so well. Particularly for
non-unique indexes, we currently have to wade through a lot of bloat
from the leaf level, rather than jumping straight to the correct leaf
page when updating or inserting.

You might not want to do the same with unique indexes, where we must
initially buffer lock "the first leaf page that a duplicate could be
on" while we potentially look in one or more additional sibling pages
(but bloat won't be so bad there, perhaps). Or, you might want to make
sure that B-Tree tuple insertions still find that "first page" to
check, despite still generally treating heap item pointer as part of
the key proper (in unique indexes, it can still behave like NULL, and
not participate in uniqueness checking, while still essentially being
part of the key/sort order).

Of course, it isn't great to add stuff to the internal pages, but if
you're also implementing suffix truncation, there is probably no harm
at all. You're only going to affect cases that will also greatly
benefit from having that extra information, to guide insertions of new
values to the right place more efficiently (by inserters, I mostly
mean insertions from UPDATEs).

It also occurs to me that our performance for updates in the event of
many NULL values in indexes is probably really bad, and could be
helped by this. You'll want to test all of this with a workload that
isn't very sympathetic to HOT, of course -- most of these benefits are
seen when HOT doesn't help.

> However, I haven't tested the current state of the patch thoroughly.
>
> If a WIP is fine, I can post the what I have, rebased.

I'm mostly curious about the effects on bloat. I now feel like you
haven't sufficiently examined the potential benefits there, since you
never made heap item pointer a first class part of the key space.
Maybe you'd like to look into that yourself first?

>> I also think that this might help with bitmap index scans.
>
> How so?

I was thinking about teaching nbtree to start the scan at the level
above the leaf level. If the heap TID was part of the key proper, one
can imagine it being much cheaper to build a bitmap for a moderate
cardinality value (e.g., NULL values needed for "WHERE foo IS NULL"
index scans). The exact details would need to be worked out, but one
can imagine building a bitmap that is much larger than necessary one
level up from the leaf, then lazily unsetting bits as we descend the
B-Tree to the leaf level and find pages whose bitmap bits can be
unset. We might balance the avoidance of I/O during the scan against
how much we might expect to save in a subsequent bitmap heap scan, and
so on. This might be based on a selectivity estimate.

That's all fairly hand-wavy, certainly, but I see significant
potential along these lines.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2016-11-19 02:20:16 Re: Mail thread references in commits
Previous Message Andres Freund 2016-11-19 02:05:43 Re: Mail thread references in commits