Re: On-disk compatibility for nbtree-unique-key enhancement

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: On-disk compatibility for nbtree-unique-key enhancement
Date: 2018-09-28 20:03:01
Lists: pgsql-hackers

On Fri, Sep 28, 2018 at 8:00 AM Peter Eisentraut
<peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
> On 21/09/2018 01:18, Peter Geoghegan wrote:
> > * This means that there is a compatibility issue for anyone that is
> > already right on the threshold -- we *really* don't want to see a
> > REINDEX fail, but that seems like a possibility that we need to talk
> > about now.
> When would the REINDEX need to happen? Will the code still be able to
> read and write v3 btrees?

The patch doesn't do that at the moment, because I've been busy
refining it, and because there are a couple of outstanding questions
about how to go about it -- the questions I pose on this thread. I
accept that it's absolutely essential that nbtree be able to read
both v2 and v3 indexes as part of any new v4, without a measurable
performance penalty. That's the standard that this project should be
held to.

A REINDEX will never *need* to happen. v2 and v3 indexes will
gradually go extinct, without many users actually noticing.

The on-disk representation of my patch leaves several status bits
free in INDEX_ALT_TID_MASK tuples (3 of the 4 will remain, since I'm
now using 1 for BT_HEAP_TID_ATTR), so it should be easier to add
various further enhancements to a v5 or v6 of nbtree. This is similar
to how changes to GIN were managed in the past (it may be interesting
to look at a GIN leaf page with pg_hexedit, since it'll show you the
gory details in a relatively accessible way). I can imagine an
INDEX_ALT_TID_MASK bit being used for tuples that point to the heap --
not just for pivot tuples. I have an eye on things like duplicate
lists on the leaf level, which would probably work like a GIN posting
list.

> Could there perhaps be an amcheck or
> pageinspect feature that tells you ahead of time if there are too large
> items in an old index?

That would be easy, but it might not be any better than just having
REINDEX or CREATE INDEX [CONCURRENTLY] throw an error -- those
operations are already pretty fast. I could easily raise a WARNING
when amcheck is run against a pre-v4 index that has an index tuple
too big to fit under the new, lower limit. Actually, I could even
write an SQL query that uses pageinspect to notice affected tuples,
without changing any C code.

Bear in mind that TOAST compression accidentally plays a big role
here. It makes it very unlikely that indexes in the field are right at
the old 2712 byte threshold, without even 8 bytes of wiggle room,
because it's impossible to predict how well the pglz compression will
work with that kind of precision. Several highly improbable things
need to happen at the same time before REINDEX can break. I cannot see
how any app could have evolved to depend on having 2712 bytes, without
even a single MAXALIGN() quantum to spare.

I wrote a stress test around the new "1/3 of a page" restriction. It
involved a large text attribute with PLAIN storage, since I couldn't
sensibly test the restriction while using pglz compression in the
index. When all of your tuples are 2704 bytes, you end up with a
ridiculously tall B-Tree that performs horribly. I think that I saw
that it had 11 levels with the test case. The tallest B-Tree that
you'll ever see in the wild is probably one that's 5 levels deep,
which is very tall indeed. Because of the logarithmic nature of how a
new level is added to a B-Tree, 11 levels is just ludicrous. (Granted,
you only have to have one tuple that's precisely 2712 bytes in length
for REINDEX to break.)

Peter Geoghegan

