On-disk compatibility for nbtree-unique-key enhancement

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: On-disk compatibility for nbtree-unique-key enhancement
Date: 2018-09-20 23:18:37
Message-ID: CAH2-WzmjgBz-RL2-nyPc+NRZnU73YSLGEwZRyB2DhUQEdkEujg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

I would like to get some feedback on dealing with on-disk
compatibility for my nbtree patch, which is in the current CF [1]. I
thought I'd start a new thread on these questions, since it's probably
the most sensitive aspect of the project, and can be discussed in a
fairly contained way.

My patch to make nbtree index tuple keys uniquely identifiable by
using heap TID as a differentiating last attribute/part of the key
space complicates space management within routines like
_bt_findsplitloc(). Though, ideally, not very much. In general, we may
need to make new "pivot" tuples [2] (all of which originate as the new
high key added on the left side of a leaf page split) have an
additional, explicit representation of heap TID. I'm MAXALIGN()'ing
ItemPointerData to fit the "extra" heap TID attribute once an actual
page split in underway, so that's an extra 8 bytes that I might have
to append to a nonpivot leaf tuple when a new high key/pivot is
generated during a leaf page split. This 8 bytes needs to be accounted

I want to make a general pessimistic assumption that it's always
necessary to have an extra 8 bytes to fit a heap TID, stolen from the
"1/3 of a buffer" limit that we already enforce as the largest
possible nbtree tuple size. Prior to the current _bt_findsplitloc()
coding (circa 1999), there were bugs about a failure to find a split
point, and I'm keen to avoid more problems along those lines.
Actually, I think that it's essential to keep the space management
logic as simple as possible, with everything ideally confined to a
single page (and not reliant on things that are true across pages). So
I feel reasonably confident that the best thing to do is impose
certain theoretical costs on users in the short term, rather than
doing something fancy and brittle that might be much harder to
stabilize in the long term.

More concretely:

* My patch changes the definition of BTMaxItemSize() for indexes on
the proposed new BTREE_VERSION, v4. We insist that new tuples must be
less than or equal to 2704 bytes in size, including all IndexTuple
overhead, rather than the current 2712 bytes.

* Under this scheme, the "internal" limit will remain 2712 bytes, so
that pivot tuples that are already at the proposed new user-visible
maximum of 2704 can safely have another 8 bytes added.

* This means that there is a compatibility issue for anyone that is
already right on the threshold -- we *really* don't want to see a
REINDEX fail, but that seems like a possibility that we need to talk
about now.

We've documented that there is this 1/3 of a page restriction, but
specify that it's "approximate", so I don't think our documentation as
written would need to be changed (though a release note item would
certainly be in order). Also, since the vagaries of how well TOAST can
compress are very hard to understand, I have to imagine that almost
nobody relies on having the last 8 bytes -- if they do, then they're
playing with fire.

Not being able to REINDEX is still really awful, however unrealistic
the scenario may be. I can imagine ways of avoiding it, but I'd really
rather not go there, since the cure is worse than the disease. I'll
still throw out some ideas for a cure, though:

* Maybe we could rely on the fact that internal pages are the only
pages that can contain more than one maybe-extra-sized 2720 byte pivot
tuple, since in practice they also always have one "negative infinity"
item. Those are always 16 bytes, so it works out.

We've done something a bit like this before [3], though I can see big
downsides to going further with it. It seems pretty grotty to me. I
actually would like to *not* truncate outside of the leftmost page on
the level, so that I can use the not-actually-negative-infinity first
item in internal pages as a "low key". This would make low-overhead
prefix compression within internal pages work. (This is not to be
confused with leaf level prefix compression, which the DBA would have
to enable.)

* Alternatively, we could lift the general restriction on out-of-line
storage within indexes, but I think that that would probably break
certain things in a way that would go undetected for a long time.

It would break amcheck's heapallindexed check, for one thing, and it
might have implications for buffer locking protocols. In general,
out-of-line storage within indexes seems like it works against what an
index is supposed to do. And, it's a lot of effort for a small to
nonexistent reward.

Does it seem like I have the right general idea here?

[1] https://commitfest.postgresql.org/19/1787/
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/nbtree/README;h=3680e69b89a8458d58f6c3361eb612788fa33786;hb=df8b5f3eb8a7c477156d0ad9d83e7297912cfe79#l612
[3] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/include/access/itup.h;h=bd3a702380954bc69c9536876094249430cb7c72;hb=df8b5f3eb8a7c477156d0ad9d83e7297912cfe79#l139
Peter Geoghegan


Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2018-09-20 23:19:26 Re: [PATCH] Tab completion for ALTER DATABASE … SET TABLESPACE
Previous Message Tom Lane 2018-09-20 23:03:16 Re: [PATCH] Tab completion for ALTER DATABASE … SET TABLESPACE