
Re: MaxOffsetNumber for Table AMs

From:	Peter Geoghegan <pg(at)bowt(dot)ie>
To:	Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: MaxOffsetNumber for Table AMs
Date:	2021-05-05 23:22:17
Message-ID:	CAH2-Wzmz-kbJL9KWCYvTGhP0YOgAwxwydu28tXO+OZHAeB=cdA@mail.gmail.com
Lists:	pgsql-hackers

On Wed, May 5, 2021 at 3:18 PM Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
> I believe that the TID is the unique identifier of that tuple, within context.
>
> For normal indexes, the TID as supplied directly by the TableAM is
> sufficient, as the context is that table.
> For global indexes, this TID must include enough information to relate
> it to the table the tuple originated from.

Clearly something like a partition identifier column is sometimes just
like a regular user-visible column, though occasionally not like one
-- whichever is useful to the implementation in each context. For
example, we probably want to do predicate pushdown, maybe with real
cataloged operators that access the column like any other user-created
column (the optimizer knows about the column, which even has a
pg_attribute entry). Note that we only ever access the TID column
using an insertion scankey today -- so there are several ways in which
the partition identifier really would be much more like a user column
than tid/scantid ever was.
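A minimal sketch of the predicate pushdown idea, in C. It assumes a global index entry that carries a partition identifier next to the user key and the TID. A qual on that identifier (for example, the set of partitions that survive pruning) is evaluated against the index entry itself, so entries for excluded partitions never lead to a table access. Every name here (SketchEntry, SketchPartQual, sketch_scan_with_pushdown) is hypothetical. The linear scan stands in for a real index descent, and this is not how PostgreSQL represents index tuples or quals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical global index entry: user key, partition id, and TID. */
typedef struct SketchEntry
{
    int32_t  key;     /* user-visible key column */
    uint32_t partid;  /* hypothetical partition identifier column */
    uint32_t block;   /* TID: block number within the partition */
    uint16_t offset;  /* TID: offset within the block */
} SketchEntry;

/* A qual on the partition identifier, e.g. partitions left after pruning. */
typedef struct SketchPartQual
{
    const uint32_t *allowed;
    size_t          nallowed;
} SketchPartQual;

static bool
sketch_partqual_matches(const SketchPartQual *qual, uint32_t partid)
{
    for (size_t i = 0; i < qual->nallowed; i++)
    {
        if (qual->allowed[i] == partid)
            return true;
    }
    return false;
}

/*
 * Collect entries whose key equals search_key and whose partition
 * identifier passes the qual.  The qual is checked on the index entry,
 * so filtered-out entries never cause a table visit.  Returns the number
 * of entries written to out (at most outcap).
 */
static size_t
sketch_scan_with_pushdown(const SketchEntry *entries, size_t nentries,
                          int32_t search_key, const SketchPartQual *qual,
                          const SketchEntry **out, size_t outcap)
{
    size_t nout = 0;

    assert(qual != NULL);
    for (size_t i = 0; i < nentries; i++)
    {
        if (entries[i].key != search_key)
            continue;
        if (!sketch_partqual_matches(qual, entries[i].partid))
            continue;       /* filtered inside the index */
        if (nout == outcap)
            break;          /* caller's buffer is full */
        out[nout++] = &entries[i];
    }
    return nout;
}
```

The point of the sketch is only that the partition identifier behaves like an ordinary column for filtering purposes, whereas the TID is never the subject of a user-level qual.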

The TID is a key column for most purposes as of Postgres 12 (at least
internally). That didn't break unique indexes, despite the fact that
duplicates of a user key now have distinct values in the TID key
column! Insertions that must call _bt_check_unique() can deal with the
issue directly, by temporarily unsetting scantid.
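A simplified sketch of the scantid idea in C, assuming a single int32 key column and a (block, offset) TID. The TID acts as a trailing tiebreaker key column only when scantid is set; a uniqueness check leaves it unset, so all duplicates of the user key compare as equal. The struct and function names are hypothetical stand-ins, not nbtree's BTScanInsert or _bt_compare(). The real _bt_check_unique() also has to check each duplicate's visibility, which this sketch omits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified TID: heap block number plus offset. */
typedef struct SketchTid
{
    uint32_t block;
    uint16_t offset;
} SketchTid;

/* Hypothetical index tuple: one user key column plus the TID. */
typedef struct SketchTuple
{
    int32_t   key;
    SketchTid tid;   /* treated as a trailing key column */
} SketchTuple;

/* Hypothetical search key; scantid == NULL means "TID not part of search". */
typedef struct SketchScanKey
{
    int32_t          key;
    const SketchTid *scantid;
} SketchScanKey;

static int
tid_cmp(const SketchTid *a, const SketchTid *b)
{
    if (a->block != b->block)
        return a->block < b->block ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/*
 * Compare the user key first, then the TID as a tiebreaker, but only
 * when scantid is set.  With scantid unset, every duplicate of the user
 * key compares as equal.
 */
static int
sketch_compare(const SketchScanKey *skey, const SketchTuple *tup)
{
    assert(skey != NULL && tup != NULL);
    if (skey->key != tup->key)
        return skey->key < tup->key ? -1 : 1;
    if (skey->scantid == NULL)
        return 0;
    return tid_cmp(skey->scantid, &tup->tid);
}

/* Uniqueness check: temporarily "unset scantid" and look for any match. */
static bool
sketch_has_conflict(const SketchTuple *tuples, size_t ntuples, int32_t key)
{
    SketchScanKey skey = {key, NULL};

    for (size_t i = 0; i < ntuples; i++)    /* stands in for a descent */
    {
        if (sketch_compare(&skey, &tuples[i]) == 0)
            return true;
    }
    return false;
}
```

With scantid set, a new tuple has exactly one correct position among its duplicates; with it unset, the same comparator answers the coarser question a uniqueness check needs.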

We can easily do roughly the same thing here: be slightly creative
about how we interpret whether or not the partition identifier is
"just another key column" across each context. This is also similar to
the way the implementation is slightly creative about NULL values,
which are not equal to any other value to the user, but are
nevertheless just another value from the domain of indexed values to
the nbtree implementation. Cleverly defining the semantics of keys to
get better performance and to avoid the need for special-case code is
more or less a standard technique.
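The NULL analogy can be sketched the same way, with hypothetical names and a single int32 column. Inside the index, NULL is just another value in a total order: it sorts after all non-NULL values (as with the default NULLS LAST for an ascending column) and compares equal to other NULLs. User-visible equality, by contrast, never treats NULL as equal to anything. In this sketch a uniqueness conflict requires both, which is why two NULLs never conflict in a unique index under the default behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical nullable int32 datum. */
typedef struct SketchDatum
{
    bool    isnull;
    int32_t value;   /* meaningless when isnull is true */
} SketchDatum;

/*
 * Index-internal ordering: a total order over the whole domain, NULL
 * included.  NULLs sort after every non-NULL value and compare equal to
 * each other, so they can be placed and found like any other key.
 */
static int
sketch_index_cmp(SketchDatum a, SketchDatum b)
{
    if (a.isnull || b.isnull)
    {
        if (a.isnull && b.isnull)
            return 0;
        return a.isnull ? 1 : -1;
    }
    if (a.value != b.value)
        return a.value < b.value ? -1 : 1;
    return 0;
}

/*
 * User-visible equality: NULL is never equal to anything, not even
 * another NULL (SQL yields "unknown"; this sketch returns false).
 */
static bool
sketch_sql_equal(SketchDatum a, SketchDatum b)
{
    if (a.isnull || b.isnull)
        return false;
    return a.value == b.value;
}

/* A conflict needs the same index position AND user-visible equality. */
static bool
sketch_unique_conflict(SketchDatum existing, SketchDatum incoming)
{
    return sketch_index_cmp(existing, incoming) == 0 &&
           sketch_sql_equal(existing, incoming);
}
```

The index code never needs a special case for NULL when searching or inserting; only the narrow place where user-visible semantics matter consults the second definition.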

> In the whole database, that would be the OID of the table + the TID as
> supplied by the table.
>
> As such, the identifier of the logical row (which can be called the
> TID), as stored in index tuples in global indexes, would need to
> consist of the TableAM supplied TID + the (local) id of the table
> containing the tuple.

2 points:

1. Clearly you need to use the partition identifier with the TID in
order to look up the version in the table -- you need to use both
together in global indexes. But it can still work in much the same way
as it would in a standard index -- it's just that you handle that
extra detail as well. That's what I meant by additive.

2. If a TID points to a version of a row (or whatever you want to call
the generalized version of a HOT chain -- almost the same thing), then
of course you can always map it back to the logical row. That must
always be true. It is equally true within a global index.
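Point 1 can be sketched as follows, assuming a hypothetical 32-bit partition identifier stored beside an unchanged, fixed-width TID. The comparison simply gains one more column between the user key and the TID, and a fetch first uses the partition identifier to pick the partition, then uses the TID exactly as a standard index would. Point 2 follows directly: the (partition identifier, TID) pair still names exactly one version. None of these names come from PostgreSQL; the linear searches stand in for a catalog lookup and a table fetch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-width TID, unchanged from the non-global case. */
typedef struct SketchTid
{
    uint32_t block;
    uint16_t offset;
} SketchTid;

/* Hypothetical global index entry: the extra detail is purely additive. */
typedef struct SketchGlobalEntry
{
    int32_t   key;     /* user key column */
    uint32_t  partid;  /* hypothetical partition identifier */
    SketchTid tid;     /* TID within that partition */
} SketchGlobalEntry;

/* Hypothetical partition: TIDs of row versions and their payloads. */
typedef struct SketchPartition
{
    uint32_t           partid;
    const SketchTid   *tids;
    const char *const *rows;
    size_t             nrows;
} SketchPartition;

static int
cmp_u32(uint32_t a, uint32_t b)
{
    return (a > b) - (a < b);
}

/* Order by user key, then partition identifier, then TID. */
static int
sketch_global_cmp(const SketchGlobalEntry *a, const SketchGlobalEntry *b)
{
    int c;

    if (a->key != b->key)
        return a->key < b->key ? -1 : 1;
    if ((c = cmp_u32(a->partid, b->partid)) != 0)
        return c;
    if ((c = cmp_u32(a->tid.block, b->tid.block)) != 0)
        return c;
    return cmp_u32(a->tid.offset, b->tid.offset);
}

/* Step 1: pick the partition by partid.  Step 2: use the TID as usual. */
static const char *
sketch_global_fetch(const SketchPartition *parts, size_t nparts,
                    const SketchGlobalEntry *entry)
{
    assert(entry != NULL);
    for (size_t p = 0; p < nparts; p++)
    {
        if (parts[p].partid != entry->partid)
            continue;
        for (size_t r = 0; r < parts[p].nrows; r++)
        {
            if (parts[p].tids[r].block == entry->tid.block &&
                parts[p].tids[r].offset == entry->tid.offset)
                return parts[p].rows[r];
        }
        return NULL;    /* right partition, but no such version */
    }
    return NULL;        /* unknown partition */
}
```

Nothing in this sketch requires the TID itself to become variable-width; the partition identifier is just one more fixed-width column.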

Points 1 and 2 above seem obvious to me...so I think we agree on that
much. I just don't know how you go from here to "we need
variable-width TIDs". In all sincerity, I am confused because to me it
just seems as if you're asserting that it must be necessary to have
variable width TIDs again and again, without ever getting around to
justifying it. Or even trying to.

> Assuming we're in agreement on that part, I
> would think it would be consistent to put this in TID infrastructure,
> such that all indexes that use such new TID infrastructure can be
> defined to be global with only minimal effort.

Abstract definitions can be very useful, but ultimately they're just
tools. They're seldom useful as a starting point in my experience. I
try to start with the reality on the ground, and perhaps arrive at
some kind of abstract model or idea much later.

> ZHeap states that it can implement stable TIDs within limits, as IIRC
> it requires retail index deletion support for all indexes on the
> updated columns of that table.

Whether or not that's true is not at all clear. What is true is that
the prototype version of zheap that we have as of today is notable in
that it more or less allows the moral equivalent of a HOT chain to be
arbitrarily long (or much longer, at least). To the best of my
knowledge there is nothing about retail index tuple deletion in the
design, except perhaps something vague and aspirational.

> I fail to see why this same
> infrastructure could not be used for supporting clustered tables,
> while enforcing the limits that ZHeap only soft-enforces (that is,
> only allowing index AMs that support retail index tuple deletion).

You're ignoring an ocean of complexity here. Principally the need to
implement something like two-phase locking (key value locking) in
indexes to make this work, but also the need to account for how
fundamentally redefining TID breaks things. To say nothing of how this
might affect crash recovery.

> > If it was very clear that there would be *some*
> > significant benefit then the costs might start to look reasonable. But
> > there isn't. "Build it and they will come" is not at all convincing to
> > me.
>
> Clustered tables / Index-oriented Tables are very useful for tables of
> which most columns are contained in the PK, or otherwise are often
> ordered by their PK.

I'm well aware of the fact that clustered index based tables are
sometimes more useful than heap-based tables.

--
Peter Geoghegan

In response to

Re: MaxOffsetNumber for Table AMs at 2021-05-05 22:18:17 from Matthias van de Meent

Responses

Re: MaxOffsetNumber for Table AMs at 2021-05-06 11:10:30 from Matthias van de Meent