Re: MaxOffsetNumber for Table AMs

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MaxOffsetNumber for Table AMs
Date: 2021-05-06 11:10:30
Message-ID: CAEze2WgpLcMNWjLuNGhbcKwp2LeEfO9QhGc4T7KbVqi3gzu3jA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 6 May 2021 at 01:22, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Wed, May 5, 2021 at 3:18 PM Matthias van de Meent
> <boekewurm+postgres(at)gmail(dot)com> wrote:
> > I believe that the TID is the unique identifier of that tuple, within context.
> >
> > For normal indexes, the TID as supplied directly by the TableAM is
> > sufficient, as the context is that table.
> > For global indexes, this TID must include enough information to relate
> > it to the table the tuple originated from.
>
> Clearly something like a partition identifier column is sometimes just
> like a regular user-visible column, though occasionally not like one
> -- whichever is useful to the implementation in each context. For
> example, we probably want to do predicate pushdown, maybe with real
> cataloged operators that access the column like any other user-created
> column (the optimizer knows about the column, which even has a
> pg_attribute entry). Note that we only ever access the TID column
> using an insertion scankey today -- so there are several ways in which
> the partition identifier really would be much more like a user column
> than tid/scantid ever was.
>
> The TID is a key column for most purposes as of Postgres 12 (at least
> internally). That didn't break all unique indexes due to the existence
> of non-unique TIDs across duplicates! Insertions that must call
> _bt_check_unique() can deal with the issue directly, by temporarily
> unsetting scantid.
>
> We can easily do roughly the same thing here: be slightly creative
> about how we interpret whether or not the partition identifier is
> "just another key column" across each context. This is also similar to
> the way the implementation is slightly creative about NULL values,
> which are not equal to any other value to the user, but are
> nevertheless just another value from the domain of indexed values to
> the nbtree implementation. Cleverly defining the semantics of keys to
> get better performance and to avoid the need for special case code is
> more or less a standard technique.
>
> > In the whole database, that would be the OID of the table + the TID as
> > supplied by the table.
> >
> > As such, the identifier of the logical row (which can be called the
> > TID), as stored in index tuples in global indexes, would need to
> > consist of the TableAM supplied TID + the (local) id of the table
> > containing the tuple.
>
> 2 points:
>
> 1. Clearly you need to use the partition identifier with the TID in
> order to look up the version in the table -- you need to use both
> together in global indexes. But it can still work in much the same way
> as it would in a standard index -- it's just that you handle that
> extra detail as well. That's what I meant by additive.
>
> 2. If a TID points to a version of a row (or whatever you want to call
> the generalized version of a HOT chain -- almost the same thing), then
> of course you can always map it back to the logical row. That must
> always be true. It is equally true within a global index.
>
> Points 1 and 2 above seem obvious to me...so I think we agree on that
> much.

Yes.

> I just don't know how you go from here to "we need
> variable-width TIDs". In all sincerity, I am confused because to me it
> just seems as if you're asserting that it must be necessary to have
> variable width TIDs again and again, without ever getting around to
> justifying it. Or even trying to.

See below. I'm not saying we need it _right now_, but at the very
least I'd like to argue that we should not close the door on varlena
TIDs, because there _are_ reasons for those TID types. See also below.

> > Assuming we're in agreement on that part, I
> > would think it would be consistent to put this in TID infrastructure,
> > such that all indexes that use such new TID infrastructure can be
> > defined to be global with only minimal effort.
>
> Abstract definitions can be very useful, but ultimately they're just
> tools. They're seldom useful as a starting point in my experience. I
> try to start with the reality on the ground, and perhaps arrive at
> some kind of abstract model or idea much later.

Although I agree that abstract definitions are tools, I disagree that
they're seldom useful as a starting point. Many have implemented b
(plus) trees based on the original paper exactly due to the guarantees
that are provided by the abstract definition as defined in the paper.
When trying to create completely new things I would agree that
starting with the abstract is probably not the right idea, but we're
not in a green field. There are many examples for storage engines /
table AMs outside PostgreSQL, and I believe that we can take learnings
from those for the potential extendability of PostgreSQL.

> > ZHeap states that it can implement stable TIDs within limits, as IIRC
> > it requires retail index deletion support for all indexes on the
> > updated columns of that table.
>
> Whether or not that's true is not at all clear. What is true is that
> the prototype version of zheap that we have as of today is notable in
> that it more or less allows the moral equivalent of a HOT chain to be
> arbitrarily long (or much longer, at least). To the best of my
> knowledge there is nothing about retail index tuple deletion in the
> design, except perhaps something vague and aspirational.

I'm fairly certain that ZHeap's todo list item no. 1 details retail
index tuple deletion as I understand it:

"Delete marking in indexes: This will allow inplace updates even when
index columns are updated and additionally with this we can avoid the
need for a dedicated vacuum process to perform retail deletes."

If I read this correctly, this means asking the index AM to delete the
specific index tuple corresponding with the deleted item AKA retail
index tuple deletion.

> > I fail to see why this same
> > infrastructure could not be used for supporting clustered tables,
> > while enforcing these limits only soft enforced in ZHeap (that is,
> > only allowing index AMs that support retail index tuple deletion).
>
> You're ignoring an ocean of complexity here. Principally the need to
> implement something like two-phase locking (key value locking) in
> indexes to make this work, but also the need to account for how
> fundamentally redefining TID breaks things. To say nothing of how this
> might affect crash recovery.

An ocean of complexity that is (to the best of my knowledge) to be
mostly delegated to the TableAM that implements these varlena TIDs.
I'm not telling you that varlena TIDs are a one-size-fits-all, but I
want to make clear that varlena TIDs are required for certain table
AMs that would otherwise be impossible.

> > > If it was very clear that there would be *some*
> > > significant benefit then the costs might start to look reasonable. But
> > > there isn't. "Build it and they will come" is not at all convincing to
> > > me.
> >
> > Clustered tables / Index-oriented Tables are very useful for tables of
> > which most columns are contained in the PK, or otherwise are often
> > ordered by their PK.
>
> I'm well aware of the fact that clustered index based tables are
> sometimes more useful than heap-based tables.

Then would you also agree that for persistently clustered tables to
work in any form of efficiency, they would need to support variable
length identifiers? Let me explain my reasoning:

If the TableAM TID contains information about it's physical location,
we cannot implement persistent clustered tables due to instability of
this physical location of the base tuple (see also how we cannot
guarantee the physical location of an index tuple in the index, only
the logical location).
That being the case, the TID must be fully logical. I hope you agree
at least up to here.

The TID must order in the same manner as the table's clustering, as
you would want to be able to find the tuple from the table, and the
only guarantee you have is that the table has an ordering. If the TID
does not have the same ordering, you cannot find the correct tuple (or
you'd need a TID -> key mapping, which would mean you'd lose a lot of
the benefits you got from the clustering).

If the TID must share an ordering with the clustering of the table,
then it should also have a similar key space as the clustering: I
could cluster a table on a text column, but any fixed size TID cannot
ever dream to be used to correctly identify the position of a certain
text string within an arbitrary sorted list of text strings. You'll
have to use that text string to find it's location in the list. As
such, I believe that the tuple identifier (or also: the identifier
using which we can locate the tuple data corresponding to that
identifier) for clustered tables / index oriented tables is at the
very least the set of columns that that table is clustered on.

Do note that this is all about the TID supplied by a TableAM that
would implement persistent clustering, which is how MySQL has
implemented tables. I am unaware of a scheme that would maintain
persistent clustering/ordering across the table on key columns without
also leaking these key columns into the identifier that is used to
find the tuple in the table.

With regards,

Matthias van de Meent

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Luzanov 2021-05-06 11:22:56 Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY
Previous Message Robert Haas 2021-05-06 11:05:46 Re: [bug?] Missed parallel safety checks, and wrong parallel safety