| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com> |
| Cc: | Euler Taveira <euler(at)eulerto(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Dilip Kumar <dilipkumarb(at)google(dot)com>, Mihail Nikalayeu <mihailnikalayeu(at)gmail(dot)com> |
| Subject: | Re: making tid and HOTness of UPDATE available to logical decoding plugins |
| Date: | 2025-12-09 20:08:35 |
| Message-ID: | CAMT0RQSXoqkNOG_MXuXaMM+WAdmdNqy9-YQzntsrpLY1oc087w@mail.gmail.com |
| Lists: | pgsql-hackers |
On Tue, Dec 9, 2025 at 7:41 PM Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
>
> On Mon, 8 Dec 2025 at 16:25, Hannu Krosing <hannuk(at)google(dot)com> wrote:
> >
> ...
> > The suggestion to "let the user specify their own identity column" sounds
> > good in a purely theoretical sense, but can have unacceptable overheads
> > in practice, especially if a large table started out - often for a
> > good reason - without a PK or other suitable identity column.
>
> I don't think that this user problem is something we should be making
> our problem; at least not like this.
I assume you don't have to deal with real-world users much :)
And we had similar objections to all the CONCURRENTLY commands - why on
earth would anyone want a slower version of CREATE INDEX?
I see again and again real user problems taking ages to address. For
example, when I sent a patch in 2005 making it possible to run more
than one VACUUM concurrently and actually clean up tables, it lingered
for about a year before I got Alvaro to put it in at the PostgreSQL
Anniversary Summit in July 2006.
> > As I said before, the row id does not have to be stable across row
> > versions for logical replication, it just has to be able to track the
> > identity of "the row that is changed or deleted" which ctid already
> > does very well.
>
> Yes, I'm aware. But a TID doesn't really mean anything if you don't
> have indexes - its value may change arbitrarily between queries.
> Anything may happen to a tuple's CTID, as long as 1.) within a statement,
> the CTID doesn't change, and 2.) indexed TIDs won't have changes to
> the attributes represented in indexes.
Can't parse that :(
Can you perhaps elaborate?
> > You were very worried about extra WAL usage above, but seem to be
> > oblivious to the huge resource usage of REPLICA IDENTITY FULL (when
> > used as a poor man's row id and not because the old row data is
> > needed for some other reason).
>
> I don't generally suggest people run their systems with
> wal_level=logical and try to avoid those systems that have, exactly
> because of issues like the significant additional overhead involved
> with logging page-local update records.
But real-world PostgreSQL users do use logical replication and logical
decoding for CDC all the time.
And logical replication is currently the only way to do no-downtime
major version upgrades.
> > When you at some point discover the need for logical replication of a
> > large table inside a 24/7 production database where you do have
> > occasional updates - or even frequent updates, just not based on
> > unique id - you currently have a few options.
> >
> > 1. add REPLICA IDENTITY FULL
> > - this will double the WAL traffic for updates and usually more
> > than double for DELETEs (could be 1x or 100x)
> > - it can also be REALLY REALLY SLOW to replicate, the worst case
> > requiring 1 sequential scan of the whole table for each UPDATE or
> > DELETE
>
> Yep, that's about expected; if you want performant UPDATE with
> OLTP-style databases you have to index your data.
People don't always do that for all tables - think of log tables.
And they may still occasionally need to change them, for example when
a right-to-be-forgotten request comes in and they have to delete all
log records with a specific unindexed attribute. On the publisher a
single sequential scan is cheap enough to be a good trade-off against
the all-around slowdown an extra index would cause, but if that delete
affects 100,000 rows in a 10 TB table you suddenly get 100,000
sequential scans on the replica.
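
For context, a minimal sketch of how the subscriber locates the row to
change, modelled on FindReplTupleInLocalRel in
src/backend/replication/logical/worker.c (the wrapper name here is mine,
not the actual worker code). With REPLICA IDENTITY FULL and no usable
index, every replicated UPDATE/DELETE takes the sequential-scan branch,
once per affected row:

```c
#include "postgres.h"
#include "executor/executor.h"	/* RelationFindReplTuple* */
#include "nodes/lockoptions.h"	/* LockTupleExclusive */

/*
 * Hedged sketch: with an index on the replica identity columns each
 * incoming change costs one index descent; with REPLICA IDENTITY FULL
 * and no such index it costs one full sequential scan of the table -
 * per replicated row.
 */
static bool
find_repl_tuple(Relation localrel, Oid idxoid,
				TupleTableSlot *remoteslot, TupleTableSlot *localslot)
{
	if (OidIsValid(idxoid))
		return RelationFindReplTupleByIndex(localrel, idxoid,
											LockTupleExclusive,
											remoteslot, localslot);
	else
		return RelationFindReplTupleSeq(localrel, LockTupleExclusive,
										remoteslot, localslot);
}
```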
> > 2. add a primary key column - quite hard to do CONCURRENTLY, will have
> > severe disk space and CPU demands, and once it has been added (which
> > could have taken up to a few weeks) it will slow down all inserts.
>
> It's quite possible. Not trivial, but it is a road that many have
> taken.
Yes, but only because they have no better option than to have a senior
DBA spend a week or two on this.
> And yes, indexing slows down inserts. That's how we make sure
> the index remains correct.
I know *why* it slows down inserts :).
I'm just saying that there are many cases where you don't want this slowdown.
> > To reiterate - "stable across operations" is not a requirement for
> > logical replication, tuple id is "stable enough" for streaming
> > replication changes. Think of it as somebody changing the primary key
> > column at each update - it seems weird, but the updated PK still
> > uniquely identifies the tuple for the next operation.
>
> Yeah, I'm aware of that part.
>
> > > As I also said in my other mail, adding
> > > ctid to the logical replication system will expose too much internal
> > > information and will turn current logical no-ops into logical
> > > operations;
> >
> > Can you provide an example of this?
I wondered whether I should mention CLUSTER and VACUUM FULL in the
question, but they seemed so obvious that I left them out.
When writing it I was in the mindset of multi-terabyte 24/7
high-traffic databases where these things are out of the question
anyway.
...<an example of how CLUSTER works was here> ...
> > > possibly even bloating the subscriber by a good deal more
> > > than what the publisher cleaned up.
> >
> > The absolute biggest bloater is REPLICA IDENTITY FULL.
> >
> > The beauty of using REPLICA IDENTITY ROWID is that absolutely *nothing
> > extra needs to be added to WAL*.
>
> Except for CLUSTER and related operations, which now need to log TID mappings.
I would rather restrict CLUSTER and VACUUM FULL to refuse to run on
tables with REPLICA IDENTITY ROWID while any valid replication slots
exist.
At least until my next proposal for GENERATED ALWAYS AS ROW IDENTITY
gets in (a ROWID which starts out as a virtual column exposing the ctid
cast to bigint, and is materialized only when the row is updated).
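
A rough sketch of what that restriction could look like, say in
cluster.c. The REPLICA_IDENTITY_ROWID constant does not exist today and
the function name is hypothetical; only ReplicationSlotsCountDBSlots()
is the existing API:

```c
#include "postgres.h"
#include "miscadmin.h"			/* MyDatabaseId */
#include "replication/slot.h"	/* ReplicationSlotsCountDBSlots */
#include "utils/rel.h"

/*
 * Hypothetical guard for cluster_rel()/VACUUM FULL; a sketch of the
 * restriction proposed above, not a working patch.
 */
static void
check_rowid_identity_allows_rewrite(Relation rel)
{
	int			nslots,
				nactive;

	if (rel->rd_rel->relreplident != REPLICA_IDENTITY_ROWID)	/* hypothetical */
		return;

	/* logical slots are per-database, so this catches the decoders at risk */
	if (ReplicationSlotsCountDBSlots(MyDatabaseId, &nslots, &nactive))
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("cannot rewrite table \"%s\" with REPLICA IDENTITY ROWID "
						"while logical replication slots exist",
						RelationGetRelationName(rel))));
}
```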
> > The TIDs are already there even for physical replication (and even for
> > wal_level=minimal) as they are required for crash recovery. All my
> > patch to core does is expose them to the logical decoding mechanism.
> > You do not have to use them; nothing changes for decoding plugins that
> > do not use them.
>
> About the patch:
>
> The ItemPointerDatas newly added to ReorderBufferChange.data.tp are
> better stored in {old,new}tuple->t_self, if we're going to store them.
I thought so too, but then found out that it doesn't work, because the
old ctid is present even when oldtuple is not. It is there even for
wal_level=minimal.
Also, keeping them separate keeps the overhead minimal when the
decoding plugin does not need them.
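
To make the layout concrete, here is a sketch of the tp member of
ReorderBufferChange.data with the additions discussed in this thread.
Field names and ordering are illustrative, not necessarily what the
patch uses:

```c
#include "access/htup.h"			/* HeapTuple */
#include "storage/itemptr.h"		/* ItemPointerData */
#include "storage/relfilelocator.h" /* RelFileLocator */

/* Illustrative sketch only; compare reorderbuffer.h for the real layout. */
struct
{
	RelFileLocator rlocator;	/* relation that was changed */
	bool		clear_toast_afterwards;
	HeapTuple	oldtuple;		/* NULL unless the replica identity
								 * (or REPLICA IDENTITY FULL) logged it */
	HeapTuple	newtuple;		/* NULL for DELETE */
	ItemPointerData old_tid;	/* always present for UPDATE/DELETE,
								 * even at wal_level=minimal */
	ItemPointerData new_tid;	/* location of the new tuple version */
	bool		is_hot_update;	/* heap-only-tuple update? */
} tp;
```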
> The size of .data is currently 32 bytes, increasing that by 16 bytes
> is rather wasteful if we already have a more convenient and accurate
> place to store this data.
Since this is just an in-memory structure, I would mainly worry about
going over 64 bytes (the x86-64 cache line size, and likely also one of
palloc's internal allocation steps).
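
If that is the agreed budget, a compile-time guard could document it.
StaticAssertDecl is the existing macro from c.h; the 64-byte limit is
just the assumption stated above:

```c
/* Assumed budget: keep the union within one x86-64 cache line. */
StaticAssertDecl(sizeof(((ReorderBufferChange *) 0)->data) <= 64,
				 "ReorderBufferChange.data should fit in one cache line");
```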
> I see even less reason why logical replication should care about the
> HOT-ness of updates (that's mostly an implementation detail of the
> heap; logically there is and should be no difference between a normal
> update and a HOT update).
> But, if the decoder really did care about those implementation
> details, it would arguably be better to pull that data from the
> tuple's infomasks, as then there wouldn't be a need for the additional
> boolean ReorderBufferChange->data.tp.is_hot_update.
It is there for efficiency and low overhead. It can be made even more
efficient by giving the HOT update its own if() and skipping the second
comparison.
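
Concretely, something along these lines in heap_decode()
(src/backend/replication/logical/decode.c). The extra boolean parameter
to DecodeUpdate() is hypothetical, but XLOG_HEAP_UPDATE and
XLOG_HEAP_HOT_UPDATE are already distinct record types, so the HOT-ness
is known at this point without looking at the infomasks at all:

```c
/* Sketch only: give the HOT case its own branch instead of
 * re-testing the record info bits inside DecodeUpdate(). */
case XLOG_HEAP_UPDATE:
	DecodeUpdate(ctx, buf, false);	/* regular update */
	break;
case XLOG_HEAP_HOT_UPDATE:
	DecodeUpdate(ctx, buf, true);	/* heap-only tuple update */
	break;
```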
> Alternatively, move it
> into an alignment gap, because the current placement (after removing
> the TIDs from data.tp) is rather unfortunate and would add 8 bytes to
> the size of this union.
I have to check; maybe there is a free bit somewhere.
But does it push it over the 64-byte boundary?
My main gripe with the union is that there is one struct member after
the union of differently-sized things, so changing the largest union
member moves that one. Not a big issue, but it would be nice if you did
not have to recompile extensions that do not need the new fields.
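
In miniature, the issue looks like this (simplified from
ReorderBufferChange in src/include/replication/reorderbuffer.h); one way
out would be to move the trailing member in front of the union:

```c
#include "lib/ilist.h"			/* dlist_node */

typedef struct Change
{
	union
	{
		struct
		{
			void	   *oldtuple;
			void	   *newtuple;
		}			tp;			/* largest member */
		/* ... other variants ... */
	}			data;
	dlist_node	node;			/* offsetof(Change, node) shifts whenever
								 * the largest union member grows, breaking
								 * the ABI for extensions compiled against
								 * the old layout */
} Change;
```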
--
Best Regards
Hannu