Re: making tid and HOTness of UPDATE available to logical decoding plugins

From: Hannu Krosing <hannuk(at)google(dot)com>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Euler Taveira <euler(at)eulerto(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Dilip Kumar <dilipkumarb(at)google(dot)com>, Mihail Nikalayeu <mihailnikalayeu(at)gmail(dot)com>
Subject: Re: making tid and HOTness of UPDATE available to logical decoding plugins
Date: 2025-12-08 15:25:20
Message-ID: CAMT0RQTPgS-31_sP4+xmi_iuzXOw41qHG3QFdNpmOfj6WD_iJg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Dec 8, 2025 at 2:46 PM Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
>
> On Fri, 5 Dec 2025 at 16:50, Hannu Krosing <hannuk(at)google(dot)com> wrote:
> >
> > On Fri, Dec 5, 2025 at 3:58 PM Euler Taveira <euler(at)eulerto(dot)com> wrote:
> > >
> > > On Thu, Dec 4, 2025, at 5:58 PM, Hannu Krosing wrote:
> > > > Please find attached a patch that makes tuple ids and info about
> > > > whether it was a plain or HOT update available to logical decoding
> > > > callbacks.
> > > >
> > >
> > > My first impression was: why do you want to expose an internal information that
> > > is mostly useless for a broader audience? The logical decoding infrastructure
> > > is a general purpose solution for streaming modifications made to Postgres.
> > > Could you elaborate how other consumers (DBMS, data store, ...) would use it?
> >
> > One "other consumer" that came up was possibility to use logical
> > decoding for collecting changes for CREATE INDEX CONCURRENTLY so there
> > would be no need for 2nd pass of CIC to scan the whole table again.
> >
> > I understand that there is already ongoing work to do this with a
> > specialized collector, but that involved some other ugliness, like
> > having to use specialized logging index access methods.
>
> I don't see logical decoding as a viable alternative to any
> indexing-related workload. Creating and maintaining indexes need
> strict limits on their resource usage, and logical decoding is unable
> to give those guarantees:

Are these 'strict limits on their resource usage' defined somewhere so
I could take a look?

> It needs to read WAL, which can be produced
> in approximately arbitrarily large amounts in any period.

If you read the WAL as it is generated, then you are essentially just
filtering an in-memory stream, most of the time just jumping to the
next WAL record.

The upsides of logical-decoding-based collection are:
- you do not need any extra settings and conditions in your index
access methods to do in-index-method collection
- you avoid the hassle of synchronizing collection starts and stops
between all active backends
- you are collecting in a single process, so there is no overhead from
synchronizing between all the backends that capture/log index
insertions
- you can choose between collecting immediately in a background worker
and collecting later by re-reading WAL.

The one upside (?) of in-index capture is that it will naturally
throttle your production workload if capture cannot keep up for some
reason.

> This is further worsened by the index build itself, which will have to write
> out WAL if the relation it's building on also needs to log WAL, which
> then also needs to be parsed and decoded by LR.

Not "parsed and decoded" - just read the (database, tablespace,
relation) triplet, decide "not for me" and jump to the next record.
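
To make that concrete, here is a minimal sketch of such a filter,
assuming a PG16-style XLogReaderState; the helper name is mine, and
looking only at block reference 0 is a simplification, not code from
the patch:

    #include "postgres.h"
    #include "access/xlogreader.h"
    #include "storage/relfilelocator.h"

    /* Hypothetical filter: skip WAL records not touching our relation */
    static bool
    record_is_for_my_rel(XLogReaderState *record, RelFileLocator my_rel)
    {
        RelFileLocator locator;

        /* heap DML records carry their target page as block reference 0 */
        if (!XLogRecGetBlockTagExtended(record, 0, &locator,
                                        NULL, NULL, NULL))
            return false;       /* no block reference at all: skip */

        return RelFileLocatorEquals(locator, my_rel);
    }

Every record failing this test is skipped without decoding its body.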

I am working on this specifically because of huge databases with heavy
production workloads, which by definition generate a huge amount of
WAL.
One of the top goals is to avoid REPLICA IDENTITY FULL, which can
hugely bloat the amount of WAL generated.
The fact that you can turn on REPLICA IDENTITY ROWID with no extra
cost on the write side is a bonus (you can even avoid the small
overhead of writing the primary key record by turning on rowid).

> And lastly, we want
> CIC/RIC to work on all indexes, not just those on logged relations. "A
> table with WAL-logging enabled" cannot be a requirement for CIC.

We can use the easy, straightforward collection method - logical
decoding - when available, and fall back to the complicated method
(collecting inside the index access method) or the resource-intensive
method (re-scanning the whole table) if logical decoding is
unavailable.

> > > It seems a Postgres-centric solution, and you didn't provide strong
> > > arguments in favor of it. How would logical replication take advantage
> > > of such a change? If that's the case, share the pgoutput and logical
> > > replication changes.
> >
> > Having thought about the issue for quite some time, I suddenly
> > discovered that while ctid cannot be used as a permanent enough
> > unique id for foreign keys or anything external, it is unique at any
> > moment in time, making it very much sufficient for logical replication.
> >
> > The high-level idea is to store the source (publisher) ctid value in
> > an extra source_ctid column in the target (subscriber) table; that
> > column will also have a unique index and is of course NOT NULL (as
> > there can by definition be no row without a ctid), so it will form a
> > kind of "replication primary key".
>
> So you're using it as a poor man's id column.

Or a smart man's id column for replication :)

> I understand how you got to this point, but the right solution here
> still is to get the user to specify their own identity column that is
> stable across operations, and not to use the volatile and
> guaranteed-unstable ctid.

The suggestion to "get the user to specify their own identity column"
sounds good in a purely theoretical sense, but can have unacceptable
overheads in practice, especially if a large table started out - often
for a good reason - without a PK or other suitable identity column.

As I said before, the row id does not have to be stable across row
versions for logical replication; it just has to be able to track the
identity of "the row that is changed or deleted", which ctid already
does very well.

You were very worried about extra WAL usage above, but seem to be
oblivious to the huge resource usage of REPLICA IDENTITY FULL (when
used as a poor man's row id and not because the old row data is needed
for some other reason).

When you at some point discover the need for logical replication of a
large table inside a 24/7 production database where you do have
occasional updates - or even frequent updates, just not based on a
unique id - you currently have a few options:

1. add REPLICA IDENTITY FULL
    - this will double the WAL traffic for updates and usually more
than double it for DELETEs (could be 1x or 100x)
    - it can also be REALLY REALLY SLOW to replicate, the worst case
requiring one sequential scan of the whole table for each UPDATE or
DELETE
2. add a primary key column - quite hard to do CONCURRENTLY; it will
have severe disk space and CPU demands, and once it has been added
(which could take up to a few weeks) it will slow down all inserts.
3. implement updates and deletes in a way similar to overlay file
systems, where updates and deletes go into a different table and any
read of the main table needs to join with "the overlay table" for the
current state.

To reiterate - "stable across operations" is not a requirement for
logical replication; the tuple id is "stable enough" for streaming
replicated changes. Think of it as somebody changing the primary key
column at each update - it seems weird, but the updated PK still
uniquely identifies the tuple for the next operation.
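
To make that "moving key" tangible, here is a toy standalone model
(not PostgreSQL code, all names invented) of applying chained UPDATEs
keyed on the previous publisher tid:

    #include <stdio.h>

    /* Toy model: a tid is just (block, offset) */
    typedef struct { unsigned blk, off; } Tid;
    typedef struct { Tid source_ctid; int value; } ReplicaRow;

    /* Apply an UPDATE keyed on the *old* publisher tid, then remember
     * the new one - exactly like a PK that changes on every update */
    static void
    apply_update(ReplicaRow *row, Tid old_tid, Tid new_tid, int new_value)
    {
        if (row->source_ctid.blk == old_tid.blk &&
            row->source_ctid.off == old_tid.off)
        {
            row->value = new_value;
            row->source_ctid = new_tid;     /* the key moves forward */
        }
    }

    int main(void)
    {
        ReplicaRow row = {{0, 1}, 42};

        apply_update(&row, (Tid){0, 1}, (Tid){0, 2}, 43); /* 1st UPDATE */
        apply_update(&row, (Tid){0, 2}, (Tid){7, 5}, 44); /* chained   */

        /* prints: value=44 source_ctid=(7,5) */
        printf("value=%d source_ctid=(%u,%u)\n",
               row.value, row.source_ctid.blk, row.source_ctid.off);
        return 0;
    }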

> As I also said in my other mail, adding
> ctid to the logical replication system will expose too much internal
> information and will turn current logical no-ops into logical
> operations;

Can you provide an example of this?

> possibly even bloating the subscriber by a good deal more
> than what the publisher cleaned up.

The absolute biggest bloater is REPLICA IDENTITY FULL.

The beauty of using REPLICA IDENTITY ROWID is that absolutely *nothing
extra needs to be added to WAL*.

The tids are already there even for physical replication (and even for
wal_level=minimal), as they are required for crash recovery. All my
patch to core does is expose them to the logical decoding mechanism.
You do not have to use them; nothing changes for decoding plugins that
do not use them.
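
For a plugin that does want them, consumption could look roughly like
the sketch below; the callback signature is the existing output plugin
API, but the fields old_tid, new_tid and is_hot_update are my guesses
at the shape of the addition, not the patch's actual names:

    #include "postgres.h"
    #include "replication/logical.h"
    #include "replication/reorderbuffer.h"
    #include "storage/itemptr.h"

    static void
    my_change_cb(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                 Relation relation, ReorderBufferChange *change)
    {
        if (change->action != REORDER_BUFFER_CHANGE_UPDATE)
            return;

        /* old_tid/new_tid/is_hot_update are assumed field names */
        elog(DEBUG1, "UPDATE (%u,%u) -> (%u,%u), %s",
             ItemPointerGetBlockNumber(&change->data.tp.old_tid),
             ItemPointerGetOffsetNumber(&change->data.tp.old_tid),
             ItemPointerGetBlockNumber(&change->data.tp.new_tid),
             ItemPointerGetOffsetNumber(&change->data.tp.new_tid),
             change->data.tp.is_hot_update ? "HOT" : "non-HOT");
    }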

---
Best Regards
Hannu
