Re: Proposal: Conflict log history table for Logical Replication

From: Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Proposal: Conflict log history table for Logical Replication
Date: 2026-06-29 10:36:10
Message-ID: CANhcyEVf-Yscauc6S6bPikkX7NN8pF3gKDo-szbnEOxeHRwsHg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, 27 Jun 2026 at 08:03, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Fri, Jun 26, 2026 at 3:40 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Thu, Jun 25, 2026 at 4:57 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Thu, Jun 25, 2026 at 10:34 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> > > >
> > > > On Wed, 24 Jun 2026 at 19:27, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > > >
> > > >
> > > > 2) jsonb supports indexing, whereas json does not. Was json chosen
> > > > here because inserts are faster due to the lack of parsing and binary
> > > > conversion?
> > > >
> > >
> > > I think it is more because the local/remote tuples are usually data
> > > which we want users to see as it was in original tables, using jsonb
> > > can change the ordering of columns or remove some space etc. We can
> > > add a comment on the lines of: "preserve the exact tuple
> > > representation (column order, formatting) for the audit record; the
> > > value is natively json so this avoids a per-conflict conversion". Now,
> > > this is true for tuple data but I am not so sure if the same thing
> > > applies to the replica_identity column in CLT. Would users ever want
> > > to query using ORDER/GROUP BY on replica_identity or use DISTINCT on
> > > it, because if that is possible then the current schema would result
> > > into ERROR as follows:
> > > postgres=# select * from pg_conflict.pg_conflict_log_16394 order by
> > > replica_identity;
> > > ERROR: could not identify an ordering operator for type json
> > >
> > > This needs more thoughts from the perspective of how users would like
> > > to fetch the data.
> > >
> >
> > The values stored for replica_identity are heterogeneous across the
> > table (different key shapes per relid), and for the RI-FULL case it
> > isn't even a stable identifier, two conflicts on the "same row" can
> > carry different whole-tuple JSON. So, I don't see a case for creating
> > an index on the RI column as well and rather users would like to see
> > the exact value used to find the local tuple. So, we can add a comment
> > on the lines: "The tuple/key columns (remote_tuple, replica_identity,
> > local_conflicts) are typed json rather than jsonb on purpose: they
> > hold an exact audit snapshot of the applied tuples and replica
> > identity, and json preserves the verbatim representation whereas jsonb
> > would normalize it. Indexing them (jsonb's main advantage) wouldn't
> > help anyway, as the conflict log is looked up by its scalar columns
> > (relid, conflict_type, commit timestamp) while these JSON columns are
> > per-conflict payload to inspect, not search keys."
>
> I have added this comments.
>
> > One more thing that bothers me is that it will be inconvenient for
> > users to identify whether logged values for replica_identity are for
> > an index or a complete tuple via replica identity full, so I propose
> > to add a boolean column replica_identity_full. I also considered to
> > add replica_identity_index but preferred a boolean instead because
> > this is a historical/audit row, and the index may be dropped or
> > replaced later, leaving a meaningless OID which is opposite of what an
> > audit record should do.
>
> I have added the column replica_identity_full and ordered the column as follows:
>
> static const ConflictLogColumnDef ConflictLogSchema[] = {
> <----------Other columns---------------->
> { .attname = "replica_identity_full", .atttypid = BOOLOID },
> { .attname = "replica_identity", .atttypid = JSONOID },
> { .attname = "remote_tuple", .atttypid = JSONOID },
> { .attname = "local_conflicts", .atttypid = JSONARRAYOID }
> };
>
> > The other question is whether we create an internal index on one of
> > the columns? Currently, there's no established query pattern to index
> > for. The plausible candidates are relid ("conflicts for table X"), or
> > remote_commit_ts (time-based cleanup) but those are guesses. Indexing
> > the JSON columns we've already ruled out (payload, not keys). Picking
> > an index before we know how people actually query/prune the table
> > risks indexing the wrong thing. We can cosider adding the index in
> > later versions or even in PG20 itself if we get some user feedback.
>
> Yeah, I think we can have index once the first patch is committed
> based on the usage pattern.
>
>
Hi Dilip,

I have tested the patches and have some comments:

1. In the error message, when we try to drop a conflict table, it is
referred system catalog:
postgres=# drop table pg_conflict.pg_conflict_log_16384 ;
ERROR: permission denied: "pg_conflict_log_16384" is a system catalog

But for other cases in the error message it is referred as conflict log table:
postgres=# alter table pg_conflict.pg_conflict_log_16384 set schema public;
ERROR: permission denied: "pg_conflict_log_16384" is a conflict log table
DETAIL: Conflict log tables are system-managed tables for logical
replication conflicts.

I think in the DROP case we should also refer to it as 'conflict log
table' instead of 'system catalog'?
Or Am I missing something here?

2. TRUNCATE and DELETE for conflict log tables are published when
outplugin is 'test_decoding':
Suppose conflict_log_tables have some entries then DELETE/TRUNCATE operations on
the conflict log tables are published. I have tested it with
replication slots with test_decoding plugin.

postgres=# truncate table pg_conflict.pg_conflict_log_16395;
TRUNCATE TABLE
postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL);
lsn | xid | data
------------+-----+---------------------------------------------------------------
0/01792290 | 717 | BEGIN 717
0/01792AD0 | 717 | table pg_conflict.pg_conflict_log_16395: TRUNCATE:
(no-flags)
0/01792C10 | 717 | COMMIT 717
(3 rows)

Is this expected?

postgres=# DELETE from pg_conflict.pg_conflict_log_16395 ;
DELETE 2
postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL);
lsn | xid |
data

------------+-----+------------------------------------------------------------------------
-----------------------------
0/017940A0 | 721 | BEGIN 721
0/017940A0 | 721 | table pg_conflict.pg_conflict_log_16395: DELETE:
(no-tuple-data)
0/017940D8 | 721 | table pg_conflict.pg_conflict_log_16395: DELETE:
(no-tuple-data)
0/01794140 | 721 | COMMIT 721

I tested the behaviour for pg_class, and noticed DELETE on pg_class do not get
published
postgres=# delete from pg_class where relname = 't';
DELETE 1
postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL);
lsn | xid | data
------------+-----+------------
0/0181C1D0 | 735 | BEGIN 735
0/0181D430 | 735 | COMMIT 735
(2 rows)

Also, I tested it with 'pgoutput' plugin and it don't emit any changes
postgres=# truncate table pg_conflict.pg_conflict_log_16395;
TRUNCATE TABLE
postgres=# SELECT * FROM pg_logical_slot_get_binary_changes(
'pgoutput_slot', NULL, NULL, 'proto_version', '1',
'publication_names', 'pub2');
lsn | xid | data
-----+-----+------
(0 rows)

postgres=# DELETE from pg_conflict.pg_conflict_log_16395 ;
DELETE 2
postgres=# SELECT * FROM pg_logical_slot_get_binary_changes(
'pgoutput_slot', NULL, NULL, 'proto_version', '1',
'publication_names', 'pub2');
lsn | xid | data
-----+-----+------
(0 rows)

Since conflict log tables are system-managed tables, I expected them
to behave similarly to pg_class with respect to logical decoding.
However, test_decoding emits DELETE and TRUNCATE changes for conflict
log tables. Is this expected?

Thanks,
Shlok Kyal

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Akshay Joshi 2026-06-29 10:40:47 Re: [PATCH] Add pg_get_table_ddl() to reconstruct CREATE TABLE statements
Previous Message Henson Choi 2026-06-29 10:36:05 Re: CAST(... ON DEFAULT) - WIP build on top of Error-Safe User Functions