| From: | Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> |
|---|---|
| To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
| Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Proposal: Conflict log history table for Logical Replication |
| Date: | 2026-06-29 10:36:10 |
| Message-ID: | CANhcyEVf-Yscauc6S6bPikkX7NN8pF3gKDo-szbnEOxeHRwsHg@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Sat, 27 Jun 2026 at 08:03, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Fri, Jun 26, 2026 at 3:40 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Thu, Jun 25, 2026 at 4:57 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Thu, Jun 25, 2026 at 10:34 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> > > >
> > > > On Wed, 24 Jun 2026 at 19:27, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > > >
> > > >
> > > > 2) jsonb supports indexing, whereas json does not. Was json chosen
> > > > here because inserts are faster due to the lack of parsing and binary
> > > > conversion?
> > > >
> > >
> > > I think it is more because the local/remote tuples are usually data
> > > which we want users to see as it was in original tables, using jsonb
> > > can change the ordering of columns or remove some space etc. We can
> > > add a comment on the lines of: "preserve the exact tuple
> > > representation (column order, formatting) for the audit record; the
> > > value is natively json so this avoids a per-conflict conversion". Now,
> > > this is true for tuple data but I am not so sure if the same thing
> > > applies to the replica_identity column in CLT. Would users ever want
> > > to query using ORDER/GROUP BY on replica_identity or use DISTINCT on
> > > it, because if that is possible then the current schema would result
> > > into ERROR as follows:
> > > postgres=# select * from pg_conflict.pg_conflict_log_16394 order by
> > > replica_identity;
> > > ERROR: could not identify an ordering operator for type json
> > >
> > > This needs more thoughts from the perspective of how users would like
> > > to fetch the data.
> > >
> >
> > The values stored for replica_identity are heterogeneous across the
> > table (different key shapes per relid), and for the RI-FULL case it
> > isn't even a stable identifier, two conflicts on the "same row" can
> > carry different whole-tuple JSON. So, I don't see a case for creating
> > an index on the RI column as well and rather users would like to see
> > the exact value used to find the local tuple. So, we can add a comment
> > on the lines: "The tuple/key columns (remote_tuple, replica_identity,
> > local_conflicts) are typed json rather than jsonb on purpose: they
> > hold an exact audit snapshot of the applied tuples and replica
> > identity, and json preserves the verbatim representation whereas jsonb
> > would normalize it. Indexing them (jsonb's main advantage) wouldn't
> > help anyway, as the conflict log is looked up by its scalar columns
> > (relid, conflict_type, commit timestamp) while these JSON columns are
> > per-conflict payload to inspect, not search keys."
>
> I have added this comments.
>
> > One more thing that bothers me is that it will be inconvenient for
> > users to identify whether logged values for replica_identity are for
> > an index or a complete tuple via replica identity full, so I propose
> > to add a boolean column replica_identity_full. I also considered to
> > add replica_identity_index but preferred a boolean instead because
> > this is a historical/audit row, and the index may be dropped or
> > replaced later, leaving a meaningless OID which is opposite of what an
> > audit record should do.
>
> I have added the column replica_identity_full and ordered the column as follows:
>
> static const ConflictLogColumnDef ConflictLogSchema[] = {
> <----------Other columns---------------->
> { .attname = "replica_identity_full", .atttypid = BOOLOID },
> { .attname = "replica_identity", .atttypid = JSONOID },
> { .attname = "remote_tuple", .atttypid = JSONOID },
> { .attname = "local_conflicts", .atttypid = JSONARRAYOID }
> };
>
> > The other question is whether we create an internal index on one of
> > the columns? Currently, there's no established query pattern to index
> > for. The plausible candidates are relid ("conflicts for table X"), or
> > remote_commit_ts (time-based cleanup) but those are guesses. Indexing
> > the JSON columns we've already ruled out (payload, not keys). Picking
> > an index before we know how people actually query/prune the table
> > risks indexing the wrong thing. We can cosider adding the index in
> > later versions or even in PG20 itself if we get some user feedback.
>
> Yeah, I think we can have index once the first patch is committed
> based on the usage pattern.
>
>
Hi Dilip,
I have tested the patches and have some comments:
1. In the error message, when we try to drop a conflict table, it is
referred system catalog:
postgres=# drop table pg_conflict.pg_conflict_log_16384 ;
ERROR: permission denied: "pg_conflict_log_16384" is a system catalog
But for other cases in the error message it is referred as conflict log table:
postgres=# alter table pg_conflict.pg_conflict_log_16384 set schema public;
ERROR: permission denied: "pg_conflict_log_16384" is a conflict log table
DETAIL: Conflict log tables are system-managed tables for logical
replication conflicts.
I think in the DROP case we should also refer to it as 'conflict log
table' instead of 'system catalog'?
Or Am I missing something here?
2. TRUNCATE and DELETE for conflict log tables are published when
outplugin is 'test_decoding':
Suppose conflict_log_tables have some entries then DELETE/TRUNCATE operations on
the conflict log tables are published. I have tested it with
replication slots with test_decoding plugin.
postgres=# truncate table pg_conflict.pg_conflict_log_16395;
TRUNCATE TABLE
postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL);
lsn | xid | data
------------+-----+---------------------------------------------------------------
0/01792290 | 717 | BEGIN 717
0/01792AD0 | 717 | table pg_conflict.pg_conflict_log_16395: TRUNCATE:
(no-flags)
0/01792C10 | 717 | COMMIT 717
(3 rows)
Is this expected?
postgres=# DELETE from pg_conflict.pg_conflict_log_16395 ;
DELETE 2
postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL);
lsn | xid |
data
------------+-----+------------------------------------------------------------------------
-----------------------------
0/017940A0 | 721 | BEGIN 721
0/017940A0 | 721 | table pg_conflict.pg_conflict_log_16395: DELETE:
(no-tuple-data)
0/017940D8 | 721 | table pg_conflict.pg_conflict_log_16395: DELETE:
(no-tuple-data)
0/01794140 | 721 | COMMIT 721
I tested the behaviour for pg_class, and noticed DELETE on pg_class do not get
published
postgres=# delete from pg_class where relname = 't';
DELETE 1
postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL);
lsn | xid | data
------------+-----+------------
0/0181C1D0 | 735 | BEGIN 735
0/0181D430 | 735 | COMMIT 735
(2 rows)
Also, I tested it with 'pgoutput' plugin and it don't emit any changes
postgres=# truncate table pg_conflict.pg_conflict_log_16395;
TRUNCATE TABLE
postgres=# SELECT * FROM pg_logical_slot_get_binary_changes(
'pgoutput_slot', NULL, NULL, 'proto_version', '1',
'publication_names', 'pub2');
lsn | xid | data
-----+-----+------
(0 rows)
postgres=# DELETE from pg_conflict.pg_conflict_log_16395 ;
DELETE 2
postgres=# SELECT * FROM pg_logical_slot_get_binary_changes(
'pgoutput_slot', NULL, NULL, 'proto_version', '1',
'publication_names', 'pub2');
lsn | xid | data
-----+-----+------
(0 rows)
Since conflict log tables are system-managed tables, I expected them
to behave similarly to pg_class with respect to logical decoding.
However, test_decoding emits DELETE and TRUNCATE changes for conflict
log tables. Is this expected?
Thanks,
Shlok Kyal
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Akshay Joshi | 2026-06-29 10:40:47 | Re: [PATCH] Add pg_get_table_ddl() to reconstruct CREATE TABLE statements |
| Previous Message | Henson Choi | 2026-06-29 10:36:05 | Re: CAST(... ON DEFAULT) - WIP build on top of Error-Safe User Functions |