Quick Links

Re: [PATCH] Fix replica identity mismatch for partitioned tables with publish_via_partition_root

From:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To:	Mikhail Kharitonov <mikhail(dot)kharitonov(dot)dev(at)gmail(dot)com>
Cc:	pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: [PATCH] Fix replica identity mismatch for partitioned tables with publish_via_partition_root
Date:	2026-01-07 02:14:44
Message-ID:	CAD21AoD42cJ61nKjriDuX0K-iZWK6xXA2w35NBK_jNN9qKmW8w@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Mon, May 5, 2025 at 1:56 AM Mikhail Kharitonov
<mikhail(dot)kharitonov(dot)dev(at)gmail(dot)com> wrote:
>
> Hi hackers,
>
> An inconsistency was observed when using logical replication on partitioned
> tables with the option `publish_via_partition_root = true`: if REPLICA IDENTITY
> FULL is set only on the parent table, but not on all partitions, logical
> decoding emits UPDATE and DELETE messages with tag 'O' (old tuple) even for
> partitions that do not have full replica identity. In those cases, only the
> primary key columns are included in the message, which contradicts the expected
> meaning of 'O' and violates the logical replication message protocol:
>
> https://www.postgresql.org/docs/current/protocol-logicalrep-message-formats.html
>
> This can cause issues in downstream consumers, which interpret
> the 'O' tag as implying that a full tuple is present.
>
> The attached patch resolves the inconsistency by selecting the correct tuple
> type ('O' vs 'K') based on the replica identity of the actual leaf relation
> being published, rather than using the setting of the root relation alone.
> As a result, the format of logical replication messages aligns with
> the semantics
> defined by the protocol.
>
> Steps to reproduce:
>
> 1. Create a partitioned table with REPLICA IDENTITY FULL on the parent
> and only one of the partitions.
>
> 2. Create a publication with `publish_via_partition_root = true`.
>
> 3. Perform INSERT, UPDATE, DELETE operations through the root table.
>
> 4. Observe via `pg_recvlogical` that for a partition without full replica
> identity, the logical replication stream contains 'O' records with
> only key fields.

I tested this scenario but what I've seen in my env is somewhat
different from the above analysis; pgoutput plugin writes 'O' records
as you mentioned, but it doesn't omit non-key fields, but writes NULL
as non-key fields. Here are my reproducible steps:

create table p (a int not null, b int) partition by list (a);
create table c1 partition of p for values in (1);
create table c2 partition of p for values in (2);
create unique index on c2 (a);
alter table p replica identity full;
alter table c1 replica identity full;
alter table c2 replica identity using INDEX c2_a_idx ;
insert into p values (1, 10), (2, 20);
create publication pub for all tables with (publish_via_partition_root
= 'true');
select pg_create_logical_replication_slot('sub', 'pgoutput');
delete from p where a = 1;
delete from p where a = 2;
select encode(data, 'escape') from
pg_logical_slot_peek_binary_changes('sub', null, null,
'proto_version', '1', 'publication_names', 'pub');

The last pg_logical_slot_peek_binary_changes() writes the two 'D'
(delete) messages:

1. D\000\000(at)\000O\000\x02t\000\000\000\x011t\000\000\000\x0210
2. D\000\000(at)\000O\000\x02t\000\000\000\x012n

What we can know from these messages are:

- Both messages have 'O'.
- Both messages have two columns ('\000\x02').
- The first message has: the first column '1' (length is 1
('\000\000\000\x01')), and the second column '10' (length is 2
('\000\000\000\x02')).
- The second message has: the first column '2', and the second column
NULL ('n').

From these facts, I guess there could be problematic cases even in the
native logical replication. Here are reproducible steps:

-- Publisher
create table p (a int not null, b int) partition by list (a);
create table c1 partition of p for values in (1);
create table c2 partition of p for values in (2);
create unique index on c2 (a);
alter table p replica identity full;
alter table c1 replica identity full;
alter table c2 replica identity using INDEX c2_a_idx ;
insert into p values (1, 10), (2, 20);
create publication pub for all tables with (publish_via_partition_root
= 'true');

-- Subscriber
create table p (a int, b int, c int);
create subscription sub connection 'dbname=postgres port=5551' publication pub;

-- Publisher
delete from p where a = 1; -- generate a message 'DELETE (1, 10)'
delete from p where a = 2; -- generate a message 'DELETE (2, NULL)'

The second delete message cannot find the tuple on the subscriber, so
the table contents are now inconsistent between the publisher and the
subscriber. I need more investigation to verify that it's a problem,
but this behavior doesn't change even with the proposed change.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

[PATCH] Fix replica identity mismatch for partitioned tables with publish_via_partition_root at 2025-05-05 08:18:18 from Mikhail Kharitonov

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Etsuro Fujita	2026-01-07 02:18:04	Re: Import Statistics in postgres_fdw before resorting to sampling.
Previous Message	Jeff Davis	2026-01-07 01:05:23	Re: small cleanup of ICU includes