| From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
|---|---|
| To: | Mikhail Kharitonov <mikhail(dot)kharitonov(dot)dev(at)gmail(dot)com> |
| Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Subject: | Re: [PATCH] Fix replica identity mismatch for partitioned tables with publish_via_partition_root |
| Date: | 2026-01-07 02:14:44 |
| Message-ID: | CAD21AoD42cJ61nKjriDuX0K-iZWK6xXA2w35NBK_jNN9qKmW8w@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Mon, May 5, 2025 at 1:56 AM Mikhail Kharitonov
<mikhail(dot)kharitonov(dot)dev(at)gmail(dot)com> wrote:
>
> Hi hackers,
>
> An inconsistency was observed when using logical replication on partitioned
> tables with the option `publish_via_partition_root = true`: if REPLICA IDENTITY
> FULL is set only on the parent table, but not on all partitions, logical
> decoding emits UPDATE and DELETE messages with tag 'O' (old tuple) even for
> partitions that do not have full replica identity. In those cases, only the
> primary key columns are included in the message, which contradicts the expected
> meaning of 'O' and violates the logical replication message protocol:
>
> https://www.postgresql.org/docs/current/protocol-logicalrep-message-formats.html
>
> This can cause issues in downstream consumers, which interpret
> the 'O' tag as implying that a full tuple is present.
>
> The attached patch resolves the inconsistency by selecting the correct tuple
> type ('O' vs 'K') based on the replica identity of the actual leaf relation
> being published, rather than using the setting of the root relation alone.
> As a result, the format of logical replication messages aligns with
> the semantics
> defined by the protocol.
>
> Steps to reproduce:
>
> 1. Create a partitioned table with REPLICA IDENTITY FULL on the parent
> and only one of the partitions.
>
> 2. Create a publication with `publish_via_partition_root = true`.
>
> 3. Perform INSERT, UPDATE, DELETE operations through the root table.
>
> 4. Observe via `pg_recvlogical` that for a partition without full replica
> identity, the logical replication stream contains 'O' records with
> only key fields.
I tested this scenario but what I've seen in my env is somewhat
different from the above analysis; pgoutput plugin writes 'O' records
as you mentioned, but it doesn't omit non-key fields, but writes NULL
as non-key fields. Here are my reproducible steps:
create table p (a int not null, b int) partition by list (a);
create table c1 partition of p for values in (1);
create table c2 partition of p for values in (2);
create unique index on c2 (a);
alter table p replica identity full;
alter table c1 replica identity full;
alter table c2 replica identity using INDEX c2_a_idx ;
insert into p values (1, 10), (2, 20);
create publication pub for all tables with (publish_via_partition_root
= 'true');
select pg_create_logical_replication_slot('sub', 'pgoutput');
delete from p where a = 1;
delete from p where a = 2;
select encode(data, 'escape') from
pg_logical_slot_peek_binary_changes('sub', null, null,
'proto_version', '1', 'publication_names', 'pub');
The last pg_logical_slot_peek_binary_changes() writes the two 'D'
(delete) messages:
1. D\000\000(at)\000O\000\x02t\000\000\000\x011t\000\000\000\x0210
2. D\000\000(at)\000O\000\x02t\000\000\000\x012n
What we can know from these messages are:
- Both messages have 'O'.
- Both messages have two columns ('\000\x02').
- The first message has: the first column '1' (length is 1
('\000\000\000\x01')), and the second column '10' (length is 2
('\000\000\000\x02')).
- The second message has: the first column '2', and the second column
NULL ('n').
From these facts, I guess there could be problematic cases even in the
native logical replication. Here are reproducible steps:
-- Publisher
create table p (a int not null, b int) partition by list (a);
create table c1 partition of p for values in (1);
create table c2 partition of p for values in (2);
create unique index on c2 (a);
alter table p replica identity full;
alter table c1 replica identity full;
alter table c2 replica identity using INDEX c2_a_idx ;
insert into p values (1, 10), (2, 20);
create publication pub for all tables with (publish_via_partition_root
= 'true');
-- Subscriber
create table p (a int, b int, c int);
create subscription sub connection 'dbname=postgres port=5551' publication pub;
-- Publisher
delete from p where a = 1; -- generate a message 'DELETE (1, 10)'
delete from p where a = 2; -- generate a message 'DELETE (2, NULL)'
The second delete message cannot find the tuple on the subscriber, so
the table contents are now inconsistent between the publisher and the
subscriber. I need more investigation to verify that it's a problem,
but this behavior doesn't change even with the proposed change.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Etsuro Fujita | 2026-01-07 02:18:04 | Re: Import Statistics in postgres_fdw before resorting to sampling. |
| Previous Message | Jeff Davis | 2026-01-07 01:05:23 | Re: small cleanup of ICU includes |