Help required to debug pg_repack breaking logical replication

From: Daniele Varrazzo <daniele(dot)varrazzo(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Help required to debug pg_repack breaking logical replication
Date: 2017-10-07 18:37:27
Message-ID: CA+mi_8YWReON2gVk9qoeJLRVzWTSSoLkKAX9DERL36-n4Y8rZg@mail.gmail.com
Lists: pgsql-hackers

Hello,

We have received reports of pg_repack breaking logical replication,
and I have experienced it myself a couple of times.

- https://github.com/reorg/pg_repack/issues/135
- https://github.com/2ndQuadrant/pglogical/issues/113

In my experience, after the botched run the replication slot was
"stuck": any attempt to read from it (including
pg_logical_slot_peek_changes()) blocked until ctrl-c. I've tried
reproducing the issue, but my first attempts have failed to fail.
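
A minimal sketch of how the "stuck" symptom could be probed without
hanging the session: peek a single change under a transaction-local
statement_timeout and treat a cancelled query as "stuck". The helper
name slot_peek_times_out and the 5 second default are hypothetical. It
assumes a DB-API connection that is not in autocommit mode (psycopg2's
default), an error object exposing the SQLSTATE as pgcode, and a slot
whose output plugin emits text (e.g. test_decoding). I have not verified
that a slot broken by pg_repack honours the timeout, only that ctrl-c
interrupted it.

```python
QUERY_CANCELED = "57014"  # SQLSTATE for query_canceled (statement timeout, ctrl-c)


def slot_peek_times_out(conn, slot_name, timeout_ms=5000):
    """Return True if peeking one change from the slot hits statement_timeout.

    Hypothetical helper: `conn` is any DB-API connection to PostgreSQL,
    assumed not to be in autocommit mode.
    """
    cur = conn.cursor()
    try:
        # is_local = true limits the timeout to the current transaction.
        cur.execute(
            "SELECT set_config('statement_timeout', %s, true)",
            (str(timeout_ms),),
        )
        # Peeking does not consume changes, so the slot position is untouched.
        cur.execute(
            "SELECT count(*) FROM pg_logical_slot_peek_changes(%s, NULL, 1)",
            (slot_name,),
        )
        cur.fetchone()
        return False
    except Exception as exc:
        # Drivers such as psycopg2 expose the SQLSTATE on the exception.
        if getattr(exc, "pgcode", None) == QUERY_CANCELED:
            return True
        raise
    finally:
        # End the transaction so the local timeout setting is discarded.
        conn.rollback()
        cur.close()
```

A test could run this before and after a pg_repack run on a table with
unconsumed changes, and flag the run if the result flips from False to
True.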

In the above issue #113, Petr Jelinek commented:

> From quick look at pg_repack, the way it does table rewrite is almost guaranteed
> to break logical decoding unless there is zero unconsumed changes for a given table
> as it does not build the necessary mappings info for logical decoding that standard
> heap rewrite in postgres does.

Unfortunately, he didn't follow up on requests for further details.

I've started drilling down the problem, observing that the swap
function, swap_heap_or_index_files() [1] was cargoculted by the
original author from the CLUSTER command code as of PG 8.2 [2] (with a
custom addition to update the relfrozenxid which seems backwards to me
as it sets the older frozen xid on the new table [3]).

[1] https://github.com/reorg/pg_repack/blob/ver_1.4.1/lib/repack.c#L1082
[2] https://github.com/postgres/postgres/blob/REL8_2_STABLE/src/backend/commands/cluster.c#L783
[3] https://github.com/reorg/pg_repack/issues/152

So that code is effectively missing a good 10 years of development.
Before jumping into fast-forwarding it, I would like to ask for some
help, i.e.

- Is Petr's diagnosis right, and is the freezing of logical replication
to be blamed on the missing mappings?
- Can you suggest a test to reproduce the issue reliably?
- What are mapped relations anyway?

Thank you in advance for any help (either info about how to fix the
issue properly, or a patch by someone who happens to really understand
what we are talking about).

-- Daniele
