Re: Possible bug in logical replication.

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: a(dot)sher(at)postgrespro(dot)ru
Cc: k(dot)knizhnik(at)postgrespro(dot)ru, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Possible bug in logical replication.
Date: 2018-05-18 05:37:14
Message-ID: 20180518.143714.31685500.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Thu, 17 May 2018 13:54:07 +0300, Arseny Sher <a(dot)sher(at)postgrespro(dot)ru> wrote in <87in7md034(dot)fsf(at)ars-thinkpad>
>
> Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> writes:
>
> > I think that using restart_lsn instead of confirmed_flush is not right approach.
> > If restart_lsn is not available and confirmed_flush is pointing to page
> > boundary, then in any case we should somehow handle this case and adjust
> > startlsn to point on the valid record position (by jjust adding page header
> > size?).
>
> Well, restart_lsn is always available on live slot: it is initially set
> in ReplicationSlotReserveWal during slot creation.

restart_lsn stays at the beginning of a transaction until the
transaction ends so just using restart_lsn allows repeated
decoding of a transaction, in short, rewinding occurs. The
function works only for inactive slot so the current code works
fine on this point. Addition to that restart_lsn also can be on a
page bounary.

We can see the problem easily.

1. Just create a logical replication slot with setting current LSN.

select pg_create_logical_replication_slot('s1', 'pgoutput');

2. Advance LSN by two or three pages by doing anyting.

3. Advance the slot to a page bounadry.

e.g. select pg_replication_slot_advance('s1', '0/9624000');

4. advance the slot further, then crash.

So directly set ctx->reader->EndRecPtr by startlsn fixes the
problem, but I found another problem here.

The function accepts any LSN even if it is not at the begiining
of a record. We will see errors or crashs or infinite waiting or
maybe any kind of trouble by such values. The moved LSN must
always be at the "end of a record" (that is, at the start of the
next recored). The attached patch also fixes this.

The documentation doesn't look requiring a fix.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
fix_pg_logical_replication_slot_advance.patch text/x-patch 3.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2018-05-18 05:42:00 Re: Is a modern build system acceptable for older platforms
Previous Message Pavel Stehule 2018-05-18 05:27:16 Re: Is a modern build system acceptable for older platforms