From: | Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> |
---|---|
To: | a(dot)sher(at)postgrespro(dot)ru |
Cc: | k(dot)knizhnik(at)postgrespro(dot)ru, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Possible bug in logical replication. |
Date: | 2018-05-18 05:37:14 |
Message-ID: | 20180518.143714.31685500.horiguchi.kyotaro@lab.ntt.co.jp |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
At Thu, 17 May 2018 13:54:07 +0300, Arseny Sher <a(dot)sher(at)postgrespro(dot)ru> wrote in <87in7md034(dot)fsf(at)ars-thinkpad>
>
> Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> writes:
>
> > I think that using restart_lsn instead of confirmed_flush is not right approach.
> > If restart_lsn is not available and confirmed_flush is pointing to page
> > boundary, then in any case we should somehow handle this case and adjust
> > startlsn to point on the valid record position (by jjust adding page header
> > size?).
>
> Well, restart_lsn is always available on live slot: it is initially set
> in ReplicationSlotReserveWal during slot creation.
restart_lsn stays at the beginning of a transaction until the
transaction ends so just using restart_lsn allows repeated
decoding of a transaction, in short, rewinding occurs. The
function works only for inactive slot so the current code works
fine on this point. Addition to that restart_lsn also can be on a
page bounary.
We can see the problem easily.
1. Just create a logical replication slot with setting current LSN.
select pg_create_logical_replication_slot('s1', 'pgoutput');
2. Advance LSN by two or three pages by doing anyting.
3. Advance the slot to a page bounadry.
e.g. select pg_replication_slot_advance('s1', '0/9624000');
4. advance the slot further, then crash.
So directly set ctx->reader->EndRecPtr by startlsn fixes the
problem, but I found another problem here.
The function accepts any LSN even if it is not at the begiining
of a record. We will see errors or crashs or infinite waiting or
maybe any kind of trouble by such values. The moved LSN must
always be at the "end of a record" (that is, at the start of the
next recored). The attached patch also fixes this.
The documentation doesn't look requiring a fix.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachment | Content-Type | Size |
---|---|---|
fix_pg_logical_replication_slot_advance.patch | text/x-patch | 3.0 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Andres Freund | 2018-05-18 05:42:00 | Re: Is a modern build system acceptable for older platforms |
Previous Message | Pavel Stehule | 2018-05-18 05:27:16 | Re: Is a modern build system acceptable for older platforms |