Quick Links

Re: Possible bug in logical replication.

From:	Arseny Sher <a(dot)sher(at)postgrespro(dot)ru>
To:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	a(dot)sher(at)postgrespro(dot)ru, k(dot)knizhnik(at)postgrespro(dot)ru, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: Possible bug in logical replication.
Date:	2018-05-25 06:43:23
Message-ID:	87a7soqlpw.fsf@ars-thinkpad
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hello,

Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> writes:

> restart_lsn stays at the beginning of a transaction until the
> transaction ends so just using restart_lsn allows repeated
> decoding of a transaction, in short, rewinding occurs. The
> function works only for inactive slot so the current code works
> fine on this point.

Sorry, I do not follow. restart_lsn is advanced whenever there is a
consistent snapshot dumped (in xl_running_xacts) which is old enough to
wholly decode all xacts not yet confirmed by the client. Could you
please elaborate, what's wrong with that?

> Addition to that restart_lsn also can be on a
> page bounary.

Do you have an example of that? restart_lsn is set initially to WAL
insert position at ReplicationSlotReserveWal, and later it always points
to xl_running_xacts record with consistent snapshot dumped.

> So directly set ctx->reader->EndRecPtr by startlsn fixes the
> problem, but I found another problem here.

There is a minor issue with the patch. Now slot advancement hangs
polling for new WAL on my example from [1]; most probably because we
must exit the loop when ctx->reader->EndRecPtr == moveto.

> The function accepts any LSN even if it is not at the begiining
> of a record. We will see errors or crashs or infinite waiting or
> maybe any kind of trouble by such values. The moved LSN must
> always be at the "end of a record" (that is, at the start of the
> next recored). The attached patch also fixes this.

Indeed, but we have these problems only if we are trying to read WAL
since confirmed_flush.

[1] https://www.postgresql.org/message-id/873720e4hf.fsf%40ars-thinkpad

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

In response to

Re: Possible bug in logical replication. at 2018-05-18 05:37:14 from Kyotaro HORIGUCHI

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Andres Freund	2018-05-25 06:50:48	Re: Keeping temporary tables in shared buffers
Previous Message	Andres Freund	2018-05-25 06:40:55	Re: Redesigning the executor (async, JIT, memory efficiency)