Re: Possible bug in logical replication.

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Arseny Sher <a(dot)sher(at)postgrespro(dot)ru>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Possible bug in logical replication.
Date: 2018-05-23 06:56:22
Message-ID: CAD21AoA+5Tz0Z8zHOmD=sU5F=cygoEjHs7WvbBDL07fH9ayVaw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, May 18, 2018 at 2:37 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> At Thu, 17 May 2018 13:54:07 +0300, Arseny Sher <a(dot)sher(at)postgrespro(dot)ru> wrote in <87in7md034(dot)fsf(at)ars-thinkpad>
>>
>> Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> writes:
>>
>> > I think that using restart_lsn instead of confirmed_flush is not right approach.
>> > If restart_lsn is not available and confirmed_flush is pointing to page
>> > boundary, then in any case we should somehow handle this case and adjust
>> > startlsn to point on the valid record position (by jjust adding page header
>> > size?).
>>
>> Well, restart_lsn is always available on live slot: it is initially set
>> in ReplicationSlotReserveWal during slot creation.
>
> restart_lsn stays at the beginning of a transaction until the
> transaction ends so just using restart_lsn allows repeated
> decoding of a transaction, in short, rewinding occurs. The
> function works only for inactive slot so the current code works
> fine on this point. Addition to that restart_lsn also can be on a
> page bounary.
>
>
> We can see the problem easily.
>
> 1. Just create a logical replication slot with setting current LSN.
>
> select pg_create_logical_replication_slot('s1', 'pgoutput');
>
> 2. Advance LSN by two or three pages by doing anyting.
>
> 3. Advance the slot to a page bounadry.
>
> e.g. select pg_replication_slot_advance('s1', '0/9624000');
>
> 4. advance the slot further, then crash.
>
> So directly set ctx->reader->EndRecPtr by startlsn fixes the
> problem, but I found another problem here.

I confirmed that the attached patch fixes this problem as well as the
same problem reported on another thread.

I'm not sure it's a good approach to change the state of xlogreader
directly in the replication slot codes because it also means that we
have to care about this code as well when xlogreader code is changed.
Another possible way might be to make XLogFindNextRecord valid in
backend code and move startlsn to the first valid record with an lsn
>= startlsn by using that function. Please find attached patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment Content-Type Size
find_next_valid_record.patch application/octet-stream 1.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Magnus Hagander 2018-05-23 06:59:43 Re: SCRAM with channel binding downgrade attack
Previous Message Heikki Linnakangas 2018-05-23 06:46:14 Re: SCRAM with channel binding downgrade attack