Re: Bug in Physical Replication Slots (at least 9.5)?

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(dot)paquier(at)gmail(dot)com
Cc: jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org, pgsql-bugs(at)postgresql(dot)org
Subject: Re: Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-01-19 09:37:31
Message-ID: 20170119.183731.223893446.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

Hello,

At Wed, 18 Jan 2017 12:34:51 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqQytF2giE7FD-4oJJpPVwiKJrDQPc24hLNGThX01SbSmA(at)mail(dot)gmail(dot)com>
> On Tue, Jan 17, 2017 at 7:36 PM, Kyotaro HORIGUCHI
> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > I managed to reproduce this. A little tweak as the first patch
> > lets the standby to suicide as soon as walreceiver sees a
> > contrecord at the beginning of a segment.
>
> Good idea.

Thanks. Fortunately(?), the problematic situation seems to happen
at almost all segment boundary.

> > I believe that a continuation record cannot be span over three or
> > more *segments* (is it right?), so keeping one spare segment
> > would be enough. The attached second patch does this.
>
> I have to admit that I did not think about this problem much yet (I
> bookmarked this report weeks ago to be honest as something to look
> at), but that does not look right to me. Couldn't a record be spawned
> across even more segments? Take a random string longer than 64MB or
> event longer for example.

Though I haven't look closer to how a modification is splitted
into WAL records. A tuple cannot be so long. As a simple test, I
observed rechder->xl_tot_len at the end of XLogRecordAssemble
inserting an about 400KB not-so-compressable string into a text
column, but I saw a series of many records with shorter than
several thousand bytes.

> > Other possible measures might be,
> >
> > - Allowing switching wal source while reading a continuation
> > record. Currently ReadRecord assumes that a continuation record
> > can be read from single source. But this needs refactoring
> > involving xlog.c, xlogreader.c and relatives.
>
> This is scary thinking about back-branches.

Yes. It would be no longer a bug fix. (Or becomes a quite ugly hack..)

> > - Delaying recycling a segment until the last partial record on it
> > completes. This seems doable in page-wise (coarse resolution)
> > but would cost additional reading of past xlog files (page
> > header of past pages is required).
>
> Hm, yes. That looks like the least invasive way to go. At least that
> looks more correct than the others.

The attached patch does that. Usually it reads page headers only
on segment boundaries, but once continuation record found (or
failed to read the next page header, that is, the first record on
the first page in the next segment has not been replicated), it
becomes to happen on every page boundary until non-continuation
page comes.

I leave a debug info (at LOG level) in the attached file shown on
every state change of keep pointer. At least for pgbench, the
cost seems ignorable.

> > - Delaying write/flush feedback until the current record is
> > completed. walreceiver is not conscious of a WAL record and
> > this might break synchronous replication.
>
> Not sure about this one yet.

I'm not sure, too:p

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
retard_keeplogseg.patch text/x-patch 6.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kuntal Ghosh 2017-01-19 09:49:36 Re: parallelize queries containing subplans
Previous Message Dilip Kumar 2017-01-19 09:35:36 Re: parallelize queries containing subplans

Browse pgsql-bugs by date

  From Date Subject
Next Message Ron Ben 2017-01-19 12:30:33 installation of older versions
Previous Message Вадим Акбашев 2017-01-19 09:11:59 Re: Strange influence of default_statistics_target