Re: [HACKERS] Bug in Physical Replication Slots (at least 9.5)?

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To:	masao(dot)fujii(at)gmail(dot)com
Cc:	michael(dot)paquier(at)gmail(dot)com, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org, pgsql-bugs(at)postgresql(dot)org
Subject:	Re: [HACKERS] Bug in Physical Replication Slots (at least 9.5)?
Date:	2017-02-02 02:28:29
Message-ID:	20170202.112829.188781915.horiguchi.kyotaro@lab.ntt.co.jp
Lists:	pgsql-bugs pgsql-hackers

Thank you for the comment.

At Thu, 2 Feb 2017 01:26:03 +0900, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote in <CAHGQGwEET=QBA_jND=xhrXn+9ZreP4_qMBAqsBZg56beqxbveg(at)mail(dot)gmail(dot)com>
> > The attached patch does that. Usually it reads page headers only
> > on segment boundaries, but once continuation record found (or
> > failed to read the next page header, that is, the first record on
> > the first page in the next segment has not been replicated), it
> > becomes to happen on every page boundary until non-continuation
> > page comes.
>
> I'm afraid that many WAL segments would start with a continuation record
> when there are the workload of short transactions (e.g., by pgbench), and
> which would make restart_lsn go behind very much. No?

I agree. That is why the patch tries to release the lock at every
page boundary, but restart_lsn still falls far behind if many
contiguous pages are CONTRECORD pages. I think the chance that
this situation persists for one or more segments is negligibly
low. That said, there *is* a possibility of false continuation
anyway.
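
To make the behavior above concrete, here is a toy model in Python.
It is not the patch's code: the class name, the callback
starts_with_contrecord, and the rule "hold restart_lsn at its old
value" are assumptions made for illustration only. It checks a page
header at segment boundaries, and at every page boundary while a
continuation is in progress, treating a header that has not arrived
yet as a continuation.

```python
XLOG_BLCKSZ = 8192                    # default WAL page size
WAL_SEGMENT_SIZE = 16 * 1024 * 1024   # default WAL segment size


class RestartLsnTracker:
    """Toy model (hypothetical): hold restart_lsn across continuation pages."""

    def __init__(self, start_lsn, starts_with_contrecord):
        self.restart_lsn = start_lsn
        # Next page boundary whose header has not been examined yet.
        self.next_page = (start_lsn // XLOG_BLCKSZ + 1) * XLOG_BLCKSZ
        self.holding = False
        # Hypothetical callback: does the page at this LSN start with a
        # continuation record?
        self.starts_with_contrecord = starts_with_contrecord

    def on_flush(self, flush_lsn):
        while self.next_page <= flush_lsn:
            page = self.next_page
            # Headers are read only on segment boundaries, or on every
            # page boundary while a continuation is in progress.
            if self.holding or page % WAL_SEGMENT_SIZE == 0:
                if page == flush_lsn:
                    # Header not received yet: assume the worst and
                    # re-examine this page on the next call.
                    self.holding = True
                    break
                self.holding = self.starts_with_contrecord(page)
            self.next_page = page + XLOG_BLCKSZ
        if not self.holding:
            self.restart_lsn = flush_lsn
        return self.restart_lsn
```

With two contiguous continuation pages at the start of a segment,
restart_lsn stays in the previous segment until the flush position
passes the first non-continuation page. The longer the run of such
pages, the further behind it falls.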

> The discussion on this thread just makes me think that restart_lsn should
> indicate the replay location instead of flush location. This seems safer.

The standby restarts from minRecoveryPoint, which is a copy of
XLogCtl->replayEndRecPtr and is updated by
UpdateMinRecoveryPoint(). Meanwhile, applyPtr in reply messages is
a copy of XLogCtl->lastReplayedEndRecPtr, which is updated after
the update of the on-disk minRecoveryPoint. So it seems safe from
that viewpoint.

On the other hand, apply can be paused. Records are still copied
and flushed on the standby, so segments on the master that have
already been sent can safely be removed even in that case. In
spite of that, older segments on the master would be kept from
being removed during the pause if applyPtr were used as
restart_lsn. That would be another problem, and one that is sure
to happen.
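
A rough way to see the difference is the following Python sketch. It
is only an illustration under simplifying assumptions: apply is
paused at LSN 0, the standby flushes everything the master writes,
and the function names are made up for this example. It counts the
segments the master must keep when restart_lsn follows the flush
position versus the apply position.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024   # default WAL segment size


def retained_segments(master_lsn, restart_lsn):
    """Segments from the one holding restart_lsn up to the one holding
    master_lsn, inclusive."""
    return master_lsn // WAL_SEGMENT_SIZE - restart_lsn // WAL_SEGMENT_SIZE + 1


def simulate_pause(segments_written, use_apply_ptr):
    """Hypothetical scenario: apply is paused at LSN 0 while the master
    writes segments_written segments and the standby flushes all of it."""
    apply_ptr = 0
    history = []
    for n in range(1, segments_written + 1):
        master_lsn = n * WAL_SEGMENT_SIZE
        flush_ptr = master_lsn            # standby keeps receiving and flushing
        restart_lsn = apply_ptr if use_apply_ptr else flush_ptr
        history.append(retained_segments(master_lsn, restart_lsn))
    return history
```

In this model simulate_pause(3, False) stays at one retained segment,
while simulate_pause(3, True) grows by one segment for every segment
the master writes during the pause.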

I'm not sure how likely it is that several contiguous segments
are full of continuation pages. But I think it is worse for an
apply pause to cause needless pg_wal flooding.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Re: Bug in Physical Replication Slots (at least 9.5)? at 2017-02-01 16:26:03 from Fujii Masao
