Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: andres(at)anarazel(dot)de
Cc: michael(dot)paquier(at)gmail(dot)com, sfrost(at)snowman(dot)net, nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2018-01-19 01:54:53
Message-ID: 20180119.105453.132600443.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-bugs pgsql-hackers

Hello,

Thank you for the comprehensive explanation, Michael.

I happened to see another instance of this failure at one of our
client sites. The precise steps leading to the situation are not
available, but it is reported that it occurred without an immediate
shutdown. As far as I know, just starting the standby after
pg_basebackup caused it.

(The standby fetches WAL from the primary's archive, so just
switching the WAL segment could break the loop in this case, but
that is not always the case.)

At Thu, 18 Jan 2018 11:52:52 -0800, Andres Freund <andres(at)anarazel(dot)de> wrote in <20180118195252(dot)hyxgkb3krcqjzfjm(at)alap3(dot)anarazel(dot)de>
> Hi,
>
> On 2018-01-18 20:58:27 +0900, Michael Paquier wrote:
> > On Wed, Jan 17, 2018 at 07:38:32AM -0500, Stephen Frost wrote:
> > > Thanks for looking at this patch previously. This patch is still in
> > > Needs Review but it's not clear if that's correct or not, but this seems
> > > to be a bug-fix, so it would be great if we could make some progress on
> > > it (particularly since it was reported over a year ago and goes back to
> > > 9.4, according to the thread, from what I can tell).
> > >
> > > Any chance one of you can provide some further thoughts on it or make it
> > > clear if we are waiting for a new patch or if the existing one is still
> > > reasonable and just needs to be reviewed (in which case, perhaps one of
> > > you could help with that..)?
> >
> > Here is my input then with a summary regarding this thread. The base
> > problem here is that WAL segments are not retained on a primary when
> > using a replication slot if a record begins at a segment boundary.
>
> > 1) On the primary side, track correctly the beginning of a record so
> > that, when a record runs across multiple segments, the slot does not
> > recycle them.
>
> I think that's a really bad idea. There's no bug on the sender side
> here. The sender allows removal of WAL based on what the receiver
> acks. This requires that all senders have a low-level understanding of
> what's being sent - not a good idea, as we e.g. imo want to have tools
> that serve WAL over the streaming protocol from an archive.
>
> > 2) On the standby side, we also need to track the beginning of the
> > record so that we don't risk recycling things because of restart points.
>
>
> > On this thread have been discussed a couple of approaches:
> > a) The first patch was doing things on the primary side by opening an
> > XLogReader which was in charge of checking whether the record we are
> > looking at was at a segment boundary. On top of being costly, this
> > basically achieved 1) but failed on 2).
>
> I am very very strongly against this approach.
>
> > b) The second patch that I would like to mention is doing things on the
> > standby side by being able to change the WAL source when fetching a
> > single record, so that it is possible to fetch the beginning of a
> > cross-segment record from, say, an archive or the local xlog, and then
> > request the rest from the primary. This patch can be found in
> > https://www.postgresql.org/message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp
> > and the diff in WaitForWALToBecomeAvailable() scares me a lot because
> > it seems to me that this actually breaks timeline switch
> > consistency. The concept of switching a WAL source in the middle of a
> > WAL segment is risky anyway, because we perform sanity checks while
> > switching sources. The bug we are dealing with here is very rare, and
> > seeing a problem like that during a timeline switch is even rarer, but
> > the risk is not zero and we may end up with a corrupted instance, so in
> > my opinion we should not take this approach. Also, this actually
> > achieves point 2) above (though not completely), but not 1).
>
> I don't agree with the sentiment about the approach, but I agree there
> might be weaknesses in the implementation (which I have not reviewed). I
> think it's perfectly sensible to require fetching one segment from
> pg_xlog and the next one via another method. Currently the inability to
> do so often leads to us constantly refetching the same segment.

Though I'm still not fully confident, if reading a set of
continuation records from two (or more) sources can lead to
inconsistency, then two successive (complete) records read from
different sources face the same risk.
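
To make that hazard concrete, below is a tiny standalone C sketch
(not PostgreSQL code; the 16MB segment size and the LSN values are
just assumptions for illustration). It shows how the start of a
cross-segment record and the receiver's flush LSN can fall into
different segments, so that retaining WAL only back to the flush LSN
lets the primary recycle a segment the standby still needs:

#include <stdio.h>
#include <stdint.h>

#define WAL_SEG_SIZE ((uint64_t) 16 * 1024 * 1024)  /* default 16MB segments */

/* segment number containing a given LSN */
static uint64_t
lsn_to_segno(uint64_t lsn)
{
    return lsn / WAL_SEG_SIZE;
}

int
main(void)
{
    /* hypothetical record starting on the last page of segment 1 ... */
    uint64_t record_start = 2 * WAL_SEG_SIZE - 128;
    /* ... while the receiver has already flushed part of it in segment 2 */
    uint64_t flushed = 2 * WAL_SEG_SIZE + 8192;

    printf("record starts in segment %llu, flush LSN is in segment %llu\n",
           (unsigned long long) lsn_to_segno(record_start),
           (unsigned long long) lsn_to_segno(flushed));

    if (lsn_to_segno(flushed) > lsn_to_segno(record_start))
        printf("keeping WAL only back to the flush LSN would let segment %llu\n"
               "be recycled, although the standby must re-read the whole record\n"
               "from its start after a restart\n",
               (unsigned long long) lsn_to_segno(record_start));
    return 0;
}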

> > An approach I have been thinking about to address points 1) and 2) is to
> > introduce a new message type in the replication protocol which can be
> > used to report the position a replication slot should hold. It would
> > have been tempting to extend the existing standby feedback message, but
> > we cannot afford a protocol break either. So my proposal is to create a
> > new replication message which can be used in the context of a
> > replication slot instead of the usual standby feedback message; let's
> > call it the slot feedback message. This is similar to the standby
> > message, except that it also reports the LSN the slot should wait
> > for. This results in an addition of... 8 bytes in the message, which is
> > a lot. I have thought about the possibility of reducing it to 4 bytes,
> > but one record can span across more than 4GB worth of segments, right?
> > (Note that the replication slot data should be updated with that
> > instead of the flush LSN.)
>
> -0.5, I think this just adds complexity instead of fixing the underlying
> problem.

On the other hand, if one logical record must be read from a single
source, we need some means to prevent WAL recycling on the primary
side. Doing that within the primary has been rejected by consensus.
(There might be a smarter way to do it, though.) To do it from the
standby side, either retarding write feedback or this kind of
special feedback message would be needed. Either way, it is
necessary to introduce WAL-record awareness into the WAL shipping
process, which seems to add a large amount of complexity.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
