Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From: Andres Freund <andres(at)anarazel(dot)de>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2018-01-18 19:52:52
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers


On 2018-01-18 20:58:27 +0900, Michael Paquier wrote:
> On Wed, Jan 17, 2018 at 07:38:32AM -0500, Stephen Frost wrote:
> > Thanks for looking at this patch previously. This patch is still in
> > Needs Review but it's not clear if that's correct or not, but this seems
> > to be a bug-fix, so it would be great if we could make some progress on
> > it (particularly since it was reported over a year ago and goes back to
> > 9.4, according to the thread, from what I can tell).
> >
> > Any chance one of you can provide some further thoughts on it or make it
> > clear if we are waiting for a new patch or if the existing one is still
> > reasonable and just needs to be reviewed (in which case, perhaps one of
> > you could help with that..)?
> Here is my input then with a summary regarding this thread. The base
> problem here is that WAL segments are not retained on a primary when
> using a replication slot if a record begins at a segment boundary.

> 1) On the primary side, track correctly the beginning of a record so as
> when a record runs across multiple segments, the slot does not recycle
> them.

I think that's a really bad idea. There's no bug on the sender side
here. The sender allows removal of WAL based on what the receiver
acks. This requires that all senders have low-level understanding on
what's being sent - not a good idea, as we e.g. imo want to have tools
that serve WAL over the streaming protocol from an archive.

> 2) On the standby side, we need to track as well the beginning of the
> record so as we don't risk recycling things because of restart points.

> On this thread have been discussed a couple of approaches:
> a) The first patch was doing things on the primary side by opening an
> XLogReader which was in charge to check if the record we are looking at
> was at a segment boundary. On top of being costly, this basically
> achieved 1) but failed on 2).

I am very very strongly against this approach.

> b) The second patch that I would like to mention is doing things on the
> standby side by being able to change a WAL source when fetching a single
> record so as it is possible to fetch the beginning of a cross-segment
> record from say an archive or the local xlog, and then request the rest
> on the primary. This patch can be found in
> and the diff in WaitForWALToBecomeAvailable() scares me a lot because,
> it seems to me that this actually breaks timeline switch
> consistency. The concept of switching a WAL source in the middle of a
> WAL segment is risky anyway, because we perform sanity checks while
> switching sources. The bug we are dealing about here is very rare, and
> seeing a problem like that for a timeline switch is even more rare, but
> the risk is not zero and we may finish with a corrupted instance, so we
> should not in my opinion take this approach. Also this actually achieves
> point 2) above, not completely though, but not 1).

I don't agree with the sentiment about the approach, but I agree there
might be weaknesses in the implementation (which I have not reviewed). I
think it's perfectly sensible to require fetching one segment from
pg_xlog and the next one via another method. Currently the inability to
do so often leads to us constantly refetching the same segment.

> An approach I have been thinking about to address points 1) and 2) is to
> introduce a new messages type in the replication protocol which can be
> used report as well at which position a replication slot should hold its
> position. It would have been tempting to extend the existing standby
> feedback message but we cannot afford a protocol break either. So my
> proposal is to create a new replication message which can be used in the
> context of a replication slot instead of the usual standby feedback
> message, let's call it slot feedback message. This is similar to the
> standby message, except that it reports as well the LSN the slot should
> wait for. This results in an addition of... 8 bytes in the message which
> is a lot. I have thoughts about the possibility to reduce it to 4 bytes
> but one record can spawn across more than 4GB worth of segments, right?
> (Note that replication slot data should updated with that instead of the
> flush LSN).

-0.5, I think this just adds complexity instead of fixing the underlying


Andres Freund

In response to


Browse pgsql-bugs by date

  From Date Subject
Next Message Mohan, Kumar 2018-01-18 19:55:33 PostgresSQL Support Enquiry.
Previous Message Dave Cramer 2018-01-18 19:33:35 Re: BUG #15013: JNI-JDBC: org.postgresql.util.PSQLException: FEHLER: ungültiges Message-Format

Browse pgsql-hackers by date

  From Date Subject
Next Message Corey Huinker 2018-01-18 21:09:13 Re: CREATE ROUTINE MAPPING
Previous Message Andres Freund 2018-01-18 19:41:00 Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning