Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Andres Freund <andres(at)anarazel(dot)de>, nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2018-01-18 11:58:27
Message-ID: 20180118115827.GA12826@paquier.xyz
Lists: pgsql-bugs pgsql-hackers

On Wed, Jan 17, 2018 at 07:38:32AM -0500, Stephen Frost wrote:
> Thanks for looking at this patch previously. This patch is still in
> Needs Review but it's not clear if that's correct or not, but this seems
> to be a bug-fix, so it would be great if we could make some progress on
> it (particularly since it was reported over a year ago and goes back to
> 9.4, according to the thread, from what I can tell).
>
> Any chance one of you can provide some further thoughts on it or make it
> clear if we are waiting for a new patch or if the existing one is still
> reasonable and just needs to be reviewed (in which case, perhaps one of
> you could help with that..)?

Here is my input, then, with a summary of this thread. The base
problem here is that WAL segments are not retained on a primary using
a replication slot when a record begins at a segment boundary. When
in recovery and trying to fetch a record from a source, the standby
tries to fetch the segment from its beginning, but if the primary has
already recycled its old segments, the segment holding the start of
the record may be gone, which breaks the contract that the
replication slot provides. If we want to get this addressed
correctly, I think that we need two things:
1) On the primary side, correctly track the beginning of a record so
that, when a record runs across multiple segments, the slot does not
let them be recycled.
2) On the standby side, track the beginning of the record as well so
that restart points do not risk recycling segments which are still
needed (the segment arithmetic involved is sketched just below).
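
To make the failure mode concrete, here is a minimal sketch of that
arithmetic (plain C, not PostgreSQL code, assuming the default 16MB
segment size and made-up LSNs): a record that starts in segment N but
whose tail is flushed in segment N+1 needs segment N to stay around,
while a slot positioned on the flush LSN only protects segment N+1.

#include <stdio.h>
#include <stdint.h>

#define WAL_SEG_SIZE   (16 * 1024 * 1024)   /* assumed wal_segment_size */

/* Which segment holds a given LSN. */
static uint64_t lsn_to_segno(uint64_t lsn) { return lsn / WAL_SEG_SIZE; }

int main(void)
{
    uint64_t record_start = 0x2FFFF50;  /* last bytes of segment 2 */
    uint64_t flush_lsn    = 0x3000200;  /* standby has flushed into segment 3 */

    /* Segment protected when the slot follows the flush LSN... */
    printf("segment kept from flush LSN:   %llu\n",
           (unsigned long long) lsn_to_segno(flush_lsn));
    /* ...versus the segment actually needed to re-read the record. */
    printf("segment needed for the record: %llu\n",
           (unsigned long long) lsn_to_segno(record_start));
    return 0;
}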

A couple of approaches have been discussed on this thread:
a) The first patch did things on the primary side by opening an
XLogReader which was in charge of checking whether the record we are
looking at sits at a segment boundary (what this check amounts to is
sketched in plain arithmetic below, after the two approaches). On top
of being costly, this basically achieved 1) but failed at 2).
b) The second patch I would like to mention does things on the
standby side by making it possible to change the WAL source while
fetching a single record, so that the beginning of a cross-segment
record can be fetched from, say, an archive or the local xlog, and
the rest requested from the primary. This patch can be found in
https://www.postgresql.org/message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp
and the diff in WaitForWALToBecomeAvailable() scares me a lot because
it seems to me that this actually breaks timeline switch consistency.
The concept of switching the WAL source in the middle of a WAL
segment is risky anyway, because we perform sanity checks while
switching sources. The bug we are dealing with here is very rare, and
hitting a problem like that during a timeline switch is even rarer,
but the risk is not zero and we may end up with a corrupted instance,
so in my opinion we should not take this approach. Also, this
achieves point 2) above (though not completely), but not 1).
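
For what it's worth, the boundary check that approach a) performed
through an XLogReader essentially amounts to the following once the
record's start and end LSNs are known (plain C with an assumed 16MB
segment size, not the actual patch code); the cost presumably comes
from having to run an XLogReader on the primary to find those
boundaries in the first place, not from the comparison itself:

#include <stdbool.h>
#include <stdint.h>

#define WAL_SEG_SIZE (16 * 1024 * 1024)   /* assumed wal_segment_size */

/*
 * A record needs its starting segment protected from recycling if its
 * start and end LSNs do not fall within the same WAL segment.
 */
static bool
record_crosses_segment(uint64_t start_lsn, uint64_t end_lsn)
{
    return (start_lsn / WAL_SEG_SIZE) != (end_lsn / WAL_SEG_SIZE);
}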

An approach I have been thinking about to address points 1) and 2) is
to introduce a new message type in the replication protocol which can
be used to report the position that a replication slot should hold
back to. It would have been tempting to extend the existing standby
feedback message, but we cannot afford a protocol break either. So my
proposal is to create a new replication message which can be used in
the context of a replication slot instead of the usual standby
feedback message; let's call it the slot feedback message. This is
similar to the standby message, except that it also reports the LSN
the slot should wait for. This results in an addition of... 8 bytes
in the message, which is a lot. I have thought about the possibility
of reducing it to 4 bytes, but one record can span more than 4GB
worth of segments, right? (Note that the replication slot data should
be updated with that instead of the flush LSN.)
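
Purely for illustration, here is a rough sketch of the fields such a
message could carry. Nothing below exists in PostgreSQL; the names
are made up, and the layout simply mirrors the existing standby
status update ('r') message with one extra LSN appended:

#include <stdint.h>

/*
 * Hypothetical "slot feedback" message, written as a flat struct.
 * All fields except hold_lsn match what the standby status update
 * already carries; hold_lsn is the 8-byte addition discussed above.
 */
typedef struct SlotFeedbackMessage
{
    char        msgtype;         /* hypothetical new message type byte */
    uint64_t    write_lsn;       /* last WAL position written by standby */
    uint64_t    flush_lsn;       /* last WAL position flushed by standby */
    uint64_t    apply_lsn;       /* last WAL position applied by standby */
    uint64_t    hold_lsn;        /* NEW: start LSN of the record being
                                  * assembled; the slot's restart position
                                  * must not advance past this */
    int64_t     send_time;       /* sender's clock at transmission time */
    char        reply_requested; /* nonzero if the standby wants an answer */
} SlotFeedbackMessage;

The hold_lsn field is where the extra 8 bytes go; shrinking it to a
4-byte quantity (say, an offset from flush_lsn) would only be safe if
a single record could never span more than 4GB worth of segments.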

Would love to hear thoughts from others. Of course, feel free to correct
me as needed.
--
Michael
