Re: Two-phase update of restart_lsn in LogicalConfirmReceivedLocation

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Arseny Sher <a(dot)sher(at)postgrespro(dot)ru>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Two-phase update of restart_lsn in LogicalConfirmReceivedLocation
Date: 2018-03-08 01:48:50
Message-ID: CAMsr+YESPn5ER0Of9i53vZcHHJ69+76yCBizUqNKAGWD2HQcdg@mail.gmail.com
Lists: pgsql-hackers

On 8 March 2018 at 07:32, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > On Thu, Mar 1, 2018 at 2:03 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> >> So I can't say it's definitely impossible. It seems astonishingly unlikely,
> >> but that's not always good enough.
>
> > Race conditions tend to happen a lot more often than one might think.
>
> Just to back that up --- we've seen cases where people could repeatably
> hit race-condition windows that are just an instruction or two wide.
> The first one I came to in an idle archive search is
> https://www.postgresql.org/message-id/15543.1130714273%40sss.pgh.pa.us
> I vaguely recall others but don't feel like digging harder right now.
>
>
That's astonishing.

I guess if you repeat something enough times...

The reason I'm less concerned about this one is that you have to crash in
exactly the wrong place, *and* at a badly timed point in a race. But
the downside is that the result would be an unusable logical slot.

The simplest solution is probably just to mark the slot dirty while we hold
the spinlock, at the same time we advance its restart_lsn. Any checkpoint
will then call CheckPointReplicationSlots() and flush it. We don't
remove/recycle xlog segments until after that's done in CheckPointGuts(), so
it's guaranteed that the slot's new state will be on disk and we can never
have a stale restart_lsn pointing into truncated-away WAL.
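
To make the ordering argument concrete, here is a rough toy model in C. It
is not PostgreSQL code: ToySlot, Lsn and the toy_* functions are names made
up for illustration, and the spinlock is only indicated by comments. It just
shows the two properties I'm relying on: the dirty flag is set in the same
critical section that advances restart_lsn, and the checkpoint flushes the
slot before it decides which WAL to throw away.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;           /* hypothetical stand-in for an LSN type */

typedef struct ToySlot
{
    Lsn  restart_lsn;           /* in-memory value */
    Lsn  disk_restart_lsn;      /* value that would survive a crash */
    bool dirty;                 /* in-memory state differs from disk */
} ToySlot;

/* Advance restart_lsn and mark the slot dirty in one critical section. */
static void
toy_advance_restart_lsn(ToySlot *slot, Lsn new_lsn)
{
    /* real code would acquire the slot's spinlock here */
    if (new_lsn > slot->restart_lsn)
    {
        slot->restart_lsn = new_lsn;
        slot->dirty = true;     /* set together with the advance */
    }
    /* ... and release the spinlock here */
}

/* Write the slot to "disk" if it is dirty. */
static void
toy_flush_slot(ToySlot *slot)
{
    if (slot->dirty)
    {
        slot->disk_restart_lsn = slot->restart_lsn;
        slot->dirty = false;
    }
}

/*
 * Toy checkpoint: flush the slot first, then advance the oldest WAL
 * position that is kept. Returns the new oldest kept position.
 */
static Lsn
toy_checkpoint(ToySlot *slot, Lsn oldest_kept_wal)
{
    toy_flush_slot(slot);

    /* Only after the flush do we discard WAL older than restart_lsn. */
    if (slot->restart_lsn > oldest_kept_wal)
        oldest_kept_wal = slot->restart_lsn;

    /* The on-disk restart_lsn never points into discarded WAL. */
    assert(slot->disk_restart_lsn >= oldest_kept_wal);
    return oldest_kept_wal;
}
```

In this model, advancing a slot from 100 to 200 and then running
toy_checkpoint() leaves both the on-disk restart_lsn and the oldest kept WAL
position at 200. If the dirty flag were not set during the advance, the flush
would be skipped and the assertion would fail, which is the stale-slot case
described above.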

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
