Re: BUGFIX: standby disconnect can corrupt serialized reorder buffers

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: BUGFIX: standby disconnect can corrupt serialized reorder buffers
Date: 2018-01-05 06:22:59
Message-ID: CAMsr+YFw4arf=LtaYgVL47J9y_t5uNkPpw1zS8tcbLn6REEJ9A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 5 January 2018 at 12:16, Stephen Frost <sfrost(at)snowman(dot)net> wrote:

> Greetings all,
>
> * Masahiko Sawada (sawada(dot)mshk(at)gmail(dot)com) wrote:
> > On Tue, Dec 26, 2017 at 10:03 PM, Petr Jelinek
> > <petr(dot)jelinek(at)2ndquadrant(dot)com> wrote:
> > > On 26/12/17 11:13, Masahiko Sawada wrote:
> > >> On Tue, Dec 26, 2017 at 12:49 AM, Petr Jelinek
> > >> <petr(dot)jelinek(at)2ndquadrant(dot)com> wrote:
> > >>
> > >>>>
> > >>>> It's not a problem on crash restart because StartupReorderBuffer
> already
> > >>>> does the required delete.
> > >>>>
> > >>>> ReorderBufferSerializeTXN, which spills the txns to disk, doesn't
> appear
> > >>>> to have any guard to ensure that the segment files don't already
> exist
> > >>>> when it goes to serialize a snapshot. Adding one there would
> probably be
> > >>>> expensive; we don't know the last lsn of the txn yet, so to be
> really
> > >>>> safe we'd have to do a directory listing and scan for any
> txn-$OURXID-*
> > >>>> entries.
> > >>>>
> > >>>> So to fix, I suggest that we should do a
> > >>>> slot-specific StartupReorderBuffer-style deletion when we start a
> new
> > >>>> decoding session on a slot, per attached patch.
> > >>>>
> > >>>> It might be nice to also add a hook on proc exit, so we don't have
> stale
> > >>>> buffers lying around until next decoding session, but I didn't want
> to
> > >>>> add new global state to reorderbuffer.c so I've left that for now.
> > >>>
> > >>>
> > >>> Hmm, can't we simply call the new cleanup function in
> > >>> ReplicationSlotRelease()? That's called during process exit and we
> know
> > >>> there about slot so no extra global variables are needed.
> > >>>
> > >>
> > >> I guess that ReplicationSlotRelease() currently might not get called
> > >> if walsender exits by proc_exit(). ReplicationSlotRelease() can is
> > >> called by some functions such as WalSndErrorCleanup(), but at least in
> > >> the case where wal sender exits due to failed to write data to socket,
> > >> ReplicationSlotRelease() didn't get called as far as I tested.
> > >>
> > >
> > > Are you sure about that testing? Did you test it with replication slot
> > > active? ReplicationSlotRelease() is called from ProcKill() if the
> > > process is using a slot and should be called for any kind of exit
> except
> > > for outright crash (the kind that makes postgres kill all backends). If
> > > it wasn't we'd never unlock the replication slot used by the exiting
> > > walsender.
> >
> > Ah, sorry I was wrong. WalSndErrorCleanup() didn't get called but
> > ReplicationSlotRelease() got called. I agree that cleanup function
> > gets called in ReplicationSlotRelease().
>
> This patch is currently in 'Waiting on Author' state, but looks like it
> should actually be in Needs Review (or perhaps Ready for Commmitter)..?
>
> Craig, would you agree with that and, if so, please update the status
> accordingly.
>
>
WoA looks correct, Petr suggested a tweak to how and when the cleanup is
done that I need to adopt.

I'm just about out of time before a trip, but I'll get a colleague to
update it if I don't get the chance, so we can get it in and backpatched
for this CF.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2018-01-05 06:33:10 Re: Condition variable live lock
Previous Message Tom Lane 2018-01-05 06:10:57 Re: Condition variable live lock