Re: Physical replication slot advance is not persistent

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Subject: Re: Physical replication slot advance is not persistent
Date: 2020-01-21 06:57:33
Message-ID: CAMsr+YHNWq+U9EFKTcUN=AB_F0g_ppCo=1zcnymGcPJfW63b1w@mail.gmail.com
Lists: pgsql-hackers

On Tue, 21 Jan 2020 at 11:06, Michael Paquier <michael(at)paquier(dot)xyz> wrote:

> > The replication interface should not immediately flush changes to the
> > slot replay position on advance. It should be marked dirty and left to
> > be flushed by the next checkpoint. Doing otherwise potentially
> > introduces a lot of unnecessary fsync()s and may have an unpleasant
> > impact on performance.
>
> Some portions of the advancing code tell a different story. It seems
> to me that the intention behind the first implementation of slot
> advancing was to get things flushed if any advancing was done.

Taking a step back here, I have no concerns with the proposed changes
for pg_replication_slot_advance(). Disregard my comments about safety
with the SQL interface for the purposes of this thread; they apply
only to logical slots and are really unrelated to
pg_replication_slot_advance().

Re your comment above: For slot advances in general the flush to disk
is done lazily for performance reasons, but I think you meant
pg_replication_slot_advance() specifically.

pg_replication_slot_advance() doesn't appear to make any promises as
to immediate durability either way. It updates the required LSN
immediately with ReplicationSlotsComputeRequiredLSN(), so it
theoretically marks WAL as removable before it's flushed. But I don't
think we'll ever actually remove any WAL segments until checkpoint, at
which point we'll also flush any dirty slots, so it doesn't really
matter. For logical slots the lsn and xmin are both protected by the
effective/actual tracking logic and can't advance until the slot is
flushed.

The app might be surprised if the slot goes backwards after a
pg_replication_slot_advance() followed by a server crash, though.
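
For illustration, a rough sketch of the surprise I mean, with a
hypothetical slot name (this assumes the lazy-flush behaviour we're
discussing; the exact outcome depends on checkpoint timing):

    -- hypothetical physical slot, reserving WAL immediately
    SELECT pg_create_physical_replication_slot('some_slot', true);

    -- advance it; the new restart_lsn is visible right away
    SELECT pg_replication_slot_advance('some_slot', pg_current_wal_lsn());
    SELECT restart_lsn FROM pg_replication_slots WHERE slot_name = 'some_slot';

    -- if the slot is only marked dirty, a crash before the next
    -- checkpoint can leave restart_lsn lower after recovery than the
    -- value the advance call reported; an explicit CHECKPOINT (or the
    -- next regular one) is what actually persists it
    CHECKPOINT;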

> The
> check doing that is actually broken from the start, but that's another
> story. Could you check with Petr what was the intention here or drag
> his attention to this thread? He is the original author of the
> feature. So his output would be nice to have.

I'll ask him. He's pretty bogged down at the moment though, and I've
done a lot of work in this area too (see e.g. the catalog_xmin in hot
standby feedback changes).

> > The SQL interface advances the slot restart position and marks the
> > slot dirty *before the client has confirmed receipt of the data and
> > flushed it to disk*. So there's a data loss window. If the client
> > disconnects or crashes before all the data from that function call is
> > safely flushed to disk it may lose the data, then be unable to fetch
> > it again from the server because of the restart_lsn position advance.
>
> Well, you have the same class of problems with for example synchronous
> replication. The only state a client can be sure of is that it
> received a confirmation that the operation happens and completed.
> Any other state tells that the operation may have happened. Or not.
> Now, being sure that the data of the new slot has been flushed once
> the advancing function is done once the client got the confirmation
> that the work is done is a property which could be interesting to some
> class of applications.

That's what we already provide for the streaming interface for slot access.

I agree there's no need to shove a fix for the SQL interface for
phys/logical slots into this same discussion. I'm just trying to make
sure that, while we fix the perceived-missing flush, we don't also
"fix" the lazy flushing in the streaming interface, which is an
important part of the design, not a bug. I'm not at all confident the
test coverage here is sufficient right now: we lack a good way to make
postgres delay its various lazy internal activity so we can reliably
examine intermediate states in a race-free way, so I'm not sure tests
would catch such a regression.

> > Really, we should add a "no_advance_position" option to the SQL
> > interface, then expect the client to call a second function that
> > explicitly advances the restart_lsn and confirmed_flush_lsn. Otherwise
> > no SQL interface client can be crashsafe.
>
> Hm. Could you elaborate more what you mean here? I am not sure to
> understand. Note that calling the advance function multiple times has
> no effects, and the result returned is the actual restart_lsn (or
> confirmed_flush_lsn of course).

I've probably confused things a bit here. I don't mind whether or not
pg_replication_slot_advance() forces an immediate flush; I think
there are reasonable arguments in both directions.

In the above I was talking about how pg_logical_slot_get_changes()
presently advances the slot position immediately, so if the client
loses its connection before reading and flushing all the data it may
be unable to recover. And while pg_logical_slot_peek_changes() lets
the app read the data w/o advancing the slot, it then has to do a
separate pg_replication_slot_advance(), which has to do the decoding
work again. I'd like to improve that, but I didn't intend to confuse
or sidetrack this discussion in the process. Sorry.
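
To sketch the pattern I mean (the slot name and LSN are made up; this
just shows the peek-then-confirm shape, not a proposal for a new API):

    -- read the changes without consuming them
    SELECT lsn, xid, data
      FROM pg_logical_slot_peek_changes('some_slot', NULL, NULL);

    -- ... client durably stores what it received ...

    -- then confirm, which has to decode the same range again just to
    -- move the slot's position forward
    SELECT pg_replication_slot_advance('some_slot', '0/16B3748');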

We don't have a SQL-level interface for reading physical WAL so there
are no corresponding concerns there.

--
Craig Ringer http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise
