Re: Detecting skipped data from logical slots (data silently skipped)

From: Greg Stark <stark(at)mit(dot)edu>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Subject: Re: Detecting skipped data from logical slots (data silently skipped)
Date: 2016-08-03 13:55:50
Message-ID: CAM-w4HM=XYcMCkKJDMQr-O6Kpf7U4yFnt=7Dhj_2kNLaF=GqvA@mail.gmail.com
Lists: pgsql-hackers

I didn't follow all of that but I wonder if it isn't just that when you
restore from backup you should be creating a new slot?

On 3 Aug 2016 14:39, "Craig Ringer" <craig(at)2ndquadrant(dot)com> wrote:

> Hi all
>
> I think we have a bit of a problem with the behaviour specified for
> logical slots, one that makes it hard for an outdated snapshot or
> backup of a logical-slot-using downstream to know it's missing a chunk
> of data that's already been consumed from a slot. That's not great since slots are
> supposed to ensure a continuous, gapless data stream.
>
> If the downstream requests that logical decoding restarts at an LSN older
> than the slot's confirmed_flush_lsn, we silently ignore the client's
> request and start replay at the confirmed_flush_lsn. That's by design and
> fine normally, since we know the gap LSNs contained no transactions of
> interest to the downstream.
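The clamping behaviour described above can be sketched as a toy model (this is an illustration of the described semantics, not the actual walsender code):

```python
# Toy model of logical decoding's start-position handling: a request to
# start below the slot's confirmed_flush_lsn is silently bumped forward
# rather than rejected.

def start_replay_lsn(requested_lsn: int, confirmed_flush_lsn: int) -> int:
    """Return the LSN logical decoding will actually start replay from."""
    # No error is raised; the client's requested position is just clamped.
    return max(requested_lsn, confirmed_flush_lsn)

# A stale downstream asking for LSN 1000 when the slot has already
# confirmed 5000 silently starts at 5000, skipping the gap.
print(start_replay_lsn(1000, 5000))  # 5000
print(start_replay_lsn(6000, 5000))  # 6000
```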
>
> But it's *bad* if the downstream is actually a copy of the original
> downstream that's been started from a restored backup/snapshot. In that
> case the downstream won't know that some other client, probably a newer
> instance of itself, consumed rows it should've seen. It'll merrily
> continue replaying and not know it isn't consistent.
>
> The cause is an optimisation intended to allow the downstream to avoid
> having to do local writes and flushes when the upstream's activity isn't of
> interest to it and doesn't result in replicated rows. When the upstream
> does a bunch of writes to another database or otherwise produces WAL not of
> interest to the downstream we send the downstream keepalive messages that
> include the upstream's current xlog position and the client replies to
> acknowledge it's seen the new LSN. But, so that we can avoid disk flushes
> on the downstream, we permit it to skip advancing its replication origin in
> response to those keepalives. We continue to advance the
> confirmed_flush_lsn and restart_lsn in the replication slot on the upstream
> so we can free WAL that's not needed and move the catalog_xmin up. The
> replication origin on the downstream falls behind the confirmed_flush_lsn
> on the upstream.
>
> This means that if the downstream exits/crashes before receiving some new
> row, its replication origin will tell it that it last replayed some LSN
> older than what it really did, and older than what the server retains.
> Logical decoding doesn't allow the client to "jump backwards" and replay
> anything older than the confirmed_flush_lsn. Since we "know" that the gap cannot
> contain rows of interest, otherwise we'd have updated the replication
> origin, we just skip and start replay at the confirmed_flush_lsn.
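The interaction above can be sketched as a toy model (the class and method names here are assumptions made for illustration, not PostgreSQL internals):

```python
# Toy model of how the downstream's durably-flushed replication origin
# can fall behind the slot's confirmed_flush_lsn on the upstream.

class Slot:
    def __init__(self):
        self.confirmed_flush_lsn = 0  # advanced by upstream on client ack

class Downstream:
    def __init__(self):
        self.origin_lsn = 0  # replay position durably flushed to disk

    def ack_keepalive(self, slot, lsn):
        # Reply so the upstream can advance the slot and free WAL,
        # but skip the local origin flush (the optimisation described above).
        slot.confirmed_flush_lsn = lsn

    def apply_and_flush(self, slot, lsn):
        # A row of interest: apply it, flush the origin, and confirm.
        self.origin_lsn = lsn
        slot.confirmed_flush_lsn = lsn

slot, ds = Slot(), Downstream()
ds.apply_and_flush(slot, 100)  # real replicated row
ds.ack_keepalive(slot, 200)    # uninteresting upstream WAL acknowledged
# After a crash or restore, the origin claims 100 but the slot says 200;
# decoding restarts at 200, and the gap is indistinguishable from
# "nothing of interest happened" versus "another client consumed data".
print(ds.origin_lsn, slot.confirmed_flush_lsn)  # 100 200
```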
>
> That means that if the downstream is restored from a backup it has no way
> of knowing it can't rejoin and become consistent because it can't tell the
> difference between "everything's fine, replication origin intentionally
> behind confirmed_flush_lsn due to activity not of interest" and "we've
> missed data consumed from this slot by some other peer and should refuse to
> continue replay".
>
> The simplest fix would be to require downstreams to flush their
> replication origin when they get a keepalive message, before they
> send a reply confirming the new LSN. That could hurt performance, but
> the cost can be alleviated by waiting for the downstream postgres to
> get around to doing a flush anyway and only forcing one if we're
> getting close to the walsender timeout. That's pretty much what BDR and
> pglogical do when applying transactions to avoid having to do a disk flush
> for each and every applied xact. Then we'd change START_REPLICATION ...
> LOGICAL so it ERRORs if you ask for a too-old LSN rather than silently
> ignoring it.
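The proposed ERROR-on-too-old-LSN behaviour could look roughly like this (a sketch in Python rather than the walsender's C, with assumed names):

```python
# Sketch of the proposed behaviour change: START_REPLICATION ... LOGICAL
# would ERROR on a start LSN older than confirmed_flush_lsn instead of
# silently clamping it forward.

def start_replay_lsn_strict(requested_lsn: int, confirmed_flush_lsn: int) -> int:
    if requested_lsn < confirmed_flush_lsn:
        raise ValueError(
            "requested start LSN %d is older than confirmed_flush_lsn %d; "
            "data has already been consumed from this slot"
            % (requested_lsn, confirmed_flush_lsn))
    return requested_lsn
```

With the origin-flush-before-confirm rule in place, a healthy downstream always requests exactly the slot's confirmed_flush_lsn, so this check would only fire for a genuinely stale copy restored from a backup or snapshot.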
>
> This problem can also bite you if you restore a copy of a downstream (say,
> to look at since-deleted data) while the original happens to be
> disconnected for some reason. The copy connects to the upstream and
> consumes some data from the slot. Then when the original comes back on line
> it has no idea there's a gap in its change stream.
>
> This came up when investigating issues with people using snapshot-based
> BDR and pglogical backup/restore. It's a real-world problem that can result
> in silent data inconsistency.
>
> Thoughts on the proposed fix? Any ideas for lower-impact fixes that'd
> still allow a downstream to find out if it's missed data?
>
> --
> Craig Ringer http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>
