Detecting skipped data from logical slots (data silently skipped)

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Subject: Detecting skipped data from logical slots (data silently skipped)
Date: 2016-08-03 13:39:22
Message-ID: CAMsr+YF7RmCgbaTaLgCHhdaA89=p9r3UegGFaVdJA1GBM-gB1Q@mail.gmail.com
Lists: pgsql-hackers

Hi all

I think we have a bit of a problem with the behaviour specified for logical
slots, one that makes it hard for an outdated snapshot or backup of a
logical-slot-using downstream to know it's missing a chunk of data
that's been consumed from a slot. That's not great since slots are supposed
to ensure a continuous, gapless data stream.

If the downstream requests that logical decoding restarts at an LSN older
than the slot's confirmed_flush_lsn, we silently ignore the client's
request and start replay at the confirmed_flush_lsn. That's by design and
fine normally, since we know the gap LSNs contained no transactions of
interest to the downstream.
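
As a rough illustration of that rule (a toy model, not the server's
actual code), the start position behaves like a simple clamp. The
function names here are made up for the example; the only real-world
detail assumed is that an LSN is written as two hex numbers, the high
and low 32 bits, separated by a slash.

```python
def parse_lsn(text: str) -> int:
    """Turn a textual LSN such as '0/16B3748' into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def effective_start_lsn(requested: int, confirmed_flush: int) -> int:
    """Toy model: a request older than the slot's confirmed position is
    silently moved forward to it; no error is reported to the client."""
    return max(requested, confirmed_flush)


# Hypothetical positions: the client asks for an LSN behind the slot.
requested = parse_lsn("0/1000")
confirmed = parse_lsn("0/2000")
start = effective_start_lsn(requested, confirmed)
```

Here `start` ends up equal to `confirmed`, and nothing tells the client
that its request was not honoured.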

But it's *bad* if the downstream is actually a copy of the original
downstream that's been started from a restored backup/snapshot. In that
case the downstream won't know that some other client, probably a newer
instance of itself, consumed rows it should've seen. It'll merrily
continue replaying and not know it isn't consistent.

The cause is an optimisation intended to allow the downstream to avoid
having to do local writes and flushes when the upstream's activity isn't of
interest to it and doesn't result in replicated rows. When the upstream
does a bunch of writes to another database or otherwise produces WAL not of
interest to the downstream we send the downstream keepalive messages that
include the upstream's current xlog position and the client replies to
acknowledge it's seen the new LSN. But, so that we can avoid disk flushes
on the downstream, we permit it to skip advancing its replication origin in
response to those keepalives. We continue to advance the
confirmed_flush_lsn and restart_lsn in the replication slot on the upstream
so we can free WAL that's not needed and move the catalog_xmin up. The
replication origin on the downstream falls behind the confirmed_flush_lsn
on the upstream.
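
A minimal sketch of how the two positions drift apart, assuming a
simplified model where LSNs are plain integers. The class and method
names are hypothetical and only mirror the description above.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    """Upstream side: only the confirmed position is modelled."""
    confirmed_flush_lsn: int = 0


@dataclass
class Downstream:
    """Downstream side: origin_lsn stands for the durable replication origin."""
    origin_lsn: int = 0

    def on_row(self, lsn: int) -> int:
        # A replicated row is applied locally, so the origin advances with it.
        self.origin_lsn = lsn
        return lsn  # LSN acknowledged to the upstream

    def on_keepalive(self, lsn: int) -> int:
        # Nothing of interest happened: acknowledge the LSN but skip the
        # local write and flush, leaving the origin where it was.
        return lsn


slot = Slot()
downstream = Downstream()

slot.confirmed_flush_lsn = downstream.on_row(100)        # real row
slot.confirmed_flush_lsn = downstream.on_keepalive(500)  # other-database WAL

gap = slot.confirmed_flush_lsn - downstream.origin_lsn
```

After these two messages the slot sits at 500 while the origin is still
at 100, which is exactly the window discussed below.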

This means that if the downstream exits/crashes before receiving some new
row, its replication origin will tell it that it last replayed some LSN
older than what it really did, and older than what the server retains.
Logical decoding doesn't allow the client to "jump backwards" and replay
anything older than the confirmed_flush_lsn. Since we "know" that the gap
cannot contain rows of interest, otherwise we'd have updated the
replication origin, we just skip and start replay at the
confirmed_flush_lsn.

That means that if the downstream is restored from a backup it has no way
of knowing it can't rejoin and become consistent because it can't tell the
difference between "everything's fine, replication origin intentionally
behind confirmed_flush_lsn due to activity not of interest" and "we've
missed data consumed from this slot by some other peer and should refuse to
continue replay".
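
To make the ambiguity concrete, here is a small sketch under the same
toy assumptions (integer LSNs, WAL modelled as a list of
(lsn, of_interest) pairs, hypothetical function name). Both scenarios
hand the reconnecting downstream exactly the same stream.

```python
def replay(wal, requested_lsn, confirmed_flush_lsn):
    """Return the LSNs of rows the reconnecting client actually receives."""
    start = max(requested_lsn, confirmed_flush_lsn)  # silent skip forward
    return [lsn for lsn, of_interest in wal if of_interest and lsn > start]


# Scenario A: the gap between 100 and 500 held only uninteresting WAL.
wal_benign = [(100, True), (200, False), (300, False), (600, True)]

# Scenario B: the gap held real rows (200, 300) that another peer,
# e.g. a newer instance of the same downstream, already consumed.
wal_lossy = [(100, True), (200, True), (300, True), (600, True)]

# In both cases the origin says 100 and the slot says 500.
seen_benign = replay(wal_benign, requested_lsn=100, confirmed_flush_lsn=500)
seen_lossy = replay(wal_lossy, requested_lsn=100, confirmed_flush_lsn=500)
```

Both lists contain only the row at 600, so from the downstream's point
of view the two situations are indistinguishable.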

The simplest fix would be to require downstreams to flush their replication
origin when they get a keepalive message, before they send a
reply with confirmation. That could be somewhat painful for performance,
but can be alleviated somewhat by waiting for the downstream postgres to
get around to doing a flush anyway and only forcing it if we're getting
close to the walsender timeout. That's pretty much what BDR and pglogical
do when applying transactions to avoid having to do a disk flush for each
and every applied xact. Then we change START_REPLICATION ... LOGICAL so it
ERRORs if you ask for a too-old LSN rather than silently ignoring it.
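
A rough sketch of how the two halves of that proposal could fit
together, again as a toy model rather than a patch. All names are
hypothetical, time is passed in explicitly to keep the example
deterministic, and the "force at half the timeout" threshold and the
60-second timeout are arbitrary values chosen for illustration.

```python
class TooOldLsnError(Exception):
    """Raised when a client asks to start before the confirmed position."""


def start_replication_strict(requested_lsn, confirmed_flush_lsn):
    """Proposed upstream behaviour: refuse instead of silently skipping."""
    if requested_lsn < confirmed_flush_lsn:
        raise TooOldLsnError(
            f"requested {requested_lsn} < confirmed {confirmed_flush_lsn}")
    return requested_lsn


class LazyFlushDownstream:
    """Confirms only LSNs whose origin advance has been made durable."""

    def __init__(self, walsender_timeout, force_fraction=0.5):
        self.force_after = walsender_timeout * force_fraction
        self.origin_lsn = 0         # durable origin position
        self.seen_lsn = 0           # newest LSN seen, not yet durable
        self.last_confirm_at = 0.0  # when the origin last advanced
        self.forced_flushes = 0

    def _flush(self, now):
        self.origin_lsn = self.seen_lsn
        self.last_confirm_at = now

    def on_local_flush(self, now):
        # An unrelated flush happened anyway; the origin rides along for free.
        self._flush(now)

    def on_keepalive(self, lsn, now):
        self.seen_lsn = max(self.seen_lsn, lsn)
        overdue = now - self.last_confirm_at >= self.force_after
        if self.seen_lsn > self.origin_lsn and overdue:
            self.forced_flushes += 1  # pay for a flush only when we must
            self._flush(now)
        return self.origin_lsn        # position confirmed to the upstream


ds = LazyFlushDownstream(walsender_timeout=60.0)
early = ds.on_keepalive(500, now=1.0)   # too soon to force a flush
late = ds.on_keepalive(700, now=40.0)   # past the threshold, flush forced

try:
    start_replication_strict(100, late)  # a stale restored copy reconnects
    rejected = False
except TooOldLsnError:
    rejected = True
```

With these inputs the first keepalive confirms nothing new, the second
forces one flush and confirms 700, and a copy still claiming 100 gets
an error instead of a silent skip.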

This problem can also bite you if you restore a copy of a downstream (say,
to look at since-deleted data) while the original happens to be
disconnected for some reason. The copy connects to the upstream and
consumes some data from the slot. Then when the original comes back online
it has no idea there's a gap in its time stream.

This came up when investigating issues with people using snapshot-based BDR
and pglogical backup/restore. It's a real-world problem that can result in
silent data inconsistency.

Thoughts on the proposed fix? Any ideas for lower-impact fixes that'd still
allow a downstream to find out if it's missed data?

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
