Re: Logical decoding restart problems

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: konstantin knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Petr Jelinek <petr(at)2ndquadrant(dot)com>
Subject: Re: Logical decoding restart problems
Date: 2016-08-20 06:24:52
Message-ID: CAMsr+YFVyF1Bq=VDvKoofQVeFZVh-XdZiJE7n1km=e-ggwSfDQ@mail.gmail.com
Lists: pgsql-hackers

On 19 August 2016 at 15:34, konstantin knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
wrote:

> Hi,
>
> We are using logical decoding in multimaster and we are faced with the
> problem that inconsistent transactions are sent to the replica.
> Briefly, multimaster uses logical decoding in this way:
> 1. Each multimaster node is connected to every other node using a logical
> decoding channel, so each pair of nodes
> has its own replication slot.
>

Makes sense. Same as BDR.

> 2. In the normal scenario each replication channel is used to replicate only
> those transactions which originated at the source node.
> We use the origin mechanism to skip "foreign" transactions.
>

Again, makes sense. Same as BDR.

> 3. When an offline cluster node is returned to the multimaster, we need
> to recover this node to the current cluster state.
> Recovery is performed from one of the cluster's nodes. So we use only
> one replication channel to receive all (self and foreign) transactions.
>

I'm planning on doing this for BDR soon, for the case where we lose a node
unrecoverably, and it's what we already do during node join for the same
reasons you're doing it. Glad to hear you're doing something similar.

> Only in this case can we guarantee a consistent order of applying
> transactions at the recovered node.
> After the end of recovery we need to recreate replication slots with all
> other cluster nodes (because we have already replayed transactions from
> these nodes).
>

No, you don't need to recreate them. Just advance your replication
identifier downstream and request a replay position in the future. Let the
existing slot skip over unwanted data and resume where you want to start
replay.

You can advance the replication origins on the peers as you replay
forwarded xacts from your master.
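
As a sketch (the origin name and LSN below are placeholders, not anything
from your setup), advancing an origin on the node that has already applied
the forwarded transactions looks like:

    -- Mark everything up to this LSN as already applied from that origin,
    -- so it is not requested or applied again.
    SELECT pg_replication_origin_advance('mm_origin_node2', '0/5A000000');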

Have a look at how the BDR code does this during "catchup mode" replay.

So while the problem you discuss below seems concerning, you don't have to
drop and recreate slots like you are currently doing.

> To restart logical decoding we first drop the existing slot, then create a new one
> and then start logical replication from the WAL position 0/0 (invalid LSN).
> In this case recovery should be started from the last consistent point.
>

How do you create the new slot? SQL interface? walsender interface? Direct
C calls?
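
For reference, the two usual routes (slot and plugin names here are just
placeholders):

    -- SQL interface
    SELECT * FROM
      pg_create_logical_replication_slot('mm_slot', 'mm_output_plugin');

    -- walsender protocol, on a replication connection
    CREATE_REPLICATION_SLOT "mm_slot" LOGICAL "mm_output_plugin";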

> The problem is that for some reason the consistent point is not so consistent
> and we get partly decoded transactions.
> I.e. the transaction body consists of two UPDATEs, but the reorder buffer
> extracts only one (the last) update and sends this truncated transaction to
> the destination, causing a consistency violation at the replica. I started
> investigating the logical decoding code and found several things which I do
> not understand.
>

Yeah, that sounds concerning and shouldn't happen.

> Assume that we have transactions T1={start_lsn=100, end_lsn=400} and
> T2={start_lsn=200, end_lsn=300}.
> Transaction T2 is sent to the replica and the replica confirms that
> flush_lsn=300.
> If we now want to restart logical decoding, we cannot start at a position
> less than 300, because CreateDecodingContext doesn't allow it:
>
>
Right. You've already confirmed receipt of T2, so you can't receive it
again.

> So does it mean that we have no chance to restore T1?
>

Wrong. You can, because the slot's restart_lsn will still be some LSN <=
100. The slot keeps track of in-progress transactions (using
xl_running_xacts records) and knows it can't discard the WAL from lsn 100
onwards because xact T1 is still in progress, so it must be able to decode
from the start of it.

When you create a decoding context, decoding starts at restart_lsn, not at
confirmed_flush_lsn. confirmed_flush_lsn is the limit at which commits
start resulting in decoded data being sent to you. So in your case T1
commits at lsn=400, which is > 300, so you'll receive the whole xact for T1.
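
You can check both positions for your slot, e.g. (the slot name is a
placeholder, and this assumes a server version that exposes
confirmed_flush_lsn in the view):

    SELECT slot_name, restart_lsn, confirmed_flush_lsn
      FROM pg_replication_slots
     WHERE slot_name = 'mm_slot';

restart_lsn is where decoding physically restarts reading WAL;
confirmed_flush_lsn is the cutoff below which committed xacts are not sent
again.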

> What is worse, if there are valid T2 transaction records with lsn >= 300,
> then we can partly decode T1 and send this T1' to the replica.
> Am I missing something here?
>

That shouldn't be possible. Have you seen this in the wild?

If so can you boil it down to a test case separate to your whole MM
framework?

> Is there any alternative way to "seek" the slot to the proper position
> without actually fetching data from it or recreating the slot?
>

Yes, send replication feedback, or just request a future LSN at
START_REPLICATION time.
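
E.g. on the walsender connection (the slot name and LSN are placeholders):

    START_REPLICATION SLOT "mm_slot" LOGICAL 0/6F000000

Decoding still begins internally at restart_lsn, but transactions committing
before the requested position (or before confirmed_flush_lsn, whichever is
greater) are simply not sent to you.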

You cannot rewind a slot; there's no way to seek backwards. At all. The
only backwards movement is that done automatically and internally within
the slot in the form of decoding restarting at restart_lsn at reconnect
time.

> Is there any mechanism in xlog which can enforce consistent decoding of a
> transaction (so that no transaction records are missed)?
>

It's all already there. See logical decoding's use of xl_running_xacts.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
