Re: Logical decoding restart problems

From: konstantin knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Petr Jelinek <petr(at)2ndquadrant(dot)com>
Subject: Re: Logical decoding restart problems
Date: 2016-08-20 06:56:18
Message-ID: 95622833-54E8-41E9-B468-4EC191E82AE9@postgrespro.ru
Lists: pgsql-hackers

Thank you for the answers.

> No, you don't need to recreate them. Just advance your replication identifier downstream and request a replay position in the future. Let the existing slot skip over unwanted data and resume where you want to start replay.
>
> You can advance the replication origins on the peers as you replay forwarded xacts from your master.
>
> Have a look at how the BDR code does this during "catchup mode" replay.
>
> So while your problem discussed below seems concerning, you don't have to drop and recreate slots like you are currently doing.

The only reason for recreating the slot is that I want to move it to the current "horizon" and skip all pending transactions without explicitly specifying the restart position.
If I do not drop the slot and just restart replication specifying position 0/0 (invalid LSN), then replication will continue from the current slot position in WAL, won't it?
So there is no way to say "start replication from the end of WAL", something like lseek(fd, 0, SEEK_END).
Right now I am trying to overcome this limitation by explicitly calculating the position from which we should continue replication.
But unfortunately the problem with partly decoded transactions persists.
Next week I will try to create an example that reproduces the problem without any multimaster stuff, using just the standard logical decoding plugin.
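
To make the "explicit calculation of the position" concrete, here is a minimal sketch in Python (not taken from the multimaster code). It assumes only the textual LSN form used in this thread, two hexadecimal numbers separated by a slash, where the first is the high 32 bits and the second the low 32 bits. The helper names are hypothetical, and choose_start_lsn only mirrors the behaviour discussed here: 0/0 means "continue from the slot's own position", and a requested position below the confirmed flush position cannot be used.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN in textual 'hi/lo' hex form to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(value: int) -> str:
    """Convert a 64-bit LSN back to the textual 'hi/lo' hex form."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"


def choose_start_lsn(requested: str, confirmed_flush: str) -> str:
    """Hypothetical helper: pick the position replication would resume from.

    Assumption: 0/0 (invalid LSN) means "use the slot's confirmed position",
    and a request older than confirmed_flush is moved forward to it.
    """
    req = parse_lsn(requested)
    flushed = parse_lsn(confirmed_flush)
    if req == 0:
        return format_lsn(flushed)
    return format_lsn(max(req, flushed))
```

With this, choose_start_lsn("0/0", "0/12C") and choose_start_lsn("0/64", "0/12C") both give the confirmed position, while a request ahead of it is kept as is. Skipping to the "end of WAL" would then mean passing the current WAL insert position as the requested value.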

>
> To restart logical decoding we first drop the existing slot, then create a new one and then start logical replication from the WAL position 0/0 (invalid LSN).
> In this case recovery should be started from the last consistent point.
>
> How do you create the new slot? SQL interface? walsender interface? Direct C calls?

The slot is created by the peer node over a standard libpq connection with replication=database in the connection string.

>
> The problem is that for some reason the consistent point is not so consistent and we get partly decoded transactions.
> I.e. the transaction body consists of two UPDATEs but reorder_buffer extracts only one (the last) update and sends this truncated transaction to the destination, causing a consistency violation at the replica. I started investigating the logical decoding code and found several things which I do not understand.
>
> Yeah, that sounds concerning and shouldn't happen.

I looked at the replication code more closely and understood that my first concerns were wrong.
Confirming the flush position should not prevent replaying transactions with smaller LSNs.
But unfortunately the problem really is present. Maybe it is caused by race conditions (although most logical decoder data is local to the backend).
This is why I will try to create a reproducing scenario without multimaster.

>
> Assume that we have transactions T1={start_lsn=100, end_lsn=400} and T2={start_lsn=200, end_lsn=300}.
> Transaction T2 is sent to the replica and replica confirms that flush_lsn=300.
> If now we want to restart logical decoding, we can not start with position less than 300, because CreateDecodingContext doesn't allow it:
>
>
> Right. You've already confirmed receipt of T2, so you can't receive it again.
>
> So it means that we have no chance to restore T1?
>
> Wrong. You can, because the slot's restart_lsn will still be some LSN <= 100. The slot keeps track of in-progress transactions (using xl_running_xacts records) and knows it can't discard WAL past lsn 100 because xact T1 is still in progress, so it must be able to decode from the start of it.
>
> When you create a decoding context decoding starts at restart_lsn not at confirmed_flush_lsn. confirmed_flush_lsn is the limit at which commits start resulting in decoded data being sent to you. So in your case, T1 commits at lsn=400, which is >300, so you'll receive the whole xact for T1.

Yeah, but unfortunately it happens. I need to understand why...
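
For reference, the expected behaviour described in the quoted explanation can be written down as a toy model. This is only a sketch in Python of the rule as stated in this thread (reading starts at restart_lsn, and a transaction is sent only if it commits after confirmed_flush_lsn); it is not the actual reorder buffer logic, and the names Xact and decode are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Xact:
    name: str
    start_lsn: int
    end_lsn: int  # position of the commit


def decode(xacts, restart_lsn, confirmed_flush_lsn):
    """Toy model of which transactions a restarted decoder should send.

    Assumptions (simplified from the discussion): WAL is read starting at
    restart_lsn, and only transactions committing after confirmed_flush_lsn
    are sent, always as a whole.
    """
    sent = []
    # Transactions are delivered in commit order.
    for x in sorted(xacts, key=lambda x: x.end_lsn):
        if x.start_lsn < restart_lsn:
            # The beginning of this transaction was not read. In this toy
            # model it is skipped entirely rather than sent partially.
            continue
        if x.end_lsn > confirmed_flush_lsn:
            sent.append(x.name)
    return sent


t1 = Xact("T1", start_lsn=100, end_lsn=400)
t2 = Xact("T2", start_lsn=200, end_lsn=300)
```

With restart_lsn=100 and confirmed_flush_lsn=300, decode([t1, t2], 100, 300) yields only T1, and T1 is complete because reading began at or before its start. What I observe instead looks as if reading began somewhere inside T1.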

>
> It's all already there. See logical decoding's use of xl_running_xacts.

But how is this information persisted?
What will happen if the walsender is restarted?

>
> --
> Craig Ringer http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
