From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: konstantin knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Petr Jelinek <petr(at)2ndquadrant(dot)com>
Subject: Re: Logical decoding restart problems
On 20 August 2016 at 14:56, konstantin knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> wrote:
> Thank you for answers.
> No, you don't need to recreate them. Just advance your replication
> identifier downstream and request a replay position in the future. Let the
> existing slot skip over unwanted data and resume where you want to start.
> You can advance the replication origins on the peers as you replay
> forwarded xacts from your master.
> Have a look at how the BDR code does this during "catchup mode" replay.
> So while your problem discussed below seems concerning, you don't have to
> drop and recreate slots like you are currently doing.
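The origin-advance approach described above can be sketched roughly as follows. This is a hedged illustration, not the actual BDR code; the node name and LSN values are invented:

```python
# Sketch: each downstream remembers, per upstream origin, the commit LSN of
# the last transaction it applied; anything at or below that is skipped.

def parse_lsn(lsn: str) -> int:
    """Turn a PostgreSQL LSN string like '0/16B3748' into a 64-bit int."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

# Per-origin replay progress (conceptually what advancing a replication
# origin records; 'node_a' and the LSN are made up for this example).
progress = {"node_a": parse_lsn("0/16B3748")}

def should_apply(origin: str, commit_lsn: str) -> bool:
    """Apply a forwarded transaction only if it is past the recorded progress."""
    return parse_lsn(commit_lsn) > progress.get(origin, 0)

print(should_apply("node_a", "0/16B3700"))  # already replayed -> False
print(should_apply("node_a", "0/16B3800"))  # new transaction  -> True
```

Advancing the stored progress to the desired horizon makes the existing slot's already-seen transactions no-ops on replay, which is why the slot never needs to be dropped.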
> The only reason for recreation of the slot is that I want to move it to the
> current "horizon" and skip all pending transactions without explicit
> specification of the restart position.
Why not just specify the restart position as the upstream server's xlog
insert position?
Anyway, you _should_ specify the restart position. Otherwise, if there's
concurrent write activity, you might have a gap between when you stop
replaying from your forwarding slot on the recovery node and start
replaying from the other nodes.
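The gap being warned about here can be checked mechanically: convert both LSNs to integers and require that replay from the peers starts no later than where replay from the forwarding slot stopped. A minimal sketch, with invented LSN values:

```python
def parse_lsn(lsn: str) -> int:
    """Convert an LSN like '0/16B3748' to a comparable 64-bit integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def switchover_is_gap_free(stopped_at: str, resume_from: str) -> bool:
    """True if no WAL can be missed between the two replay streams."""
    return parse_lsn(resume_from) <= parse_lsn(stopped_at)

print(switchover_is_gap_free("0/16B3748", "0/16B3700"))  # True: overlap, safe
print(switchover_is_gap_free("0/16B3748", "0/16B4000"))  # False: gap
```

Overlap is safe because replication origins make re-delivered transactions idempotent; a gap, by contrast, silently loses writes.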
Again, really, go read the BDR catchup mode code. Really.
> If I do not drop the slot and just restart replication specifying position
> 0/0 (invalid LSN), then replication will be continued from the current slot
> position in WAL, won't it?
The "current slot position" isn't in WAL. It's stored in the replication
slot in pg_replslot/ . But yes, if you pass 0/0 it'll use the stored
confirmed_flush_lsn from the replication slot.
> So there is no way to specify something like "start replication from the end
> of WAL", as with lseek(0, SEEK_END)?
Correct, but you can fetch the server's xlog insert position separately and
use that as your start point.
I guess I can see it being a little bit useful to be able to say "start
decoding at the first commit after this command". Send a patch, see if
people find it useful.
I still think your whole approach is wrong and you need to use replication
origins or similar to co-ordinate a consistent switchover.
> Slot is created by peer node using standard libpq connection with
> database=replication connection string.
So walsender interface then.
>> The problem is that for some reasons consistent point is not so
>> consistent and we get partly decoded transactions.
>> I.e. the transaction body consists of two UPDATEs, but reorder_buffer extracts
>> only one (the last) update and sends this truncated transaction to the
>> destination, causing a consistency violation at the replica. I started
>> investigating the logical decoding code and found several things which I do
>> not understand.
> Yeah, that sounds concerning and shouldn't happen.
> I looked at replication code more precisely and understand that my first
> concerns were wrong.
> Confirming flush position should not prevent replaying transactions with
> smaller LSNs.
Strictly, confirming the flush position does not prevent replay of transactions
*with changes* at lower LSNs. It does prevent replay of transactions that
*commit* with lower LSNs.
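In other words, the filter applies to the commit record's LSN, not to the LSNs of the individual changes. A toy illustration of that rule (invented data structures, not the reorderbuffer API):

```python
# A transaction whose changes began *below* confirmed_flush_lsn is still
# replayed in full, as long as its commit record lies above it.

def parse_lsn(lsn: str) -> int:
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

confirmed_flush = parse_lsn("0/2000")  # made-up value

xact = {
    "change_lsns": [parse_lsn("0/1F00"), parse_lsn("0/1F80")],  # below flush
    "commit_lsn": parse_lsn("0/2100"),                          # above flush
}

sent = xact["commit_lsn"] > confirmed_flush
print(sent)  # True: the whole transaction, both updates included, is sent
```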
> But unfortunately the problem is really present. Maybe it is caused by
> race conditions (although most logical decoder data is local to the backend).
> This is why I will try to create a reproducing scenario without multimaster.
> Yeah, but unfortunately it happens. Need to understand why...
Yes. I think we need a simple standalone test case. I've never yet seen a
partially decoded transaction like this.
> It's all already there. See logical decoding's use of xl_running_xacts.
> But how is this information persisted?
restart_lsn points to a xl_running_xacts record in WAL. Which is of course
persistent. The restart_lsn is persistent in the replication slot, as is
catalog_xmin and confirmed_flush_lsn.
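That persistent slot state can be pictured as a small record, roughly mirroring the fields the pg_replication_slots view exposes (the values here are invented):

```python
from dataclasses import dataclass

@dataclass
class SlotState:
    """Sketch of what survives a walsender restart, per replication slot."""
    slot_name: str
    restart_lsn: str          # points at an xl_running_xacts record in WAL
    catalog_xmin: int         # oldest catalog rows decoding still needs
    confirmed_flush_lsn: str  # last commit the client has acknowledged

# Hypothetical slot contents, persisted on disk under pg_replslot/
slot = SlotState("my_slot", "0/16B3748", 612, "0/16B3800")
print(slot.confirmed_flush_lsn)
```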
> What will happen if wal_sender is restarted?
That's why the restart_lsn exists. Decoding restarts from the restart_lsn
when you START_REPLICATION on the new walsender. It continues without
sending data to the client until it decodes the first commit >
confirmed_flush_lsn or some greater-than-that LSN that you requested by
passing it to the START_REPLICATION command.
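Putting the two LSNs together, the restart behaviour can be sketched like this (toy data, not walsender code): decoding re-reads WAL from restart_lsn, but nothing is shipped until a commit above confirmed_flush_lsn, or above a higher LSN the client requested, is reached.

```python
def resume_stream(xacts, confirmed_flush_lsn, requested_lsn=0):
    """Yield only the transactions the client should see after a restart.

    `xacts` is an ordered list of (commit_lsn, payload) pairs, as the
    decoder would produce them while re-reading WAL from restart_lsn.
    """
    threshold = max(confirmed_flush_lsn, requested_lsn)
    for commit_lsn, payload in xacts:
        if commit_lsn <= threshold:
            continue  # decoded again, but suppressed: client already has it
        yield payload

wal = [(100, "t1"), (200, "t2"), (300, "t3")]
print(list(resume_stream(wal, confirmed_flush_lsn=200)))                     # ['t3']
print(list(resume_stream(wal, confirmed_flush_lsn=100, requested_lsn=250)))  # ['t3']
```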
The snapshot builder is also involved; see snapbuild.c and the comments there.
I'll wait for a test case or some more detail.
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services