Re: logical replication - still unstable after all these months

From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Subject: Re: logical replication - still unstable after all these months
Date: 2017-05-31 09:16:50
Message-ID: 4b3d5554-e778-6c5b-3c8d-e53e5beb9dd6@2ndquadrant.com
Lists: pgsql-hackers

On 29/05/17 23:06, Mark Kirkwood wrote:
> On 29/05/17 23:14, Petr Jelinek wrote:
>
>> On 29/05/17 03:33, Jeff Janes wrote:
>>
>>> What would you want to look at? Would saving the WAL from the master be
>>> helpful?
>>>
>> Useful info would be: logs from the provider (mainly the logical
>> decoding log lines that mention LSNs), logs from the subscriber (the
>> lines about when sync workers finished), the contents of
>> pg_subscription_rel (with srrelid cast to regclass so we know which
>> table is which), and pg_waldump output around the LSNs found in the logs
>> and in pg_subscription_rel (plus a few lines before and after for
>> context). It's enough to only look at LSNs for the table(s) that are out
>> of sync.
>>
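
For concreteness, the pg_subscription_rel query I have in mind is roughly
the following (column names as in the PostgreSQL 10 catalog; the state
letters in the comment are my reading of pg_subscription_rel.h, so take it
as a sketch):

    -- per-table sync state, with a readable relation name
    SELECT srsubid,
           srrelid::regclass AS relation,
           srsubstate,   -- 'i' init, 'd' datasync, 's' syncdone, 'r' ready
           srsublsn
      FROM pg_subscription_rel
     ORDER BY relation;
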
>
> I have a run that aborted with failure (accounts table md5 mismatch).
> Petr - would you like to have access to the machine? If so, send me your
> public key and I'll set it up.

Thanks to Mark's offer I was able to study the issue as it happened and
found the cause of this.

The busy loop in apply stops at the point when the worker shmem state
indicates that table synchronization has finished, but that might not yet
be visible to the next transaction if flushing the final commit to disk
takes long, so the main apply may skip a couple of transactions for the
given table because it still thinks the table is being synchronized. This
also explains why I could not reproduce it on my testing machine (a fast
SSD disk array where flushes were always fast), and why it happens
relatively rarely: it's one specific commit during the whole
synchronization process that needs to be slow.

So as a solution I changed the busy loop in the apply to wait for the
in-catalog status rather than the in-memory status, to make sure things
are really there and flushed.
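
In SQL terms, what the apply now effectively waits on is the committed
catalog row rather than the shmem flag, i.e. something equivalent to the
check below (again a sketch: the state letters are from
pg_subscription_rel.h, and :subid / :relid stand in for the subscription
and relation OIDs):

    -- the table counts as handed over only once the committed (and
    -- therefore flushed) catalog row says the synchronization finished
    SELECT srsubstate IN ('s', 'r')   -- SYNCDONE or READY
      FROM pg_subscription_rel
     WHERE srsubid = :subid
       AND srrelid = :relid;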

While working on this I realized that the handover itself is a bit more
complex than necessary (especially for debugging and for other people
trying to understand it), so as part of this patch I made some small
changes so that the sequence of states a table goes through during the
synchronization process is always the same. This might cause one
unnecessary catalog update per table synchronization in some cases, but
that seems like a small enough price to pay for clearer logic. It also
fixes another potential bug I identified, where we might write the wrong
state to the catalog if the main apply crashed while the sync worker was
waiting for a status update.

I've been running tests overnight on another machine where I was able to
reproduce the original issue within a few runs (once I found what causes
it), and so far it looks good.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Improve-handover-logic-between-sync-and-apply-worker.patch invalid/octet-stream 11.1 KB
