Re: logical replication - still unstable after all these months

From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Erik Rijkers <er(at)xs4all(dot)nl>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Subject: Re: logical replication - still unstable after all these months
Date: 2017-06-05 02:01:51
Message-ID: ba1ffd89-7046-cfdf-5b08-f6f0834a0825@catalyst.net.nz
Lists: pgsql-hackers

On 05/06/17 13:08, Mark Kirkwood wrote:
> On 05/06/17 00:04, Erik Rijkers wrote:
>
>> On 2017-05-31 16:20, Erik Rijkers wrote:
>>> On 2017-05-31 11:16, Petr Jelinek wrote:
>>> [...]
>>>> Thanks to Mark's offer I was able to study the issue as it happened
>>>> and found the cause of this.
>>>>
>>>> [0001-Improve-handover-logic-between-sync-and-apply-worker.patch]
>>>
>>> This looks good:
>>>
>>> -- out_20170531_1141.txt
>>> 100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n -- scale 25
>>> 100 -- All is well.
>>>
>>> So this is 100x a 1-minute test with 100x success. (This on the most
>>> fastidious machine (slow disks, meagre specs) that used to give 15%
>>> failures)
>>
>> [Improve-handover-logic-between-sync-and-apply-worker-v2.patch]
>>
>> No errors after (several days of) running variants of this. (2500x 1
>> minute runs; 12x 1-hour runs)
>
> Same here, no errors with the v2 patch applied (approx 2 days - all 1
> minute runs)
>

Further, reapplying the v1 patch (with a bit of editing, as I wanted to
apply it to my current master) fails quite quickly, with rows missing from
the history table. I'll put the v2 patch back and resume runs with that,
but I'm cautiously optimistic that the v2 patch solves the issue.

regards

Mark
