Re: Logical replication existing data copy

From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-02-15 23:43:58
Message-ID: 16135dcb-0b52-2347-2173-9fb2cfeef7ad@2ndquadrant.com
Lists: pgsql-hackers

On 13/02/17 14:51, Erik Rijkers wrote:
> On 2017-02-11 11:16, Erik Rijkers wrote:
>> On 2017-02-08 23:25, Petr Jelinek wrote:
>>
>>> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
>>> 0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
>>> 0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
>>> 0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
>>> 0001-Logical-replication-support-for-initial-data-copy-v4.patch
>>
>> This often works but it also fails far too often (in my hands). I
>> test whether the tables are identical by comparing an md5 from an
>> ordered resultset, from both replica and master. I estimate that 1 in
>> 5 tries fails; 'fail' being a somewhat different table on replica
>> (compared to master), most often pgbench_accounts (typically there are
>> 10-30 differing rows). No errors or warnings in either logfile. I'm
>> not sure but I think testing on faster machines seems to be doing
>> somewhat better ('better' being less replication error).
>>
>
> I have noticed that when I insert a few seconds wait-state after the
> create subscription (or actually: the 'enable'ing of the subscription)
> the problem does not occur. Apparently, (I assume) the initial snapshot
> occurs somewhere when the subsequent pgbench-run has already started, so
> that the logical replication also starts somewhere 'into' that
> pgbench-run. Does that make sense?
>
> I don't know what to make of it. Now that I think that I understand
> what happens I hesitate to call it a bug. But I'd say it's still a
> usability problem that the subscription is only 'valid' after some
> time, even if it's only a few seconds.
>

It is a bug. We go to great lengths to create a data snapshot that
corresponds to a specific LSN, so that we can decode exactly the changes
that happened after the snapshot was taken. tablecopy.c also does quite
a lot to synchronize the table handover to the main apply process, so
that the data stream continues correctly from there. The end result is
that concurrent changes are supposed to be okay: replication should
eventually catch up, and the contents should be the same.
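
To make that invariant concrete, here is a toy model in Python. It is
only a sketch and not the patch's code: the Change type, the integer
"LSNs" and both function names are made up for illustration. It shows
that an initial copy taken as of some LSN, followed by every change
after that same LSN, reproduces the publisher's contents, and that any
gap between the two positions can lose rows. It makes no claim about
what actually goes wrong in the reported case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    lsn: int    # position in the (toy) change stream
    key: int    # primary key of the affected row
    value: int  # new value of that row


def publisher_state(changes):
    """Final table contents on the publisher: apply all changes in LSN order."""
    table = {}
    for c in sorted(changes, key=lambda c: c.lsn):
        table[c.key] = c.value
    return table


def subscriber_state(changes, snapshot_lsn, stream_start_lsn):
    """Initial copy as of snapshot_lsn, then apply changes after stream_start_lsn.

    Both parameters are hypothetical knobs for this toy model; in the
    design described above they are meant to be the same position.
    """
    ordered = sorted(changes, key=lambda c: c.lsn)
    table = {}
    # Initial data copy: everything visible in the snapshot.
    for c in ordered:
        if c.lsn <= snapshot_lsn:
            table[c.key] = c.value
    # Streaming phase: everything decoded after the start position.
    for c in ordered:
        if c.lsn > stream_start_lsn:
            table[c.key] = c.value
    return table
```

Calling subscriber_state(changes, s, s) with the same position for both
arguments should equal publisher_state(changes) for any s; passing a
stream_start_lsn larger than snapshot_lsn drops the changes in between.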

That said, I have so far been unable to reproduce this on my test
machine(s), so I have no idea yet what causes it.

Could you periodically dump the contents of pg_subscription_rel on the
subscriber (ideally at the same time as you dump the md5 of the data)
and attach that as well?
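
Something along these lines would do, as a rough sketch. The ports,
database name, table name and all function names are placeholders I
made up; it assumes psql is on the PATH, that both instances accept
local connections, and that no column value contains a newline (true
for the pgbench tables). It takes the md5 over sorted output rows rather
than relying on ORDER BY, which is only an approximation of the
comparison described in the quoted test.

```python
import hashlib
import subprocess


def ordered_md5(rows):
    """md5 over rows sorted into a stable order, so row order cannot affect it."""
    digest = hashlib.md5()
    for row in sorted(rows):
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def psql_command(port, dbname, sql):
    """Build a psql invocation: no psqlrc (-X), unaligned tuples-only output (-At)."""
    return ["psql", "-X", "-At", "-p", str(port), "-d", dbname, "-c", sql]


def run_query(port, dbname, sql):
    """Run one query through psql and return its output lines."""
    result = subprocess.run(
        psql_command(port, dbname, sql),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()


def snapshot_once(pub_port, sub_port, dbname, table):
    """Collect both md5 sums plus pg_subscription_rel from the subscriber."""
    data_sql = f"SELECT * FROM {table}"
    return {
        "publisher_md5": ordered_md5(run_query(pub_port, dbname, data_sql)),
        "subscriber_md5": ordered_md5(run_query(sub_port, dbname, data_sql)),
        "pg_subscription_rel": run_query(
            sub_port, dbname, "SELECT * FROM pg_subscription_rel"
        ),
    }
```

Calling snapshot_once() in a loop with a short sleep, and logging each
result with a timestamp, would give the periodic dump alongside the md5
comparison.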

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
