Re: Logical replication existing data copy

From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-29 08:14:28
Message-ID: 03a1305e84a17ad729f4075fcf3f4767@xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2017-03-09 11:06, Erik Rijkers wrote:
>>>
>>> I use three different machines (2 desktop, 1 server) to test logical
>>> replication, and all three have now at least once failed to correctly
>>> synchronise a pgbench session (amidst many succesful runs, of course)
>>

(At the moment using tese patches for tests:)

> 0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+
> 0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch +
> 0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch +
> 0005-Skip-unnecessary-snapshot-builds.patch +

The failed tests that I kept seeing (see the
pgbench-over-logical-replication tests upthread) were never really
'solved'.

But I have now finally figured out what caused these unexpected failed
tests: it was wal_sender_timeout or rather, its default of 60 s.

This caused 'terminating walsender process due to replication timeout'
on the primary (not strictly an error), and the concomittant ERROR on
the replica: 'could not receive data from WAL stream: server closed the
connection unexpectedly'.

here is a typical example (primary/replica logs time-intertwined, with
'primary'):

[...]
2017-03-24 16:21:38.129 CET [15002] primary LOG: using stale
statistics instead of current ones because stats collector is not
responding
2017-03-24 16:21:42.690 CET [27515] primary LOG: using stale
statistics instead of current ones because stats collector is not
responding
2017-03-24 16:21:42.965 CET [14999] replica LOG: using stale
statistics instead of current ones because stats collector is not
responding
2017-03-24 16:21:49.816 CET [14930] primary LOG: terminating
walsender process due to
2017-03-24 16:21:49.817 CET [14926] replica ERROR: could not
receive data from WAL stream: server closed the connection unexpectedly
2017-03-24 16:21:49.824 CET [27502] replica LOG: worker process:
logical replication worker for subscription 24864 (PID 14926) exited
with exit code 1
2017-03-24 16:21:49.824 CET [27521] replica LOG: starting logical
replication worker for subscription "sub1"
2017-03-24 16:21:49.828 CET [15008] replica LOG: logical
replication apply for subscription sub1 started
2017-03-24 16:21:49.832 CET [15009] primary LOG: received
replication command: IDENTIFY_SYSTEM
2017-03-24 16:21:49.832 CET [15009] primary LOG: received
replication command: START_REPLICATION SLOT "sub1" LOGICAL 3/FC976440
(proto_version '1', publication_names '"pub1"')
2017-03-24 16:21:49.833 CET [15009] primary DETAIL: streaming
transactions committing after 3/FC889810, reading WAL from 3/FC820FC0
2017-03-24 16:21:49.833 CET [15009] primary LOG: starting logical
decoding for slot "sub1"
2017-03-24 16:21:50.471 CET [15009] primary DETAIL: Logical
decoding will begin using saved snapshot.
2017-03-24 16:21:50.471 CET [15009] primary LOG: logical decoding
found consistent point at 3/FC820FC0
2017-03-24 16:21:51.169 CET [15008] replica DETAIL: Key
(hid)=(9014) already exists.
2017-03-24 16:21:51.169 CET [15008] replica ERROR: duplicate key
value violates unique constraint "pgbench_history_pkey"
2017-03-24 16:21:51.170 CET [27502] replica LOG: worker process:
logical replication worker for subscription 24864 (PID 15008) exited
with exit code 1
2017-03-24 16:21:51.170 CET [27521] replica LOG: starting logical
replication worker for subscription "sub1"
[...]

My primary and replica were always on a single machine (making it more
likely that that timeout is reached?). In my testing it seems that
reaching the timeout on the primary (and 'closing the connection
unexpectedly' on the replica) does not necessarily break the logical
replication. But almost all log-rep failures that I have seen were
started by this sequence of events.

After setting wal_sender_timeout to 3 minutes there were no more
failed tests.

Perhaps it warrants setting wal_sender_timeout a bit higher than the
current default of 60 seconds? After all I also saw the 'replication
timeout' / 'closed the connection' couple rather often during
not-failing tests. (These also disappeared, almost completely, with a
higher setting of wal_sender_timeout)

In any case it would be good to mention the setting (and its potentially
deteriorating effect) somehere nearer the logical replication treatment.

( I read about wal_sender_timeout and keepalive ping, perhaps there's
(still) something amiss there? Just a guess, I don't know )

As I said, I saw no more failures with the higher 3 minute setting, with
one exception: the one test that straddled the DST change (saterday 24
march 02:00 h). I am happy to discount that one failure but strictly
speaking I suppose it should be able to take DST into its stride.

Thanks,

Erik Rijkers

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2017-03-29 08:21:26 Re: Partitioned tables and relfilenode
Previous Message Alexander Law 2017-03-29 08:11:01 [PATCH] Remove trailing spaces