Re: Logical replication existing data copy

From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-02-25 08:40:45
Message-ID: 0a4418aff31920c92c1a446ad20d89f3@xs4all.nl
Lists: pgsql-hackers

On 2017-02-25 00:40, Petr Jelinek wrote:

> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
> 0002-Fix-after-trigger-execution-in-logical-replication.patch
> 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
> snapbuild-v3-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
> snapbuild-v3-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
> snapbuild-v3-0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
> snapbuild-v3-0004-Skip-unnecessary-snapshot-builds.patch
> 0001-Logical-replication-support-for-initial-data-copy-v6.patch

Here are some results. There is improvement although it's not an
unqualified success.

Several repeat runs of pgbench_derail2.sh, each with a different number
of clients, yielded one output file per setting.

Those show that logrep is now pretty stable when there is only 1 client
(pgbench -c 1). But it starts making mistakes with 4, 8, or 16 clients.
I'll just show a grep of the output files; I think it is
self-explanatory:

Output-files (lines counted with grep | sort | uniq -c):

-- out_20170225_0129.txt
250 -- pgbench -c 1 -j 8 -T 10 -P 5 -n
250 -- All is well.

-- out_20170225_0654.txt
25 -- pgbench -c 4 -j 8 -T 10 -P 5 -n
24 -- All is well.
1 -- Not good, but breaking out of wait (waited more than 60s)

-- out_20170225_0711.txt
25 -- pgbench -c 8 -j 8 -T 10 -P 5 -n
23 -- All is well.
2 -- Not good, but breaking out of wait (waited more than 60s)

-- out_20170225_0803.txt
25 -- pgbench -c 16 -j 8 -T 10 -P 5 -n
11 -- All is well.
14 -- Not good, but breaking out of wait (waited more than 60s)

So, that says:
1 client: 250x success, zero fail (250 not a typo, ran this overnight)
4 clients: 24x success, 1 fail
8 clients: 23x success, 2 fail
16 clients: 11x success, 14 fail
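
For anyone wanting to produce the same kind of tally: the per-file counts
above come from a grep | sort | uniq -c pipeline. A rough sketch is below.
The function name tally_outfile and the grep pattern are only for
illustration (they are not taken from pgbench_derail2.sh), and it assumes
the summary lines in each output file start with '-- ' as in the listings
above.

```shell
#!/bin/sh
# Hypothetical helper (not part of pgbench_derail2.sh): count identical
# summary lines in one output file.
# Assumption: summary lines start with "-- pgbench", "-- All is well"
# or "-- Not good", as in the listings above.
tally_outfile() {
    grep -E '^-- (pgbench|All is well|Not good)' "$1" | sort | uniq -c
}

# Usage (file name taken from the listings above):
#   tally_outfile out_20170225_0803.txt
```

The order of the output lines depends on the locale used by sort, so it
need not match the order shown in the listings.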

I want to repeat what I said a few emails back: problems seem to
disappear when a short wait state is introduced (directly after the
'alter subscription sub1 enable' line) to give the logrep machinery time
to 'settle'. It makes one think of a timing error somewhere (now don't
ask me where..).

To show that, here is pgbench_derail2.sh output from a run that waited
10 seconds (INIT_WAIT in the script) as such a 'settle' period; it works
faultlessly (with 16 clients):

-- out_20170225_0852.txt
25 -- pgbench -c 16 -j 8 -T 10 -P 5 -n
25 -- All is well.

QED.
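
The 'settle' period amounts to something like the sketch below. The
function name enable_and_settle is made up for illustration, and the psql
invocation assumes default connection settings for the replica; only the
'alter subscription sub1 enable' statement and the INIT_WAIT name come from
the script as described above.

```shell
#!/bin/sh
# Seconds to let the logrep machinery settle; 10 is the value used in
# the run above.
INIT_WAIT=${INIT_WAIT:-10}

# Hypothetical wrapper (illustration only): enable the subscription,
# then pause before any pgbench load is started.
# Assumption: psql reaches the replica with default connection settings.
enable_and_settle() {
    psql -X -c "alter subscription sub1 enable" || return 1
    sleep "$INIT_WAIT"
}

# Usage:
#   enable_and_settle && pgbench -c 16 -j 8 -T 10 -P 5 -n
```

This only papers over the problem, of course; it doesn't say where the
timing error is.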

(By the way, no hung sessions so far, so that's good)

thanks

Erik Rijkers
