Re: snapbuild woes

From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>
Subject: Re: snapbuild woes
Date: 2017-05-08 22:03:12
Message-ID: 076a3e4c3d4fef091522d3cd7e50c7eb@xs4all.nl
Lists: pgsql-hackers

On 2017-05-05 02:00, Andres Freund wrote:
>
> Could you have a look?

Running tests with these three patches:

> 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
> 0002-WIP-Possibly-more-robust-snapbuild-approach.patch +
> fix-statistics-reporting-in-logical-replication-work.patch
(on top of 44c528810)

I test by 15-minute pgbench runs while there is a logical replication
connection. Primary and replica are on the same machine.

I have seen errors on 3 different machines (where 'error' means: at least
1 of the 4 pgbench tables is not md5-equal between primary and replica).
Better, faster machines seem to yield fewer errors.
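For what it's worth, the md5-equality check could be sketched roughly like this (an illustration only, assuming master on port 6972 and replica on 6973; the exact script I use differs):

```shell
tables="pgbench_accounts pgbench_branches pgbench_tellers pgbench_history"

# Deterministic per-table checksum: COPY the rows out in primary-key order
# and hash the text (psql flags: -X no psqlrc, -A unaligned, -t tuples only).
table_md5() {  # usage: table_md5 <port> <table>
  psql -p "$1" -XAt -c "COPY (SELECT * FROM $2 ORDER BY 1) TO STDOUT" \
    | md5sum | cut -d' ' -f1
}

# Compare one table's checksums from the two servers; prints ok/MISMATCH.
check_table() {  # usage: check_table <table> <md5_master> <md5_replica>
  if [ "$2" = "$3" ]; then echo "$1 ok"; else echo "$1 MISMATCH"; fi
}

# Driver (commented out; needs both servers running):
# for t in $tables; do
#   check_table "$t" "$(table_md5 6972 "$t")" "$(table_md5 6973 "$t")"
# done
```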

Normally I see in pg_stat_replication (on master) one process in state
'streaming'.

  pid  |     wal     | replay_loc  |   diff   |   state   |   app   | sync_state
 16495 | 11/EDBC0000 | 11/EA3FEEE8 | 58462488 | streaming | derail2 | async

Often there are another two processes in pg_stat_replication that remain
in state 'startup'.

In the failing sessions the 'streaming'-state process is missing; only
the two processes that are, and remain, in 'startup' are present.
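The columns shown could come from a query along these lines (a sketch, not my exact script; the *_lsn column names are from PostgreSQL 10 sources, earlier releases spell them *_location, and 'diff' is just the byte lag from pg_lsn subtraction):

```shell
# Assumed monitoring query behind the pg_stat_replication output shown.
MON_SQL='
SELECT pid, sent_lsn, replay_lsn,
       sent_lsn - replay_lsn AS diff,
       state, application_name AS app, sync_state
FROM pg_stat_replication;'

# Refresh every 2 seconds against the primary (port 6972 assumed):
#   watch -n 2 "psql -p 6972 -Xc \"$MON_SQL\""
echo "$MON_SQL"
```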

FWIW, below is the output of a successful and a failed run:

------------------ successful run:
creating tables...
1590400 of 2500000 tuples (63%) done (elapsed 5.34 s, remaining 3.05 s)
2500000 of 2500000 tuples (100%) done (elapsed 9.63 s, remaining 0.00 s)
vacuum...
set primary keys...
done.
create publication pub1 for all tables;
create subscription sub1 connection 'port=6972 application_name=derail2'
publication pub1 with (disabled);
alter subscription sub1 enable;
-- pgbench -c 90 -j 8 -T 900 -P 180 -n -- scale 25
progress: 180.0 s, 82.5 tps, lat 1086.845 ms stddev 3211.785
progress: 360.0 s, 25.4 tps, lat 3469.040 ms stddev 6297.440
progress: 540.0 s, 28.9 tps, lat 3131.438 ms stddev 4288.130
progress: 720.0 s, 27.5 tps, lat 3285.024 ms stddev 4113.841
progress: 900.0 s, 47.2 tps, lat 1896.698 ms stddev 2182.695
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 25
query mode: simple
number of clients: 90
number of threads: 8
duration: 900 s
number of transactions actually processed: 38175
latency average = 2128.606 ms
latency stddev = 3948.634 ms
tps = 42.151205 (including connections establishing)
tps = 42.151589 (excluding connections establishing)
-- waiting 0s... (always)
 port | pg_stat_replication | pid  |     wal     | replay_loc  | diff | ?column? |   state   |   app   | sync_state
 6972 | pg_stat_replication | 2545 | 18/432B2180 | 18/432B2180 |    0 | t        | streaming | derail2 | async

2017.05.08 23:19:22
-- getting md5 (cb)
6972 a,b,t,h: 2500000 25 250 38175 b2ba48b53 b3788a837 d1afac950 d4abcc72e master
6973 a,b,t,h: 2500000 25 250 38175 b2ba48b53 b3788a837 d1afac950 d4abcc72e replica ok bee2312c7
2017.05.08 23:20:48

 port | pg_stat_replication | pid  |     wal     | replay_loc  |   diff   | ?column? |   state   |   app   | sync_state
 6972 | pg_stat_replication | 2545 | 18/4AEEC8C0 | 18/453FBD20 | 95357856 | f        | streaming | derail2 | async
------------------------------------

------------------ failure:
creating tables...
1777100 of 2500000 tuples (71%) done (elapsed 5.06 s, remaining 2.06 s)
2500000 of 2500000 tuples (100%) done (elapsed 7.41 s, remaining 0.00 s)
vacuum...
set primary keys...
done.
create publication pub1 for all tables;
create subscription sub1 connection 'port=6972 application_name=derail2'
publication pub1 with (disabled);
alter subscription sub1 enable;
 port | pg_stat_replication |  pid  |     wal     | replay_loc | diff | ?column? |  state  |   app   | sync_state
 6972 | pg_stat_replication | 11945 | 18/5E2913D0 |            |      |          | catchup | derail2 | async

-- pgbench -c 90 -j 8 -T 900 -P 180 -n -- scale 25
progress: 180.0 s, 78.4 tps, lat 1138.348 ms stddev 2884.815
progress: 360.0 s, 69.2 tps, lat 1309.716 ms stddev 2594.231
progress: 540.0 s, 59.0 tps, lat 1519.146 ms stddev 2033.400
progress: 720.0 s, 62.9 tps, lat 1421.854 ms stddev 1775.066
progress: 900.0 s, 57.0 tps, lat 1575.693 ms stddev 1681.800
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 25
query mode: simple
number of clients: 90
number of threads: 8
duration: 900 s
number of transactions actually processed: 58846
latency average = 1378.259 ms
latency stddev = 2304.159 ms
tps = 65.224168 (including connections establishing)
tps = 65.224788 (excluding connections establishing)
-- waiting 0s... (always)
 port | pg_stat_replication |  pid  |     wal     | replay_loc | diff | ?column? |  state  |   app   | sync_state
 6972 | pg_stat_replication | 11948 | 18/7469A038 |            |      |          | startup | derail2 | async
 6972 | pg_stat_replication | 12372 | 18/7469A038 |            |      |          | startup | derail2 | async

------------------------------------

During my tests, I keep an eye on pg_stat_replication (refreshing every
2s), and when I see those two 'startup'-lines in pg_stat_replication
without any 'streaming'-line I know the test is going to fail. I
believe this has been true for all failure cases that I've seen (except
the much rarer stuck-DROP-SUBSCRIPTION case, which is mentioned in
another thread).
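That early-failure heuristic amounts to something like this (a sketch; the function name and input format are made up for illustration):

```shell
# A poll is suspect when no walsender has reached 'streaming' state.
# States are passed as a whitespace-separated word list.
looks_failed() {  # usage: looks_failed "<state> <state> ..."
  case " $1 " in
    *" streaming "*) return 1 ;;  # healthy: at least one streaming sender
    *)               return 0 ;;  # only startup/catchup rows: likely failure
  esac
}

# In the live test the states would come from, e.g.:
#   psql -p 6972 -XAt -c "SELECT state FROM pg_stat_replication"
if looks_failed "startup startup"; then
  echo "run will likely fail"     # prints "run will likely fail"
fi
```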

Sorry, I have not been able to get anything clearer or more definitive...

thanks,

Erik Rijkers
