001_rep_changes.pl timeout on jacana/bowerbird

From: Noah Misch <noah(at)leadboat(dot)com>
To: andrew(dot)dunstan(at)2ndquadrant(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: 001_rep_changes.pl timeout on jacana/bowerbird
Date: 2020-04-26 01:27:48
Message-ID: 20200426012748.GA1529639@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Starting 2019-11-17, jacana and bowerbird (different compiler, same machine?)
have failed four times like this:

# poll_query_until timed out executing this query:
# SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# Looks like your test exited with 25 just after 8.
[11:06:11] t/001_rep_changes.pl ..
Dubious, test returned 25 (wstat 6400, 0x1900)
Failed 9/17 subtests

Every such run:
sysname │ snapshot │ branch │ bfurl
───────────┼─────────────────────┼────────┼───────────────────────────────────────────────────────────────────────────────────────────────
bowerbird │ 2019-11-17 15:22:42 │ HEAD │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2019-11-17%2015%3A22%3A42
bowerbird │ 2020-01-10 17:30:49 │ HEAD │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2020-01-10%2017%3A30%3A49
jacana │ 2020-04-05 00:00:27 │ HEAD │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2020-04-05%2000%3A00%3A27
jacana │ 2020-04-16 00:00:27 │ HEAD │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2020-04-16%2000%3A00%3A27

The dates and branches suggest either a v13 regression (hence my concern) or a
regression in the underlying machine. In each affected run, other tests
completed at their normal speed, but 001_rep_changes.pl failed after 397s (a
normal run takes <30s). In the publisher log, the failed run[1] is missing
"replication connection authorized", "IDENTIFY_SYSTEM", etc. Subscriber logs
do not differ; failed and good runs both have 'logical replication apply
worker for subscription "tap_sub" has started'. That suggests a subscriber
stuck in ApplyWorkerMain(), somewhere between that log message and the end of
walrcv_connect. I failed to reproduce this. Andrew, are you interested in
attempting to reproduce it and/or identify the blockage?

Thanks,
nm

==== [1] master log with delay (failed run)
...
2020-04-15 20:56:40.334 EDT [5e97ad48.247c:3] 001_rep_changes.pl LOG: statement: DELETE FROM tab_rep
2020-04-15 20:56:40.334 EDT [5e97ad48.247c:4] 001_rep_changes.pl LOG: disconnection: session time: 0:00:00.015 user=pgrunner database=postgres host=127.0.0.1 port=65317
2020-04-15 20:56:40.381 EDT [5e97ad39.1964:4] LOG: received fast shutdown request
2020-04-15 20:56:40.381 EDT [5e97ad39.1964:5] LOG: aborting any active transactions
2020-04-15 20:56:40.396 EDT [5e97ad39.1964:6] LOG: background worker "logical replication launcher" (PID 9920) exited with exit code 1
2020-04-15 20:56:40.396 EDT [5e97ad39.1a7c:1] LOG: shutting down
2020-04-15 20:56:40.538 EDT [5e97ad47.1ce8:9] tap_sub LOG: disconnection: session time: 0:00:00.541 user=pgrunner database=postgres host=127.0.0.1 port=65313
2020-04-15 20:56:40.569 EDT [5e97ad39.1964:7] LOG: database system is shut down
2020-04-15 20:56:40.780 EDT [5e97ad48.c0:1] LOG: starting PostgreSQL 13devel on x86_64-w64-mingw32, compiled by x86_64-w64-mingw32-gcc.exe (x86_64-win32-sjlj-rev0, Built by MinGW-W64 project) 7.3.0, 64-bit
2020-04-15 20:56:40.780 EDT [5e97ad48.c0:2] LOG: listening on IPv4 address "127.0.0.1", port 58418
2020-04-15 20:56:40.811 EDT [5e97ad48.36c:1] LOG: database system was shut down at 2020-04-15 20:56:40 EDT
2020-04-15 20:56:40.873 EDT [5e97ad48.c0:3] LOG: database system is ready to accept connections
2020-04-15 20:56:41.015 EDT [5e97ad49.1888:1] [unknown] LOG: connection received: host=127.0.0.1 port=65319
2020-04-15 20:56:41.029 EDT [5e97ad49.1888:2] [unknown] LOG: connection authorized: user=pgrunner database=postgres application_name=001_rep_changes.pl
2020-04-15 20:56:41.044 EDT [5e97ad49.1888:3] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
... repeated a total of 1800 times ...
2020-04-15 21:02:58.960 EDT [5e97aec2.21f8:1] [unknown] LOG: connection received: host=127.0.0.1 port=50920
2020-04-15 21:02:58.960 EDT [5e97aec2.21f8:2] [unknown] LOG: connection authorized: user=pgrunner database=postgres application_name=001_rep_changes.pl
2020-04-15 21:02:58.975 EDT [5e97aec2.21f8:3] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
2020-04-15 21:02:58.975 EDT [5e97aec2.21f8:4] 001_rep_changes.pl LOG: disconnection: session time: 0:00:00.015 user=pgrunner database=postgres host=127.0.0.1 port=50920
<test script gives up hope>
2020-04-15 21:02:59.148 EDT [5e97ad48.c0:4] LOG: received immediate shutdown request
2020-04-15 21:02:59.148 EDT [5e97ad48.2148:1] WARNING: terminating connection because of crash of another server process
2020-04-15 21:02:59.148 EDT [5e97ad48.2148:2] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-04-15 21:02:59.148 EDT [5e97ad48.2148:3] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-04-15 21:02:59.164 EDT [5e97ad48.c0:5] LOG: database system is shut down

==== master log without delay (good run)
...
2020-04-15 11:17:24.868 EDT [5e972584.19a0:3] 001_rep_changes.pl LOG: statement: DELETE FROM tab_rep
2020-04-15 11:17:24.868 EDT [5e972584.19a0:4] 001_rep_changes.pl LOG: disconnection: session time: 0:00:00.015 user=pgrunner database=postgres host=127.0.0.1 port=60986
2020-04-15 11:17:24.931 EDT [5e972575.1594:4] LOG: received fast shutdown request
2020-04-15 11:17:24.931 EDT [5e972575.1594:5] LOG: aborting any active transactions
2020-04-15 11:17:24.947 EDT [5e972575.1594:6] LOG: background worker "logical replication launcher" (PID 9988) exited with exit code 1
2020-04-15 11:17:24.947 EDT [5e972575.185c:1] LOG: shutting down
2020-04-15 11:17:25.072 EDT [5e972584.560:9] tap_sub LOG: disconnection: session time: 0:00:00.636 user=pgrunner database=postgres host=127.0.0.1 port=60983
2020-04-15 11:17:25.104 EDT [5e972575.1594:7] LOG: database system is shut down
2020-04-15 11:17:25.312 EDT [5e972585.164:1] LOG: starting PostgreSQL 13devel on x86_64-w64-mingw32, compiled by x86_64-w64-mingw32-gcc.exe (x86_64-win32-sjlj-rev0, Built by MinGW-W64 project) 7.3.0, 64-bit
2020-04-15 11:17:25.312 EDT [5e972585.164:2] LOG: listening on IPv4 address "127.0.0.1", port 60213
2020-04-15 11:17:25.343 EDT [5e972585.c7c:1] LOG: database system was shut down at 2020-04-15 11:17:25 EDT
2020-04-15 11:17:25.390 EDT [5e972585.164:3] LOG: database system is ready to accept connections
2020-04-15 11:17:25.531 EDT [5e972585.d40:1] [unknown] LOG: connection received: host=127.0.0.1 port=60988
2020-04-15 11:17:25.552 EDT [5e972585.d40:2] [unknown] LOG: connection authorized: user=pgrunner database=postgres application_name=001_rep_changes.pl
2020-04-15 11:17:25.567 EDT [5e972585.d40:3] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
2020-04-15 11:17:25.583 EDT [5e972585.d40:4] 001_rep_changes.pl LOG: disconnection: session time: 0:00:00.051 user=pgrunner database=postgres host=127.0.0.1 port=60988
2020-04-15 11:17:25.646 EDT [5e972585.1044:1] [unknown] LOG: connection received: host=127.0.0.1 port=60987
2020-04-15 11:17:25.646 EDT [5e972585.1044:2] [unknown] LOG: replication connection authorized: user=pgrunner application_name=tap_sub
2020-04-15 11:17:25.661 EDT [5e972585.1044:3] tap_sub LOG: received replication command: IDENTIFY_SYSTEM
2020-04-15 11:17:25.661 EDT [5e972585.1044:4] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/15C1DF8 (proto_version '1', publication_names '"tap_pub_ins_only"')
2020-04-15 11:17:25.661 EDT [5e972585.1044:5] tap_sub LOG: starting logical decoding for slot "tap_sub"
2020-04-15 11:17:25.661 EDT [5e972585.1044:6] tap_sub DETAIL: Streaming transactions committing after 0/15A7678, reading WAL from 0/15A7678.
2020-04-15 11:17:25.661 EDT [5e972585.1044:7] tap_sub LOG: logical decoding found consistent point at 0/15A7678
2020-04-15 11:17:25.661 EDT [5e972585.1044:8] tap_sub DETAIL: There are no running transactions.
2020-04-15 11:17:25.787 EDT [5e972585.1878:1] [unknown] LOG: connection received: host=127.0.0.1 port=60993
2020-04-15 11:17:25.803 EDT [5e972585.1878:2] [unknown] LOG: connection authorized: user=pgrunner database=postgres application_name=001_rep_changes.pl
2020-04-15 11:17:25.803 EDT [5e972585.1878:3] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
2020-04-15 11:17:25.819 EDT [5e972585.1878:4] 001_rep_changes.pl LOG: disconnection: session time: 0:00:00.031 user=pgrunner database=postgres host=127.0.0.1 port=60993
2020-04-15 11:17:26.102 EDT [5e972586.fcc:1] [unknown] LOG: connection received: host=127.0.0.1 port=60996
2020-04-15 11:17:26.102 EDT [5e972586.fcc:2] [unknown] LOG: connection authorized: user=pgrunner database=postgres application_name=001_rep_changes.pl
2020-04-15 11:17:26.117 EDT [5e972586.fcc:3] 001_rep_changes.pl LOG: statement: ALTER PUBLICATION tap_pub_ins_only SET (publish = 'insert, delete')
...

Browse pgsql-hackers by date

  From Date Subject
Next Message Jesse Zhang 2020-04-26 04:41:20 Fix compilation failure against LLVM 11
Previous Message Tomas Vondra 2020-04-26 00:31:37 Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays