From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Christoph Berg <myon(at)debian(dot)org>, Noah Misch <noah(at)leadboat(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject: Re: A failure in 031_recovery_conflict.pl on Debian/s390x
Date: 2023-10-09 14:00:00
Message-ID: 8dfdb0bb-b913-28f6-0e1d-4904660471fc@gmail.com
Lists: pgsql-hackers
Hi,
13.08.2023 00:00, Andres Freund wrote:
> Hi,
>
> On 2023-08-12 15:50:24 +1200, Thomas Munro wrote:
>> Thanks. I realised that it's easy enough to test that theory about
>> cleanup locks by hacking ConditionalLockBufferForCleanup() to return
>> false randomly. Then the test occasionally fails as described. Seems
>> like we'll need to fix that test, but it's not evidence of a server
>> bug, and my signal handler refactoring patch is in the clear. Thanks
>> for testing it!
> WRT fixing the test: I think just using VACUUM FREEZE ought to do the job?
> After changing all the VACUUMs to VACUUM FREEZEs, 031_recovery_conflict.pl
> passes even after I make ConditionalLockBufferForCleanup() fail 100%.
>
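(For reference, the kind of hack Thomas describes could look roughly like the
sketch below, i.e. an early exit added at the top of
ConditionalLockBufferForCleanup() in src/backend/storage/buffer/bufmgr.c. The
50% failure rate and the use of plain random() are only my assumptions for
illustration, not taken from his actual change.)

bool
ConditionalLockBufferForCleanup(Buffer buffer)
{
    /*
     * Testing hack, illustration only: pretend that some other backend
     * holds a pin on the buffer about half of the time, so that callers
     * must cope with failing to get the cleanup lock.
     */
    if (random() % 2 == 0)
        return false;

    /* ... existing body of the function, unchanged ... */
}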
I happened to encounter another failure of 031_recovery_conflict.pl, on a
rather slow ARMv7 machine:
t/031_recovery_conflict.pl ............ 13/? # poll_query_until timed out executing this query:
#
# SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
#
# expecting this output:
# waiting
# last actual query output:
#
# with stderr:
t/031_recovery_conflict.pl ............ 14/?
# Failed test 'startup deadlock: lock acquisition is waiting'
# at t/031_recovery_conflict.pl line 261.
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 15.
t/031_recovery_conflict.pl ............ Dubious, test returned 255 (wstat 65280, 0xff00)
Failed 1/15 subtests
I've managed to reproduce it with the attached test patch on a regular
x86_64 VM with the CPU slowed down to 2%, so that the modified test runs for
5+ minutes.
...
t/031_recovery_conflict.pl .. 102/? # iteration 31
t/031_recovery_conflict.pl .. 104/? # iteration 32
t/031_recovery_conflict.pl .. 106/?
<... waiting for something ...>
031_recovery_conflict_standby.log contains:
2023-10-09 12:43:14.821 UTC [145519] 031_recovery_conflict.pl LOG: statement: BEGIN;
2023-10-09 12:43:14.821 UTC [145519] 031_recovery_conflict.pl LOG: statement: DECLARE test_recovery_conflict_cursor
CURSOR FOR SELECT a FROM test_recovery_conflict_table1;
2023-10-09 12:43:14.822 UTC [145519] 031_recovery_conflict.pl LOG: statement: FETCH FORWARD FROM
test_recovery_conflict_cursor;
2023-10-09 12:43:14.822 UTC [145519] 031_recovery_conflict.pl LOG: statement: SELECT * FROM test_recovery_conflict_table2;
2023-10-09 12:43:15.039 UTC [145513] LOG: recovery still waiting after 10.217 ms: recovery conflict on buffer pin
2023-10-09 12:43:15.039 UTC [145513] CONTEXT: WAL redo at 0/35ADD38 for Heap2/PRUNE: snapshotConflictHorizon: 0,
nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100, 101], unused: []; blkref #0: rel
1663/16385/16489, blk 0
2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl ERROR: canceling statement due to conflict with recovery
at character 15
2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl DETAIL: User transaction caused buffer deadlock with
recovery.
2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl STATEMENT: SELECT * FROM test_recovery_conflict_table2;
2023-10-09 12:43:15.039 UTC [145513] LOG: recovery finished waiting after 10.382 ms: recovery conflict on buffer pin
2023-10-09 12:43:15.039 UTC [145513] CONTEXT: WAL redo at 0/35ADD38 for Heap2/PRUNE: snapshotConflictHorizon: 0,
nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100, 101], unused: []; blkref #0: rel
1663/16385/16489, blk 0
2023-10-09 12:43:15.134 UTC [145527] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE
locktype = 'relation' AND NOT granted;
2023-10-09 12:43:15.647 UTC [145530] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE
locktype = 'relation' AND NOT granted;
2023-10-09 12:43:15.981 UTC [145532] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE
locktype = 'relation' AND NOT granted;
...
2023-10-09 12:51:24.729 UTC [149175] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE
locktype = 'relation' AND NOT granted;
2023-10-09 12:51:25.969 UTC [145514] FATAL: could not receive data from WAL stream: server closed the connection
unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
Indeed, t_031_recovery_conflict_standby_data/pgdata/pg_wal/
000000010000000000000003 contains:
...
rmgr: Heap len (rec/tot): 59/ 59, tx: 972, lsn: 0/035ADB18, prev 0/035ADAD8, desc: INSERT off: 100,
flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0
rmgr: Heap len (rec/tot): 59/ 59, tx: 972, lsn: 0/035ADB58, prev 0/035ADB18, desc: INSERT off: 101,
flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0
rmgr: Transaction len (rec/tot): 34/ 34, tx: 972, lsn: 0/035ADB98, prev 0/035ADB58, desc: ABORT 2023-10-09
12:43:13.797513 UTC
rmgr: Standby len (rec/tot): 42/ 42, tx: 973, lsn: 0/035ADBC0, prev 0/035ADB98, desc: LOCK xid 973 db
16385 rel 16389
rmgr: Transaction len (rec/tot): 178/ 178, tx: 973, lsn: 0/035ADBF0, prev 0/035ADBC0, desc: PREPARE gid
lock: 2023-10-09 12:43:13.798048 UTC
rmgr: Heap len (rec/tot): 59/ 59, tx: 974, lsn: 0/035ADCA8, prev 0/035ADBF0, desc: INSERT off: 102,
flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0
rmgr: Transaction len (rec/tot): 34/ 34, tx: 974, lsn: 0/035ADCE8, prev 0/035ADCA8, desc: COMMIT
2023-10-09 12:43:13.798434 UTC
rmgr: Transaction len (rec/tot): 34/ 34, tx: 975, lsn: 0/035ADD10, prev 0/035ADCE8, desc: COMMIT
2023-10-09 12:43:13.900046 UTC
rmgr: Heap2 len (rec/tot): 255/ 255, tx: 0, lsn: 0/035ADD38, prev 0/035ADD10, desc: PRUNE
snapshotConflictHorizon: 0, nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100, 101],
unused: [], blkref #0: rel 1663/16385/16489 blk 0
rmgr: Heap2 len (rec/tot): 248/ 248, tx: 0, lsn: 0/035ADE38, prev 0/035ADD38, desc: VACUUM nunused:
100, unused: [2, 3, 4, ... , 99, 100, 101], blkref #0: rel 1663/16385/16489 blk 0
So it looks like "SELECT * FROM test_recovery_conflict_table2" in this
case was interrupted due to conflict with replaying PRUNE, not VACUUM,
and as a result, this query hadn't entered "waiting for lock" state.
(I saw also other failures of the test, but probably they have the same
root cause).
Best regards,
Alexander
Attachment: 031_recovery_conflict-mod.patch (text/x-patch, 1.8 KB)