Re: A failure in 031_recovery_conflict.pl on Debian/s390x

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Christoph Berg <myon(at)debian(dot)org>, Noah Misch <noah(at)leadboat(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject: Re: A failure in 031_recovery_conflict.pl on Debian/s390x
Date: 2023-10-09 14:00:00
Message-ID: 8dfdb0bb-b913-28f6-0e1d-4904660471fc@gmail.com

Hi,

13.08.2023 00:00, Andres Freund wrote:
> Hi,
>
> On 2023-08-12 15:50:24 +1200, Thomas Munro wrote:
>> Thanks. I realised that it's easy enough to test that theory about
>> cleanup locks by hacking ConditionalLockBufferForCleanup() to return
>> false randomly. Then the test occasionally fails as described. Seems
>> like we'll need to fix that test, but it's not evidence of a server
>> bug, and my signal handler refactoring patch is in the clear. Thanks
>> for testing it!
> WRT fixing the test: I think just using VACUUM FREEZE ought to do the job?
> After changing all the VACUUMs to VACUUM FREEZEs, 031_recovery_conflict.pl
> passes even after I make ConditionalLockBufferForCleanup() fail 100%.
>
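
(For reference, a minimal sketch of the kind of hack Thomas describes above;
the placement inside ConditionalLockBufferForCleanup() in
src/backend/storage/buffer/bufmgr.c and the use of random() are my guesses,
not necessarily what was actually tested:)

bool
ConditionalLockBufferForCleanup(Buffer buffer)
{
    /* Test hack: pretend the cleanup lock cannot be acquired ~half the time. */
    if (random() % 2 == 0)
        return false;

    /* ... original function body unchanged ... */
}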

I happened to encounter another failure of 031_recovery_conflict, on a
rather slow ARMv7 machine:
t/031_recovery_conflict.pl ............ 13/? # poll_query_until timed out executing this query:
#
# SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
#
# expecting this output:
# waiting
# last actual query output:
#
# with stderr:
t/031_recovery_conflict.pl ............ 14/?
#   Failed test 'startup deadlock: lock acquisition is waiting'
#   at t/031_recovery_conflict.pl line 261.
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 15.
t/031_recovery_conflict.pl ............ Dubious, test returned 255 (wstat 65280, 0xff00)
Failed 1/15 subtests

I've managed to reproduce it with the attached test patch on a regular
x86_64 VM with the CPU slowed down to 2%, so that the modified test runs for
5+ minutes.
...
t/031_recovery_conflict.pl .. 102/? # iteration 31
t/031_recovery_conflict.pl .. 104/? # iteration 32
t/031_recovery_conflict.pl .. 106/?
<... waiting for something ...>

031_recovery_conflict_standby.log contains:
2023-10-09 12:43:14.821 UTC [145519] 031_recovery_conflict.pl LOG: statement: BEGIN;
2023-10-09 12:43:14.821 UTC [145519] 031_recovery_conflict.pl LOG: statement: DECLARE test_recovery_conflict_cursor
CURSOR FOR SELECT a FROM test_recovery_conflict_table1;
2023-10-09 12:43:14.822 UTC [145519] 031_recovery_conflict.pl LOG: statement: FETCH FORWARD FROM
test_recovery_conflict_cursor;
2023-10-09 12:43:14.822 UTC [145519] 031_recovery_conflict.pl LOG: statement: SELECT * FROM test_recovery_conflict_table2;
2023-10-09 12:43:15.039 UTC [145513] LOG:  recovery still waiting after 10.217 ms: recovery conflict on buffer pin
2023-10-09 12:43:15.039 UTC [145513] CONTEXT:  WAL redo at 0/35ADD38 for Heap2/PRUNE: snapshotConflictHorizon: 0,
nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100, 101], unused: []; blkref #0: rel
1663/16385/16489, blk 0
2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl ERROR:  canceling statement due to conflict with recovery
at character 15
2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl DETAIL:  User transaction caused buffer deadlock with
recovery.
2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl STATEMENT:  SELECT * FROM test_recovery_conflict_table2;
2023-10-09 12:43:15.039 UTC [145513] LOG:  recovery finished waiting after 10.382 ms: recovery conflict on buffer pin
2023-10-09 12:43:15.039 UTC [145513] CONTEXT:  WAL redo at 0/35ADD38 for Heap2/PRUNE: snapshotConflictHorizon: 0,
nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100, 101], unused: []; blkref #0: rel
1663/16385/16489, blk 0
2023-10-09 12:43:15.134 UTC [145527] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE
locktype = 'relation' AND NOT granted;
2023-10-09 12:43:15.647 UTC [145530] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE
locktype = 'relation' AND NOT granted;
2023-10-09 12:43:15.981 UTC [145532] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE
locktype = 'relation' AND NOT granted;
...
2023-10-09 12:51:24.729 UTC [149175] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE
locktype = 'relation' AND NOT granted;
2023-10-09 12:51:25.969 UTC [145514] FATAL:  could not receive data from WAL stream: server closed the connection
unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.

Indeed, t_031_recovery_conflict_standby_data/pgdata/pg_wal/
000000010000000000000003 contains:
...
rmgr: Heap        len (rec/tot):     59/    59, tx:        972, lsn: 0/035ADB18, prev 0/035ADAD8, desc: INSERT off: 100,
flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0
rmgr: Heap        len (rec/tot):     59/    59, tx:        972, lsn: 0/035ADB58, prev 0/035ADB18, desc: INSERT off: 101,
flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0
rmgr: Transaction len (rec/tot):     34/    34, tx:        972, lsn: 0/035ADB98, prev 0/035ADB58, desc: ABORT 2023-10-09
12:43:13.797513 UTC
rmgr: Standby     len (rec/tot):     42/    42, tx:        973, lsn: 0/035ADBC0, prev 0/035ADB98, desc: LOCK xid 973 db
16385 rel 16389
rmgr: Transaction len (rec/tot):    178/   178, tx:        973, lsn: 0/035ADBF0, prev 0/035ADBC0, desc: PREPARE gid
lock: 2023-10-09 12:43:13.798048 UTC
rmgr: Heap        len (rec/tot):     59/    59, tx:        974, lsn: 0/035ADCA8, prev 0/035ADBF0, desc: INSERT off: 102,
flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0
rmgr: Transaction len (rec/tot):     34/    34, tx:        974, lsn: 0/035ADCE8, prev 0/035ADCA8, desc: COMMIT
2023-10-09 12:43:13.798434 UTC
rmgr: Transaction len (rec/tot):     34/    34, tx:        975, lsn: 0/035ADD10, prev 0/035ADCE8, desc: COMMIT
2023-10-09 12:43:13.900046 UTC
rmgr: Heap2       len (rec/tot):    255/   255, tx:          0, lsn: 0/035ADD38, prev 0/035ADD10, desc: PRUNE
snapshotConflictHorizon: 0, nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100, 101],
unused: [], blkref #0: rel 1663/16385/16489 blk 0
rmgr: Heap2       len (rec/tot):    248/   248, tx:          0, lsn: 0/035ADE38, prev 0/035ADD38, desc: VACUUM nunused:
100, unused: [2, 3, 4, ... , 99, 100, 101], blkref #0: rel 1663/16385/16489 blk 0

So it looks like "SELECT * FROM test_recovery_conflict_table2" was in this
case interrupted due to a conflict with the replay of the PRUNE record, not
the VACUUM record, and as a result the query never entered the "waiting for
lock" state.
(I also saw other failures of this test, but they probably have the same
root cause.)

Best regards,
Alexander

Attachment: 031_recovery_conflict-mod.patch (text/x-patch, 1.8 KB)
