Test 031_recovery_conflict.pl is not immune to autovacuum

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Test 031_recovery_conflict.pl is not immune to autovacuum
Date: 2024-03-20 13:00:00
Message-ID: a4038b8f-2ca1-1bc6-99e3-d00d99ee22c6@gmail.com
Lists: pgsql-hackers

Hello hackers,

Among the many recoveryCheck (more concretely, 027_stream_regress) failures
that occurred on a number of buildfarm animals after switching to meson,
most of which can be explained by timeouts, I saw a different failure on adder:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-03-18%2023%3A43%3A00
[23:48:52.521](9.831s) ok 13 - startup deadlock: cursor holding conflicting pin, also waiting for lock, established
[23:55:13.749](381.228s) # poll_query_until timed out executing this query:
#
# SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
#
# expecting this output:
# waiting
# last actual query output:
#
# with stderr:
[23:55:13.763](0.013s) not ok 14 - startup deadlock: lock acquisition is waiting

and I suspect that it might be caused by autovacuum.

I've managed to reproduce it locally (running 10 tests in parallel on a
2-core VM with disk bandwidth limited to 80 MB/sec, I got failures on
iterations 10, 1, 3) and observed the following (with wal_debug = on):
031_recovery_conflict_standby.log:
2024-03-20 04:12:06.519 UTC|vagrant|test_db|65fa6214.111ede|LOG: statement: DECLARE test_recovery_conflict_cursor CURSOR
FOR SELECT a FROM test_recovery_conflict_table1;
2024-03-20 04:12:06.520 UTC|vagrant|test_db|65fa6214.111ede|LOG: statement: FETCH FORWARD FROM
test_recovery_conflict_cursor;
2024-03-20 04:12:06.520 UTC|vagrant|test_db|65fa6214.111ede|LOG: statement: SELECT * FROM test_recovery_conflict_table2;
...
2024-03-20 04:12:07.073 UTC|||65fa620d.111ec8|LOG:  REDO @ 0/3438360; LSN 0/3438460: prev 0/3438338; xid 0; len 9;
blkref #0: rel 1663/16385/16392, blk 0 - Heap2/PRUNE: snapshotConflictHorizon: 0, nredirected: 0, ndead: 0,
isCatalogRel: F, nunused: 100, redirected: [], dead: [], unused: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46,
47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101]
2024-03-20 04:12:07.084 UTC|||65fa620d.111ec8|LOG:  recovery still waiting after 11.241 ms: recovery conflict on buffer pin
2024-03-20 04:12:07.084 UTC|||65fa620d.111ec8|CONTEXT:  WAL redo at 0/3438360 for Heap2/PRUNE: snapshotConflictHorizon:
0, nredirected: 0, ndead: 0, isCatalogRel: F, nunused: 100, redirected: [], dead: [], unused: [2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69,
70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99,
100, 101]; blkref #0: rel 1663/16385/16392, blk 0
2024-03-20 04:12:07.095 UTC|vagrant|test_db|65fa6214.111ede|ERROR: canceling statement due to conflict with recovery at
character 15
2024-03-20 04:12:07.095 UTC|vagrant|test_db|65fa6214.111ede|DETAIL: User transaction caused buffer deadlock with recovery.
...
2024-03-20 04:12:08.093 UTC|vagrant|postgres|65fa6216.111f1a|LOG: statement: SELECT 'waiting' FROM pg_locks WHERE
locktype = 'relation' AND NOT granted;

031_recovery_conflict_primary.log:
2024-03-20 04:12:05.980 UTC|||65fa6215.111f02|DEBUG:  Autovacuum VacuumUpdateCosts(db=16385, rel=16392, dobalance=yes,
cost_limit=200, cost_delay=2 active=yes failsafe=no)
2024-03-20 04:12:05.980 UTC|||65fa6215.111f02|DEBUG:  Autovacuum VacuumUpdateCosts(db=16385, rel=16392, dobalance=yes,
cost_limit=200, cost_delay=2 active=yes failsafe=no)
2024-03-20 04:12:05.980 UTC|||65fa6215.111f02|LOG:  INSERT @ 0/3438460:  - Heap2/PRUNE: snapshotConflictHorizon: 0,
nredirected: 0, ndead: 0, isCatalogRel: F, nunused: 100, redirected: [], dead: [], unused: [2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70,
71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100,
101]
2024-03-20 04:12:05.980 UTC|||65fa6215.111f02|CONTEXT:  while scanning block 0 of relation
"public.test_recovery_conflict_table1"
...
2024-03-20 04:12:05.981 UTC|||65fa6215.111f02|LOG:  automatic vacuum of table
"test_db.public.test_recovery_conflict_table1": index scans: 0

The corresponding fragment of 031_recovery_conflict.pl:
$res = $psql_standby->query_until(
    qr/^1$/m, qq[
    BEGIN;
    -- hold pin
    DECLARE $cursor1 CURSOR FOR SELECT a FROM $table1;
    FETCH FORWARD FROM $cursor1;
    -- wait for lock held by prepared transaction
    SELECT * FROM $table2;
    ]);
ok(1,
    "$sect: cursor holding conflicting pin, also waiting for lock, established"
);

# just to make sure we're waiting for lock already
ok( $node_standby->poll_query_until(
        'postgres', qq[
SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
], 'waiting'),
    "$sect: lock acquisition is waiting");

# VACUUM FREEZE will prune away rows, causing a buffer pin conflict, while
# standby psql is waiting on lock
$node_primary->safe_psql($test_db, qq[VACUUM FREEZE $table1;]);

So if autovacuum happens to process "$table1" before the SELECT ... FROM
pg_locks poll runs, the buffer pin conflict occurs before the manual
VACUUM FREEZE, the standby session waiting for the lock is canceled, and
poll_query_until() fails.
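
Perhaps the test could be made immune by disabling autovacuum for the table
on the primary, so that only the manual VACUUM FREEZE can generate the
PRUNE record. A minimal sketch (untested; it assumes the same
$node_primary/$test_db/$table1 variables as in the fragment above):

# prevent autovacuum from pruning $table1 prematurely; only the later
# manual VACUUM FREEZE should generate the conflicting PRUNE record
$node_primary->safe_psql($test_db,
    qq[ALTER TABLE $table1 SET (autovacuum_enabled = off);]);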

With autovacuum = off in TEMP_CONFIG, 50 iterations passed for me in
the same environment.
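(That is, I ran the test with the TEMP_CONFIG environment variable pointing
to a config file with the line:

autovacuum = off
)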

Best regards,
Alexander
