048_vacuum_horizon_floor.pl hangs due to wakeup lost inside LockBufferForCleanup

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: 048_vacuum_horizon_floor.pl hangs due to wakeup lost inside LockBufferForCleanup
Date: 2026-06-17 20:00:00
Message-ID: 7685519a-0bf9-4e17-93ca-7e3aa10fa29c@gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello hackers,

A recent failure produced by skink [1] apparently materializes an
interesting race condition:
398/398 recovery - postgresql:recovery/048_vacuum_horizon_floor TIMEOUT        15000.51s   killed by signal 15 SIGTERM

regress_log_048_vacuum_horizon_floor contains:
[05:01:16.990](0.000s) ok 3 - Cursor query returned 7 from second fetch. Expected value 7.
[09:08:43.280](14846.290s) # die: death by signal at
/home/bf/bf-build/skink-master/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm line 181.
[09:08:43.281](0.001s) 1..3
death by signal at /home/bf/bf-build/skink-master/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm line 181.

048_vacuum_horizon_floor_primary.log contains:
2026-06-14 05:00:51.700 CEST [655012][client backend][1/9:0] LOG: statement: VACUUM (VERBOSE, FREEZE, PARALLEL 0)
vac_horizon_floor_table;
2026-06-14 05:00:51.887 CEST [655012][client backend][1/10:0] INFO: aggressively vacuuming
"test_db.public.vac_horizon_floor_table"
2026-06-14 05:00:53.494 CEST [669376][client backend][5/2:0] LOG: statement: SELECT count(*) >= 1 FROM pg_stat_activity
            WHERE pid = 655012
            AND wait_event = 'BufferCleanup';
2026-06-14 05:01:07.789 CEST [676703][walsender][26/0:0] LOG: received replication command: IDENTIFY_SYSTEM
2026-06-14 05:01:07.789 CEST [676703][walsender][26/0:0] STATEMENT: IDENTIFY_SYSTEM
2026-06-14 05:01:07.992 CEST [676703][walsender][26/0:0] LOG: received replication command: START_REPLICATION 0/03000000
TIMELINE 1
2026-06-14 05:01:07.992 CEST [676703][walsender][26/0:0] STATEMENT: START_REPLICATION 0/03000000 TIMELINE 1
2026-06-14 05:01:11.486 CEST [678834][client backend][6/2:0] LOG: statement: SELECT EXISTS (SELECT * FROM
pg_stat_replication);
2026-06-14 05:01:15.472 CEST [656147][client backend][2/2:0] LOG: statement: FETCH vac_horizon_floor_cursor1
    ;
2026-06-14 05:01:19.594 CEST [681642][client backend][7/2:0] LOG: statement: SELECT index_vacuum_count > 0
        FROM pg_stat_progress_vacuum
        WHERE datname='test_db' AND relid::regclass = 'vac_horizon_floor_table'::regclass;
2026-06-14 05:01:24.183 CEST [656147][client backend][2/2:0] LOG: statement: COMMIT;
2026-06-14 05:01:26.473 CEST [684424][client backend][8/2:0] LOG: statement: SELECT vacuum_count > 0
        FROM pg_stat_all_tables WHERE relname = 'vac_horizon_floor_table';
...
2026-06-14 09:08:40.354 CEST [3155195][client backend][7/2090:0] LOG:  statement: SELECT vacuum_count > 0
        FROM pg_stat_all_tables WHERE relname = 'vac_horizon_floor_table';
2026-06-14 09:08:42.036 CEST [3156594][client backend][8/2090:0] LOG:  statement: SELECT vacuum_count > 0
        FROM pg_stat_all_tables WHERE relname = 'vac_horizon_floor_table';
2026-06-14 09:08:43.285 CEST [636342][postmaster][:0] LOG:  received immediate shutdown request

The next run of the test [2] took about 2 minutes:
 53/398 recovery - postgresql:recovery/048_vacuum_horizon_floor OK               120.90s   3 subtests passed

There was a similar failure in the past year [3]:
 38/267 postgresql:recovery / recovery/043_vacuum_horizon_floor                   ERROR 335.83s   exit status 29

regress_log_043_vacuum_horizon_floor
[16:59:38.183](0.003s) ok 3 - Cursor query returned 1 from second fetch. Expected value 1.
IPC::Run: timeout on timer #2 at /usr/share/perl5/IPC/Run.pm line 3025.
# Postmaster PID for node "primary" is 2872485

But that time it looked more like an ordinary timeout.

I've managed to reproduce such a failure locally, on a slowed down VM,
under Valgrind:
ITERATION 23
Sun Jun 14 01:02:30 PM UTC 2026
# +++ tap check in src/test/recovery +++
t/048_vacuum_horizon_floor.pl .. ok
All tests successful.
Files=1, Tests=3, 131 wallclock secs ( 0.01 usr  0.00 sys + 46.21 cusr  3.12 csys = 49.34 CPU)
Result: PASS
ITERATION 24
Sun Jun 14 01:04:41 PM UTC 2026
# +++ tap check in src/test/recovery +++
t/048_vacuum_horizon_floor.pl .. 3/? # die: IPC::Run: timeout on timer #2 at /usr/share/perl5/IPC/Run.pm line 3361.
# Looks like your test exited with 29 just after 3.
t/048_vacuum_horizon_floor.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00)

With this diagnostic addition:
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -6743,9 +6743,12 @@ LockBufferForCleanup(Buffer buffer)
                }
                bufHdr->wait_backend_pgprocno = MyProcNumber;
                PinCountWaitBuf = bufHdr;
-               UnlockBufHdrExt(bufHdr, buf_state,
-                                               BM_PIN_COUNT_WAITER, 0,
-                                               0);
+for (volatile int i = 0; i < 10000000; i++);
+               buf_state = UnlockBufHdrExt(bufHdr, buf_state,
+ BM_PIN_COUNT_WAITER, 0,
+                                                               0);
+               if (BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+                       elog(LOG, "!!!LockBufferForCleanup| wakeup lost");
                LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

                /* Wait to be signaled by UnpinBuffer() */
@@ -6794,7 +6797,11 @@ LockBufferForCleanup(Buffer buffer)
                        SetStartupBufferPinWaitBufId(-1);
                }
                else
+{
+elog(LOG, "!!!LockBufferForCleanup]| before ProcWaitForSignal");
ProcWaitForSignal(WAIT_EVENT_BUFFER_CLEANUP);
+elog(LOG, "!!!LockBufferForCleanup]| after ProcWaitForSignal");
+}

                /*
                 * Remove flag marking us as waiter. Normally this will not be set

(The delay before UnlockBufHdrExt() helps reproducing for me --- the test
fails within 10 iterations.)

I could see:
# +++ tap check in src/test/recovery +++
t/048_vacuum_horizon_floor.pl .. 3/? # die: IPC::Run: timeout on timer #2 at /usr/share/perl5/IPC/Run.pm line 3156.
# Looks like your test exited with 29 just after 3.
t/048_vacuum_horizon_floor.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00)
All 3 subtests passed

Test Summary Report
-------------------
t/048_vacuum_horizon_floor.pl (Wstat: 7424 (exited 29) Tests: 3 Failed: 0)
  Non-zero exit status: 29
Files=1, Tests=3, 318 wallclock secs ( 0.01 usr  0.00 sys + 93.12 cusr  4.73 csys = 97.86 CPU)

048_vacuum_horizon_floor_primary.log contains:
2026-06-17 19:00:31.646 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] LOG:  !!!LockBufferForCleanup| wakeup lost
2026-06-17 19:00:31.646 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] CONTEXT:  while scanning block 26 of relation
"public.vac_horizon_floor_table"
2026-06-17 19:00:31.646 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] STATEMENT:  VACUUM (VERBOSE, FREEZE, PARALLEL
0) vac_horizon_floor_table;
2026-06-17 19:00:31.646 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] LOG:  !!!LockBufferForCleanup]| before
ProcWaitForSignal
2026-06-17 19:00:31.646 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] CONTEXT:  while scanning block 26 of relation
"public.vac_horizon_floor_table"
2026-06-17 19:00:31.646 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] STATEMENT:  VACUUM (VERBOSE, FREEZE, PARALLEL
0) vac_horizon_floor_table;
...
2026-06-17 19:04:36.835 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] LOG:  !!!LockBufferForCleanup]| after
ProcWaitForSignal
2026-06-17 19:04:36.835 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] CONTEXT:  while scanning block 26 of relation
"public.vac_horizon_floor_table"
2026-06-17 19:04:36.835 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] STATEMENT:  VACUUM (VERBOSE, FREEZE, PARALLEL
0) vac_horizon_floor_table;

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2026-06-14%2001%3A21%3A38
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2026-06-14%2007%3A22%3A09
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-03-05%2015%3A43%3A53

Best regards,
Alexander

Browse pgsql-hackers by date

  From Date Subject
Next Message Mihail Nikalayeu 2026-06-17 20:10:22 Re: Adding REPACK [concurrently]
Previous Message Baji Shaik 2026-06-17 19:55:01 Re: [PATCH] Doc: document standard_conforming_strings dump/restore incompatibility