| From: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
|---|---|
| To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | 048_vacuum_horizon_floor.pl hangs due to wakeup lost inside LockBufferForCleanup |
| Date: | 2026-06-17 20:00:00 |
| Message-ID: | 7685519a-0bf9-4e17-93ca-7e3aa10fa29c@gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hello hackers,
A recent failure produced by skink [1] apparently materializes an
interesting race condition:
398/398 recovery - postgresql:recovery/048_vacuum_horizon_floor TIMEOUT 15000.51s killed by signal 15 SIGTERM
regress_log_048_vacuum_horizon_floor contains:
[05:01:16.990](0.000s) ok 3 - Cursor query returned 7 from second fetch. Expected value 7.
[09:08:43.280](14846.290s) # die: death by signal at
/home/bf/bf-build/skink-master/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm line 181.
[09:08:43.281](0.001s) 1..3
death by signal at /home/bf/bf-build/skink-master/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm line 181.
048_vacuum_horizon_floor_primary.log contains:
2026-06-14 05:00:51.700 CEST [655012][client backend][1/9:0] LOG: statement: VACUUM (VERBOSE, FREEZE, PARALLEL 0)
vac_horizon_floor_table;
2026-06-14 05:00:51.887 CEST [655012][client backend][1/10:0] INFO: aggressively vacuuming
"test_db.public.vac_horizon_floor_table"
2026-06-14 05:00:53.494 CEST [669376][client backend][5/2:0] LOG: statement: SELECT count(*) >= 1 FROM pg_stat_activity
WHERE pid = 655012
AND wait_event = 'BufferCleanup';
2026-06-14 05:01:07.789 CEST [676703][walsender][26/0:0] LOG: received replication command: IDENTIFY_SYSTEM
2026-06-14 05:01:07.789 CEST [676703][walsender][26/0:0] STATEMENT: IDENTIFY_SYSTEM
2026-06-14 05:01:07.992 CEST [676703][walsender][26/0:0] LOG: received replication command: START_REPLICATION 0/03000000
TIMELINE 1
2026-06-14 05:01:07.992 CEST [676703][walsender][26/0:0] STATEMENT: START_REPLICATION 0/03000000 TIMELINE 1
2026-06-14 05:01:11.486 CEST [678834][client backend][6/2:0] LOG: statement: SELECT EXISTS (SELECT * FROM
pg_stat_replication);
2026-06-14 05:01:15.472 CEST [656147][client backend][2/2:0] LOG: statement: FETCH vac_horizon_floor_cursor1
;
2026-06-14 05:01:19.594 CEST [681642][client backend][7/2:0] LOG: statement: SELECT index_vacuum_count > 0
FROM pg_stat_progress_vacuum
WHERE datname='test_db' AND relid::regclass = 'vac_horizon_floor_table'::regclass;
2026-06-14 05:01:24.183 CEST [656147][client backend][2/2:0] LOG: statement: COMMIT;
2026-06-14 05:01:26.473 CEST [684424][client backend][8/2:0] LOG: statement: SELECT vacuum_count > 0
FROM pg_stat_all_tables WHERE relname = 'vac_horizon_floor_table';
...
2026-06-14 09:08:40.354 CEST [3155195][client backend][7/2090:0] LOG: statement: SELECT vacuum_count > 0
FROM pg_stat_all_tables WHERE relname = 'vac_horizon_floor_table';
2026-06-14 09:08:42.036 CEST [3156594][client backend][8/2090:0] LOG: statement: SELECT vacuum_count > 0
FROM pg_stat_all_tables WHERE relname = 'vac_horizon_floor_table';
2026-06-14 09:08:43.285 CEST [636342][postmaster][:0] LOG: received immediate shutdown request
The next run of the test [2] took about 2 minutes:
53/398 recovery - postgresql:recovery/048_vacuum_horizon_floor OK 120.90s 3 subtests passed
There was a similar failure in the past year [3]:
38/267 postgresql:recovery / recovery/043_vacuum_horizon_floor ERROR 335.83s exit status 29
regress_log_043_vacuum_horizon_floor
[16:59:38.183](0.003s) ok 3 - Cursor query returned 1 from second fetch. Expected value 1.
IPC::Run: timeout on timer #2 at /usr/share/perl5/IPC/Run.pm line 3025.
# Postmaster PID for node "primary" is 2872485
But that time it looked more like an ordinary timeout.
I've managed to reproduce such a failure locally, on a slowed down VM,
under Valgrind:
ITERATION 23
Sun Jun 14 01:02:30 PM UTC 2026
# +++ tap check in src/test/recovery +++
t/048_vacuum_horizon_floor.pl .. ok
All tests successful.
Files=1, Tests=3, 131 wallclock secs ( 0.01 usr 0.00 sys + 46.21 cusr 3.12 csys = 49.34 CPU)
Result: PASS
ITERATION 24
Sun Jun 14 01:04:41 PM UTC 2026
# +++ tap check in src/test/recovery +++
t/048_vacuum_horizon_floor.pl .. 3/? # die: IPC::Run: timeout on timer #2 at /usr/share/perl5/IPC/Run.pm line 3361.
# Looks like your test exited with 29 just after 3.
t/048_vacuum_horizon_floor.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00)
With this diagnostic addition:
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -6743,9 +6743,12 @@ LockBufferForCleanup(Buffer buffer)
}
bufHdr->wait_backend_pgprocno = MyProcNumber;
PinCountWaitBuf = bufHdr;
- UnlockBufHdrExt(bufHdr, buf_state,
- BM_PIN_COUNT_WAITER, 0,
- 0);
+for (volatile int i = 0; i < 10000000; i++);
+ buf_state = UnlockBufHdrExt(bufHdr, buf_state,
+ BM_PIN_COUNT_WAITER, 0,
+ 0);
+ if (BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+ elog(LOG, "!!!LockBufferForCleanup| wakeup lost");
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
/* Wait to be signaled by UnpinBuffer() */
@@ -6794,7 +6797,11 @@ LockBufferForCleanup(Buffer buffer)
SetStartupBufferPinWaitBufId(-1);
}
else
+{
+elog(LOG, "!!!LockBufferForCleanup]| before ProcWaitForSignal");
ProcWaitForSignal(WAIT_EVENT_BUFFER_CLEANUP);
+elog(LOG, "!!!LockBufferForCleanup]| after ProcWaitForSignal");
+}
/*
* Remove flag marking us as waiter. Normally this will not be set
(The delay before UnlockBufHdrExt() helps reproducing for me --- the test
fails within 10 iterations.)
I could see:
# +++ tap check in src/test/recovery +++
t/048_vacuum_horizon_floor.pl .. 3/? # die: IPC::Run: timeout on timer #2 at /usr/share/perl5/IPC/Run.pm line 3156.
# Looks like your test exited with 29 just after 3.
t/048_vacuum_horizon_floor.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00)
All 3 subtests passed
Test Summary Report
-------------------
t/048_vacuum_horizon_floor.pl (Wstat: 7424 (exited 29) Tests: 3 Failed: 0)
Non-zero exit status: 29
Files=1, Tests=3, 318 wallclock secs ( 0.01 usr 0.00 sys + 93.12 cusr 4.73 csys = 97.86 CPU)
048_vacuum_horizon_floor_primary.log contains:
2026-06-17 19:00:31.646 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] LOG: !!!LockBufferForCleanup| wakeup lost
2026-06-17 19:00:31.646 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] CONTEXT: while scanning block 26 of relation
"public.vac_horizon_floor_table"
2026-06-17 19:00:31.646 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] STATEMENT: VACUUM (VERBOSE, FREEZE, PARALLEL
0) vac_horizon_floor_table;
2026-06-17 19:00:31.646 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] LOG: !!!LockBufferForCleanup]| before
ProcWaitForSignal
2026-06-17 19:00:31.646 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] CONTEXT: while scanning block 26 of relation
"public.vac_horizon_floor_table"
2026-06-17 19:00:31.646 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] STATEMENT: VACUUM (VERBOSE, FREEZE, PARALLEL
0) vac_horizon_floor_table;
...
2026-06-17 19:04:36.835 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] LOG: !!!LockBufferForCleanup]| after
ProcWaitForSignal
2026-06-17 19:04:36.835 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] CONTEXT: while scanning block 26 of relation
"public.vac_horizon_floor_table"
2026-06-17 19:04:36.835 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] STATEMENT: VACUUM (VERBOSE, FREEZE, PARALLEL
0) vac_horizon_floor_table;
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2026-06-14%2001%3A21%3A38
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2026-06-14%2007%3A22%3A09
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-03-05%2015%3A43%3A53
Best regards,
Alexander
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Mihail Nikalayeu | 2026-06-17 20:10:22 | Re: Adding REPACK [concurrently] |
| Previous Message | Baji Shaik | 2026-06-17 19:55:01 | Re: [PATCH] Doc: document standard_conforming_strings dump/restore incompatibility |