From: | Shayon Mukherjee <shayonj(at)gmail(dot)com> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Prolonged truncation phase during vacuum on toast table with repeated interruptions by lock waiters and a proposed POC patch |
Date: | 2025-05-06 22:05:42 |
Message-ID: | CANqtF-qDGYhYDcpg3PEeDrXMmuJZJGTAeT0mJx0KrN+kVikZig@mail.gmail.com |
Lists: | pgsql-hackers |
Hello hackers,
I'd like to discuss an issue we observed where the truncation phase of
autovacuum can become significantly prolonged on TOAST tables in busy
systems due to repeated interruptions by lock waiters, leading to extended
`AccessExclusiveLock` contention and impacting overall database and
application availability.
*Recap of the production incident*
We experienced this issue directly where a `VACUUM` on a TOAST table took
approximately 5 hours. The following shows the scale of pages eventually
removed and the duration:
```
automatic vacuum of table "tines.pg_toast.pg_toast_21059": index scans: 1
pages: 26096516 removed, 1421195 remain, 0 skipped due to pins, 26090236
skipped frozen
tuples: 4575083 removed, 26717710 remain, 21498 are dead but not yet
removable, oldest xmin: 173810717
index scan needed: 1151751 pages from table (4.19% of total) had 4629490
dead item identifiers removed
index "pg_toast_21059_index": pages: 303420 in total, 14038 newly deleted,
298668 currently deleted, 298418 reusable
I/O timings: read: 18428758.270 ms, write: 0.000 ms
avg read rate: 10.730 MB/s, avg write rate: 0.000 MB/s
buffer usage: 5132686 hits, 25714763 misses, 0 dirtied
WAL usage: 0 records, 0 full page images, 0 bytes
system usage: CPU: user: 327.79 s, system: 251.11 s, elapsed: 18722.21 s
```
During this 5-hour window, read queries against primary and replica
databases (on AWS Aurora) were frequently failing due to lock timeouts
(lock_timeout was set to 1s), consistent with the `AccessExclusiveLock`
being held and released repeatedly by the truncation part of the autovacuum
process.
This prompted me to dig deeper into the problem and the order of events,
which brings me to:
*Understanding the problem*
My understanding is that the PostgreSQL autovacuum process, specifically
its lazy variant (`lazy_vacuum_heap`), performs a few passes to reclaim
space. One critical phase, performed only if it is deemed beneficial, is
the truncation of empty pages from the end of a relation, handled by the
`lazy_truncate_heap` function.
As I traced through the code, the interaction occurs as follows (a
condensed code sketch of the loop follows this list):
- To physically truncate a relation, `VACUUM` must acquire an
`AccessExclusiveLock`.
- After acquiring the `AccessExclusiveLock`, `VACUUM` performs a
verification step by scanning the relation backwards from its apparent end
(`count_nondeletable_pages`). This is necessary because other transactions
might have added data to these trailing pages.
- While holding the `AccessExclusiveLock` and performing this backward
scan, `VACUUM` periodically checks if any other database sessions are
waiting for locks on the same relation using
`LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock)`. This check
occurs at intervals defined by `VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL` (20ms)
and roughly every 32 blocks scanned.
- Interruption and Retry Mechanism:
- If `LockHasWaitersRelation` returns `true` during the backward scan,
`count_nondeletable_pages` sets a flag (`lock_waiter_detected = true`) and
returns, effectively "suspending" its scan to allow waiting sessions to
proceed.
- Back in `lazy_truncate_heap`, if `lock_waiter_detected` is true,
`VACUUM` releases the `AccessExclusiveLock`.
- `lazy_truncate_heap` contains a `do...while` loop. If
`lock_waiter_detected` was true AND there are still potentially truncatable
pages (`new_rel_pages > vacrel->nonempty_pages`), this loop causes `VACUUM`
to retry the entire truncation sequence: attempt to re-acquire the
`AccessExclusiveLock` (with its own timeout of
`VACUUM_TRUNCATE_LOCK_TIMEOUT`, default 5s), restart the
`count_nondeletable_pages` backward scan, and so on.
- Next comes the "stuck in a loop" problem, and this is where things get
interesting:
- On a very busy system with constant concurrent access, `VACUUM` can
repeatedly acquire the `AccessExclusiveLock`, start the backward scan,
detect waiters, release the lock, and retry. Each retry of
`count_nondeletable_pages` involves re-reading potentially many blocks from
disk while holding the exclusive lock, which aggravates the lock contention.
- This cycle leads to prolonged `VACUUM` operations, significant repeated
I/O, and extended lock contention, reducing database availability and
causing query timeouts for other sessions (e.g., `canceling statement due
to lock timeout`, `LockNotAvailable`). This is particularly acute for the
TOAST table in my case, which sadly happens to be very large (for now).
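For readers who prefer code, here is a heavily condensed paraphrase of that
loop as I read it in `src/backend/access/heap/vacuumlazy.c`. This is not the
verbatim upstream code (error handling, progress reporting, and the
lock-acquisition retry loop are elided or simplified), but it should make
the retry behavior easier to see:
```
/*
 * Condensed paraphrase of lazy_truncate_heap(); many details elided.
 */
do
{
    lock_waiter_detected = false;

    /*
     * The real code retries ConditionalLockRelation() in a short sleep
     * loop for up to VACUUM_TRUNCATE_LOCK_TIMEOUT (5s by default) before
     * giving up on truncation entirely; simplified here.
     */
    if (!ConditionalLockRelation(vacrel->rel, AccessExclusiveLock))
        return;

    /*
     * Backward scan from the apparent end of the relation to find the
     * last non-deletable page.  Inside this scan, roughly every 32 blocks
     * (and at most every VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL, 20ms),
     * LockHasWaitersRelation() is consulted; if another session is queued
     * behind our AccessExclusiveLock, the scan bails out early and sets
     * lock_waiter_detected ("suspending truncate due to conflicting lock
     * request").
     */
    new_rel_pages = count_nondeletable_pages(vacrel, &lock_waiter_detected);

    if (new_rel_pages >= orig_rel_pages)
    {
        /* nothing to truncate after all */
        UnlockRelation(vacrel->rel, AccessExclusiveLock);
        return;
    }

    RelationTruncate(vacrel->rel, new_rel_pages);       /* physical truncation */
    UnlockRelation(vacrel->rel, AccessExclusiveLock);   /* let the waiters in */
    orig_rel_pages = new_rel_pages;

    /*
     * If a waiter interrupted the scan and there still appear to be
     * truncatable pages, go around again: re-acquire the lock, restart
     * the backward scan, and so on.  On a busy table this can repeat for
     * a very long time.
     */
} while (lock_waiter_detected && new_rel_pages > vacrel->nonempty_pages);
```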
*Testing the theory and patch*
To validate this understanding, I was able to reproduce a similar scenario
by loading a test table with a significant amount of TOAST-able data and
then simulating a busy read workload against it concurrently with a
`VACUUM` operation. I have attached a SQL script to load the data and a
Ruby script to simulate the concurrent workload if anyone is curious. The
tests demonstrated how the truncation phase could indeed enter extended
periods of repeatedly attempting to acquire the `AccessExclusiveLock` and
performing the backward scan, only to be interrupted by the simulated
concurrent workload.
Considering the potential for the truncation phase to cause prolonged lock
contention under these conditions, I am wondering if it would be beneficial
to explicitly cap the number of retries for the truncation process within
the `do...while` loop in `lazy_truncate_heap` when interruptions by lock
waiters occur?
Because I was already too deep in the code, I decided to write a small
patch to introduce a limit to the retries and observe the behavior.
Specifically, it adds a counter `truncate_interruption_retry_count` and
checks it against a new constant `VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES`
(which for now I have set to 3). If the backward scan in
`count_nondeletable_pages` is interrupted by lock waiters this many times,
`VACUUM` will cease further truncation attempts for the current `VACUUM`
cycle on that table.
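To make that concrete, here is roughly the shape of the change; the attached
patch is the authoritative version, and this fragment only illustrates where
the cap slots into the loop sketched earlier:
```
/* New cap, alongside the existing VACUUM_TRUNCATE_* constants */
#define VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES    3

    int         truncate_interruption_retry_count = 0;

    do
    {
        /* ... acquire the lock, run count_nondeletable_pages(), truncate ... */

        if (lock_waiter_detected &&
            ++truncate_interruption_retry_count >= VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES)
        {
            ereport(vacrel->verbose ? INFO : DEBUG2,
                    (errmsg("table \"%s\": stopping truncate after %d retries due to repeated conflicting lock requests",
                            vacrel->relname,
                            VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES)));
            break;
        }
    } while (lock_waiter_detected && new_rel_pages > vacrel->nonempty_pages);
```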
I was able to perform the test on a similar setup and the results are as
follows (notice the `stopping truncate after 3 retries due to repeated
conflicting lock requests` messages coming from the patch):
```
VACUUM (VERBOSE) toast_target_table;
INFO: vacuuming "postgres.public.toast_target_table"
INFO: table "toast_target_table": suspending truncate due to conflicting
lock request
INFO: table "toast_target_table": truncated 144032 to 143744 pages
INFO: table "toast_target_table": suspending truncate due to conflicting
lock request
INFO: table "toast_target_table": truncated 143744 to 143456 pages
INFO: table "toast_target_table": suspending truncate due to conflicting
lock request
INFO: table "toast_target_table": truncated 143456 to 143168 pages
INFO: table "toast_target_table": stopping truncate after 3 retries due to
repeated conflicting lock requests
INFO: finished vacuuming "postgres.public.toast_target_table": index
scans: 0
pages: 864 removed, 143168 remain, 1 scanned (0.00% of total), 0 eagerly
scanned
tuples: 0 removed, 7544 remain, 0 are dead but not yet removable
removable cutoff: 2985896, which was 1 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
visibility map: 0 pages set all-visible, 0 pages set all-frozen (0 were
all-visible)
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item
identifiers removed
index "toast_target_table_pkey": pages: 28132 in total, 0 newly deleted,
27646 currently deleted, 27646 reusable
avg read rate: 13.423 MB/s, avg write rate: 0.006 MB/s
buffer usage: 28172 hits, 28571 reads, 12 dirtied
WAL usage: 22 records, 15 full page images, 120289 bytes, 0 buffers full
system usage: CPU: user: 0.13 s, system: 3.55 s, elapsed: 16.62 s
INFO: vacuuming "postgres.pg_toast.pg_toast_649469"
INFO: table "pg_toast_649469": suspending truncate due to conflicting lock
request
INFO: table "pg_toast_649469": truncated 2553888 to 2552416 pages
INFO: table "pg_toast_649469": suspending truncate due to conflicting lock
request
INFO: table "pg_toast_649469": truncated 2552416 to 2550560 pages
INFO: table "pg_toast_649469": suspending truncate due to conflicting lock
request
INFO: table "pg_toast_649469": truncated 2550560 to 2548832 pages
INFO: table "pg_toast_649469": stopping truncate after 3 retries due to
repeated conflicting lock requests
INFO: finished vacuuming "postgres.pg_toast.pg_toast_649469": index scans:
0
pages: 5056 removed, 2548832 remain, 1 scanned (0.00% of total), 0 eagerly
scanned
tuples: 0 removed, 14008 remain, 0 are dead but not yet removable
removable cutoff: 2985897, which was 1 XIDs old when operation ended
new relfrozenxid: 2985897, which is 2 XIDs ahead of previous value
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
visibility map: 0 pages set all-visible, 0 pages set all-frozen (0 were
all-visible)
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item
identifiers removed
index "pg_toast_649469_index": pages: 56259 in total, 0 newly deleted,
55293 currently deleted, 55293 reusable
avg read rate: 5.067 MB/s, avg write rate: 0.002 MB/s
buffer usage: 56371 hits, 61075 reads, 19 dirtied
WAL usage: 28 records, 20 full page images, 165414 bytes, 0 buffers full
system usage: CPU: user: 1.67 s, system: 59.10 s, elapsed: 94.16 s
```
All that said, I wanted to bring some visibility into this part of the
problem and also learn from the community about potential ways we could
solve it. One way is to cap the number of retries, as in the attached
patch. I feel this would be beneficial because it prioritizes overall
system availability by preventing the lazy truncation phase from repeatedly
acquiring the `AccessExclusiveLock` and holding it for extended periods.
The patch is definitely more of a proof of concept for now (it lacks tests
and could perhaps use a configurable parameter), and more importantly I am
curious to hear from the community whether this feels like a worthwhile
change or approach (assuming my understanding of the problem makes sense).
Or perhaps there are other ways to solve this problem as well?
Thank you
Shayon Mukherjee
Attachment | Content-Type | Size |
---|---|---|
script.rb | text/x-ruby-script | 3.8 KB |
v18-0001-Limit-VACUUM-truncation-retries-after-scan-inter.patch | application/octet-stream | 2.4 KB |
load-data.sql | application/octet-stream | 2.2 KB |