Re: Prolonged truncation phase during vacuum on toast table with repeated interruptions by lock waiters and a proposed POC patch

From: Shayon Mukherjee <shayonj(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Prolonged truncation phase during vacuum on toast table with repeated interruptions by lock waiters and a proposed POC patch
Date: 2025-05-07 13:33:25
Message-ID: CANqtF-rSHv4z9N5PczB7hsmZAO3=J3u7KPTc+-BWe4UUP5p=5w@mail.gmail.com
Lists: pgsql-hackers

Hello,

Attached are the updated files to help reproduce the state mentioned in the
previous email more easily.

- The load_data.sql file contains the commands to insert a large amount of
dummy data that overflows to TOAST
- That data is then deleted (via commands in the same SQL file) to create a
situation where a truncate will happen on a future vacuum
- Updated the Ruby script.rb to issue simple SELECT calls with high
concurrency
- Renamed the POC patch to `v1`

Thank you
Shayon

On Tue, May 6, 2025 at 6:05 PM Shayon Mukherjee <shayonj(at)gmail(dot)com> wrote:

> Hello hackers,
>
> I'd like to discuss an issue we observed where the truncation phase of
> autovacuum can become significantly prolonged on toast tables of busy
> systems due to repeated interruptions by lock waiters, leading to extended
> `AccessExclusiveLock` contention and impacting overall database &
> application availability.
>
>
> *Recap of the production incident*
> We experienced this issue firsthand, where a `VACUUM` on a TOAST table took
> approximately 5 hours. The following shows the scale of pages eventually
> removed and the duration:
>
> ```
> automatic vacuum of table "tines.pg_toast.pg_toast_21059": index scans: 1
> pages: 26096516 removed, 1421195 remain, 0 skipped due to pins, 26090236
> skipped frozen
> tuples: 4575083 removed, 26717710 remain, 21498 are dead but not yet
> removable, oldest xmin: 173810717
> index scan needed: 1151751 pages from table (4.19% of total) had 4629490
> dead item identifiers removed
> index "pg_toast_21059_index": pages: 303420 in total, 14038 newly deleted,
> 298668 currently deleted, 298418 reusable
> I/O timings: read: 18428758.270 ms, write: 0.000 ms
> avg read rate: 10.730 MB/s, avg write rate: 0.000 MB/s
> buffer usage: 5132686 hits, 25714763 misses, 0 dirtied
> WAL usage: 0 records, 0 full page images, 0 bytes
> system usage: CPU: user: 327.79 s, system: 251.11 s, elapsed: 18722.21 s
> ```
>
> During this 5-hour window, read queries against primary and replica
> databases (on AWS Aurora) were frequently failing due to lock timeouts
> (lock_timeout was set to 1s), consistent with the `AccessExclusiveLock`
> being held and released repeatedly by the truncation part of the autovacuum
> process.
>
> This led me to dig deeper into the problem and the order of events, which
> brings me to:
>
> *Understanding the problem*
> My understanding is that the PostgreSQL autovacuum process, specifically
> its lazy variant (`lazy_vacuum_heap`), performs a few passes to reclaim
> space. One critical phase, if deemed beneficial, is the truncation of empty
> pages from the end of a relation, handled by the `lazy_truncate_heap`
> function.
>
> As I traced through the code, the interaction occurs as follows (a
> condensed sketch of the loop appears after this list):
>
> - To physically truncate a relation, `VACUUM` must acquire an
> `AccessExclusiveLock`.
> - After acquiring the `AccessExclusiveLock`, `VACUUM` performs a
> verification step by scanning the relation backwards from its apparent end
> (`count_nondeletable_pages`). This is necessary because other transactions
> might have added data to these trailing pages.
> - While holding the `AccessExclusiveLock` and performing this backward
> scan, `VACUUM` periodically checks if any other database sessions are
> waiting for locks on the same relation using
> `LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock)`. This check
> occurs at intervals defined by `VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL` (20ms)
> and roughly every 32 blocks scanned.
> - Interruption and Retry Mechanism:
> - If `LockHasWaitersRelation` returns `true` during the backward scan,
> `count_nondeletable_pages` sets a flag (`lock_waiter_detected = true`) and
> returns, effectively "suspending" its scan to allow waiting sessions to
> proceed.
> - Back in `lazy_truncate_heap`, if `lock_waiter_detected` is true,
> `VACUUM` releases the `AccessExclusiveLock`.
> - `lazy_truncate_heap` contains a `do...while` loop. If
> `lock_waiter_detected` was true AND there are still potentially truncatable
> pages (`new_rel_pages > vacrel->nonempty_pages`), this loop causes `VACUUM`
> to retry the entire truncation sequence: attempt to re-acquire the
> `AccessExclusiveLock` (with its own timeout of
> `VACUUM_TRUNCATE_LOCK_TIMEOUT`, default 5s), restart the
> `count_nondeletable_pages` backward scan, and so on.
>
> - Next comes the "stuck in a loop" problem, and this is where things get
> interesting:
> - On a very busy system with constant concurrent access, `VACUUM` can
> repeatedly acquire the `AccessExclusiveLock`, start the backward scan,
> detect waiters, release the lock, and retry. Each retry of
> `count_nondeletable_pages` involves re-reading potentially many blocks from
> disk while holding the exclusive lock, which aggravates the lock contention.
> - This cycle leads to prolonged `VACUUM` operations, significant
> repeated I/O, and extended lock contention, reducing database availability
> and causing query timeouts for other sessions (e.g., `canceling statement
> due to lock timeout`, `LockNotAvailable`). This is particularly acute for
> the TOAST tables in my case, which sadly happen to be very large (for now).
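>
> To make that loop concrete, here is the shape of the code, heavily
> condensed and paraphrased from `lazy_truncate_heap()` and
> `count_nondeletable_pages()` in src/backend/access/heap/vacuumlazy.c (this
> is not the literal source; error paths, stats reporting, and prefetching
> are omitted):
>
> ```c
> /* Condensed/paraphrased sketch -- not the actual vacuumlazy.c code. */
> do
> {
>     /*
>      * Try to get AccessExclusiveLock, polling with
>      * ConditionalLockRelation() every VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL
>      * (50ms).  If we cannot get it within VACUUM_TRUNCATE_LOCK_TIMEOUT
>      * (5s), give up on truncation entirely for this VACUUM.
>      */
>     while (!ConditionalLockRelation(vacrel->rel, AccessExclusiveLock))
>     {
>         if (++lock_retry > VACUUM_TRUNCATE_LOCK_TIMEOUT /
>             VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL)
>             return;     /* "stopping truncate due to conflicting lock request" */
>         WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
>                   VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL,
>                   WAIT_EVENT_VACUUM_TRUNCATE);
>         ResetLatch(MyLatch);
>     }
>
>     /*
>      * Backward scan from the end of the relation to find the new tail.
>      * Inside this scan, roughly every 32 blocks and at most once per
>      * VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL (20ms), it does:
>      *
>      *     if (LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock))
>      *     {
>      *         *lock_waiter_detected = true;
>      *         return blkno;       -- suspend the scan where it stands
>      *     }
>      */
>     new_rel_pages = count_nondeletable_pages(vacrel, &lock_waiter_detected);
>
>     if (new_rel_pages < orig_rel_pages)
>         RelationTruncate(vacrel->rel, new_rel_pages);   /* partial progress */
>
>     UnlockRelation(vacrel->rel, AccessExclusiveLock);
>
>     /*
>      * If a waiter interrupted the scan and there is still apparently
>      * truncatable space left, go around again: re-acquire the lock and
>      * restart the backward scan.
>      */
> } while (lock_waiter_detected && new_rel_pages > vacrel->nonempty_pages);
> ```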
>
>
> *Testing the theory and patch*
> To validate this understanding, I was able to reproduce a similar scenario
> by loading a test table with a significant amount of TOAST-able data and
> then simulating a busy read workload against it concurrently with a
> `VACUUM` operation. I have attached a SQL script to load the data and a
> Ruby script to simulate the concurrent workload, if anyone is curious. The
> tests demonstrated how the truncation phase could indeed enter extended
> periods of repeatedly attempting to acquire the `AccessExclusiveLock` and
> performing the backward scan, only to be interrupted by the simulated
> concurrent workload.
>
> Considering the potential for the truncation phase to cause prolonged lock
> contention under these conditions, I am wondering: would it be beneficial
> to explicitly cap the number of retries for the truncation process within
> the `do...while` loop in `lazy_truncate_heap` when it is repeatedly
> interrupted by lock waiters?
>
> Since I was already deep in the code, I wrote a small patch that
> introduces a limit on the retries so I could observe the behavior.
> Specifically, it adds a counter `truncate_interruption_retry_count` and
> checks it against a new constant `VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES`
> (which for now I have set to 3). If the backward scan in
> `count_nondeletable_pages` is interrupted by lock waiters this many times,
> `VACUUM` will cease further truncation attempts for the current `VACUUM`
> cycle on that table.
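>
> In rough outline, the change looks like the following simplified sketch
> (the attached v1 patch is the actual diff; the surrounding loop is
> unchanged and the exact placement here is approximate):
>
> ```c
> /* Simplified sketch of the POC -- see the attached v1 patch for the real change. */
> #define VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES    3
>
> int     truncate_interruption_retry_count = 0;
>
> do
> {
>     /*
>      * ... acquire the AccessExclusiveLock, run the backward scan, truncate
>      * whatever was found to be removable, release the lock (all unchanged
>      * from today) ...
>      */
>
>     if (lock_waiter_detected &&
>         ++truncate_interruption_retry_count >= VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES)
>     {
>         ereport(vacrel->verbose ? INFO : DEBUG2,
>                 (errmsg("table \"%s\": stopping truncate after %d retries due to repeated conflicting lock requests",
>                         vacrel->relname,
>                         truncate_interruption_retry_count)));
>         break;      /* give up truncating for this VACUUM of this table */
>     }
> } while (lock_waiter_detected && new_rel_pages > vacrel->nonempty_pages);
> ```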
>
> I was able to perform the test on a similar setup and the results are as
> follows (notice the `stopping truncate after 3 retries due to repeated
> conflicting lock requests` coming from the patch):
>
> ```
> VACUUM (VERBOSE) toast_target_table;
> INFO: vacuuming "postgres.public.toast_target_table"
> INFO: table "toast_target_table": suspending truncate due to conflicting
> lock request
> INFO: table "toast_target_table": truncated 144032 to 143744 pages
> INFO: table "toast_target_table": suspending truncate due to conflicting
> lock request
> INFO: table "toast_target_table": truncated 143744 to 143456 pages
> INFO: table "toast_target_table": suspending truncate due to conflicting
> lock request
> INFO: table "toast_target_table": truncated 143456 to 143168 pages
> INFO: table "toast_target_table": stopping truncate after 3 retries due
> to repeated conflicting lock requests
> INFO: finished vacuuming "postgres.public.toast_target_table": index
> scans: 0
> pages: 864 removed, 143168 remain, 1 scanned (0.00% of total), 0 eagerly
> scanned
> tuples: 0 removed, 7544 remain, 0 are dead but not yet removable
> removable cutoff: 2985896, which was 1 XIDs old when operation ended
> frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
> visibility map: 0 pages set all-visible, 0 pages set all-frozen (0 were
> all-visible)
> index scan not needed: 0 pages from table (0.00% of total) had 0 dead item
> identifiers removed
> index "toast_target_table_pkey": pages: 28132 in total, 0 newly deleted,
> 27646 currently deleted, 27646 reusable
> avg read rate: 13.423 MB/s, avg write rate: 0.006 MB/s
> buffer usage: 28172 hits, 28571 reads, 12 dirtied
> WAL usage: 22 records, 15 full page images, 120289 bytes, 0 buffers full
> system usage: CPU: user: 0.13 s, system: 3.55 s, elapsed: 16.62 s
> INFO: vacuuming "postgres.pg_toast.pg_toast_649469"
> INFO: table "pg_toast_649469": suspending truncate due to conflicting
> lock request
> INFO: table "pg_toast_649469": truncated 2553888 to 2552416 pages
> INFO: table "pg_toast_649469": suspending truncate due to conflicting
> lock request
> INFO: table "pg_toast_649469": truncated 2552416 to 2550560 pages
> INFO: table "pg_toast_649469": suspending truncate due to conflicting
> lock request
> INFO: table "pg_toast_649469": truncated 2550560 to 2548832 pages
> INFO: table "pg_toast_649469": stopping truncate after 3 retries due to
> repeated conflicting lock requests
> INFO: finished vacuuming "postgres.pg_toast.pg_toast_649469": index
> scans: 0
> pages: 5056 removed, 2548832 remain, 1 scanned (0.00% of total), 0 eagerly
> scanned
> tuples: 0 removed, 14008 remain, 0 are dead but not yet removable
> removable cutoff: 2985897, which was 1 XIDs old when operation ended
> new relfrozenxid: 2985897, which is 2 XIDs ahead of previous value
> frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
> visibility map: 0 pages set all-visible, 0 pages set all-frozen (0 were
> all-visible)
> index scan not needed: 0 pages from table (0.00% of total) had 0 dead item
> identifiers removed
> index "pg_toast_649469_index": pages: 56259 in total, 0 newly deleted,
> 55293 currently deleted, 55293 reusable
> avg read rate: 5.067 MB/s, avg write rate: 0.002 MB/s
> buffer usage: 56371 hits, 61075 reads, 19 dirtied
> WAL usage: 28 records, 20 full page images, 165414 bytes, 0 buffers full
> system usage: CPU: user: 1.67 s, system: 59.10 s, elapsed: 94.16 s
> ```
>
> All that said, I wanted to bring some visibility to this part of the
> problem and also learn from the community about potential ways to solve
> it. One way is to cap the number of retries, as in the attached patch. I
> feel this would be beneficial because it prioritizes overall system
> availability, preventing the lazy truncation from repeatedly re-acquiring
> the `AccessExclusiveLock` and holding it for extended periods of time.
>
> The patch is definitely more of a proof of concept for now (it lacks tests
> and maybe some configurable parameters), and more importantly, I am curious
> to hear from the community whether this feels like a worthwhile change or
> approach (assuming my understanding of the problem makes sense), or whether
> there are other ways to solve this problem.
>
> Thank you
> Shayon Mukherjee
>

Attachment Content-Type Size
script.rb text/x-ruby-script 3.8 KB
load-data.sql application/sql 2.4 KB
v1-0001-Limit-VACUUM-truncation-retries-after-scan-interr.patch application/x-patch 2.4 KB
