Re: POC: Parallel processing of indexes in autovacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>
Cc: Daniil Davydov <3danissimo(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Sami Imseih <samimseih(at)gmail(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Matheus Alcantara <matheusssilv97(at)gmail(dot)com>, Maxim Orlov <orlovmg(at)gmail(dot)com>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: POC: Parallel processing of indexes in autovacuum
Date: 2026-04-01 18:54:26
Message-ID: CAD21AoAwcevLebQO6+MuWoOi9XNrNMg7gubFbLao2QkrRbOfMQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 30, 2026 at 5:14 PM SATYANARAYANA NARLAPURAM
<satyanarlapuram(at)gmail(dot)com> wrote:
>
> Hi
>
> On Mon, Mar 30, 2026 at 1:44 AM Daniil Davydov <3danissimo(at)gmail(dot)com> wrote:
>>
>> Hi,
>>
>> On Mon, Mar 30, 2026 at 7:17 AM SATYANARAYANA NARLAPURAM
>> <satyanarlapuram(at)gmail(dot)com> wrote:
>> >
>> > Thank you for working on this, very useful feature. Sharing a few thoughts:
>> >
>> > 1. Shouldn't we also cap by max_parallel_workers to avoid wasting DSM resources in parallel_vacuum_compute_workers?
>>
>> Actually, autovacuum_max_parallel_workers is already limited by
>> max_parallel_workers. It is not clear to me why we allow setting this GUC
>> higher than max_parallel_workers, but if this happens, I think it is a
>> user misconfiguration.
>>
>> > 2. Is it intentional that other autovacuum workers do not yield cost limits to the parallel autovacuum workers? Cost limits are first distributed equally to the autovacuum workers,
>> > and then they share that. Therefore, parallel workers will be heavily throttled. IIUC, this problem doesn't exist with manual vacuum.
>> > If we don't fix this, at least we should document it.
>>
>> Parallel a/v workers inherit cost-based parameters (including
>> vacuum_cost_limit) from the leader worker. Do you mean that this can be too
>> low a value for a parallel operation? If so, the user can manually increase
>> the vacuum_cost_limit reloption for those tables where parallel a/v sleeps
>> too much (due to cost delay).
>>
>> BTW, the propagation of the cost limit to parallel a/v workers is worth
>> mentioning in the documentation. I'll add it in the next patch version.
>>
>> > 3. Additionally, is there a point where, based on the cost limits, launching additional workers becomes counterproductive compared to running fewer workers and preventing it?
>>
>> I don't think we can find a universal limit that is appropriate for all
>> possible configurations. For now we use a pretty simple formula for the
>> parallel degree calculation. Since the user has several ways to affect this
>> formula, I don't expect any problems with it (except my concerns about the
>> opt-out style).
>>
>> > 4. Would it make sense to add a table level override to disable parallelism or set parallel worker count?
>>
>> We already have the "autovacuum_parallel_workers" reloption, which serves
>> as an additional limit on the number of parallel workers. In particular,
>> this reloption can be used to disable parallelism entirely.
>>
>> >
>> > I ran some perf tests to show the improvements with parallel vacuum and shared below.
>>
>> Thank you very much!
>>
>> > Observations:
>> >
>> > 1. Parallel autovacuum provides consistent speedup. With cost_limit=200 and
>> > 7 workers, vacuum completes 1.41x faster (71s -> 50s). With cost_limit=60,
>> > the speedup is 1.25x (194s -> 154s).
>> > 2. I see the benefit comes from parallelizing index vacuum. With 8 indexes totaling
>> > ~530 MB, parallel workers scan indexes concurrently instead of the leader
>> > scanning them one by one. The leader's CPU user time drops from ~3s to
>> > ~0.8s as index work is offloaded
>> >
>>
>> A 1.41x speedup with 7 parallel workers may not seem like a great win, but
>> it covers the whole autovacuum operation (not only index bulkdel/cleanup)
>> with pretty small indexes.
>>
>> May I ask you to run the same test with a larger table (several dozen
>> gigabytes)? I think the results will be more "expressive".
>
>
> I ran it with a billion rows in a table with 8 indexes. The improvement with 7 workers is 1.8x.
> Please note that there is a fixed overhead in the other vacuum steps, for example the heap scan.
> In environments where cost-based delay is used (the default), the benefits will be modest
> unless vacuum_cost_delay is set to a sufficiently large value.
>
> Hardware:
> CPU: Intel Xeon Platinum 8573C, 1 socket × 8 cores × 2 threads = 16 vCPUs
> RAM: 128 GB (131,900 MB)
> Swap: None
>
> Workload Description
>
> Table Schema:
> CREATE TABLE avtest (
> id bigint PRIMARY KEY,
> col1 int, -- random()*1e9
> col2 int, -- random()*1e9
> col3 int, -- random()*1e9
> col4 int, -- random()*1e9
> col5 int, -- random()*1e9
> col6 text, -- 'text_' || random()*1e6 (short text ~10 chars)
> col7 timestamp, -- now() - random()*365 days
> padding text -- repeat('x', 50)
> ) WITH (fillfactor = 90);
>
> Indexes (8 total):
> avtest_pkey — btree on (id) bigint
> idx_av_col1 — btree on (col1) int
> idx_av_col2 — btree on (col2) int
> idx_av_col3 — btree on (col3) int
> idx_av_col4 — btree on (col4) int
> idx_av_col5 — btree on (col5) int
> idx_av_col6 — btree on (col6) text
> idx_av_col7 — btree on (col7) timestamp
>
> Dead Tuple Generation:
> DELETE FROM avtest WHERE id % 5 IN (1, 2);
> This deletes exactly 40% of rows, uniformly distributed across all pages.
>
> Vacuum Trigger:
> Autovacuum is triggered naturally by lowering the threshold to 0 and setting
> scale_factor to a value that causes immediate launch after the DELETE.
>
> Worker Configurations Tested:
> 0 workers — leader-only vacuum (baseline, no parallelism)
> 2 workers — leader + 2 parallel workers (3 processes total)
> 4 workers — leader + 4 parallel workers (5 processes total)
> 7 workers — leader + 7 parallel workers (8 processes total, 1 per index)
>
> Dataset:
> Rows: 1,000,000,000
> Heap size: 139 GB
> Total size: 279 GB (heap + 8 indexes)
> Dead tuples: 400,000,000 (40%)
>
> Index Sizes:
> avtest_pkey 21 GB (bigint)
> idx_av_col7 21 GB (timestamp)
> idx_av_col1 18 GB (int)
> idx_av_col2 18 GB (int)
> idx_av_col3 18 GB (int)
> idx_av_col4 18 GB (int)
> idx_av_col5 18 GB (int)
> idx_av_col6 7 GB (text — shorter keys, smaller index)
> Total indexes: 139 GB
>
> Server Settings:
> shared_buffers = 96GB
> maintenance_work_mem = 1GB
> max_wal_size = 100GB
> checkpoint_timeout = 1h
> autovacuum_vacuum_cost_delay = 0ms (NO throttling)
> autovacuum_vacuum_cost_limit = 1000
>
>
> Summary:
>
> Workers Avg(s) Min(s) Max(s) Speedup Time Saved
> ------- ------ ------ ------ ------- ----------
> 0 1645.93 1645.01 1646.84 1.00x —
> 2 1276.35 1275.64 1277.05 1.29x 369.58s (6.2 min)
> 4 1052.62 1048.92 1056.32 1.56x 593.31s (9.9 min)
> 7 892.23 886.59 897.86 1.84x 753.70s (12.6 min)
>

Thank you for sharing the performance test results!

While the benchmark results look good to me, have you compared the
performance difference between parallel vacuum in the VACUUM command
(with the PARALLEL option) and parallel vacuum in autovacuum? Since
parallel autovacuum introduces some logic to check for delay parameter
updates, I thought it was worth verifying whether this adds any overhead.

BTW, in my view, the most challenging part of this patch is the
propagation logic for vacuum delay parameters. This propagation is
necessary because, unlike manual VACUUM, autovacuum workers can reload
their configuration during operation. We must ensure that parallel
workers stay synchronized with these updated parameters.

The current patch implements this in vacuumparallel.c: the leader
shares the delay parameters in DSM and updates them after a config
reload (if any vacuum delay parameters have changed), while workers
poll for updates at every vacuum_delay_point() call to refresh their
local copies.

Another possible approach would be an event-driven model in which the
leader notifies workers after updating the shared parameters, for
example by adding a shm_mq between the leader (as the sender) and each
worker (as the receiver).

I've compared these two ideas and opted for the former (polling).
While a polling approach could theoretically be costly, the current
implementation is self-contained within the parallel vacuum logic and
does not touch the core parallel query infrastructure. The
notification approach might look more elegant, but I'm concerned it
adds unnecessary complexity just for the autovacuum case. Since the
polling is essentially just checking an atomic variable, the overhead
should be negligible.

To verify this, I conducted benchmarks comparing the whole execution
time and index vacuuming duration.

Setup:

- Disabled (auto) vacuum delays and buffer usage limits.
- Parallel autovacuum with 1 worker on a table with 2 indexes (approx.
4 GB each).
- 5 runs.

Case 1: The latest patch (with polling)

Average: 3.95s (Index: 1.54s)
Median: 3.62s (Index: 1.37s)

Case 2: The latest patch without polling

Average: 3.98s (Index: 1.56s)
Median: 3.70s (Index: 1.40s)

Note that, in order to simulate code without the polling, I reverted
the following change:

-    if (InterruptPending ||
-        (!VacuumCostActive && !ConfigReloadPending))
+    if (InterruptPending)
+        return;
+
+    if (IsParallelWorker())
+    {
+        /*
+         * Update cost-based vacuum delay parameters for a parallel autovacuum
+         * worker if any changes are detected.
+         */
+        parallel_vacuum_update_shared_delay_params();
+    }
+
+    if (!VacuumCostActive && !ConfigReloadPending)

With this change, the parallel vacuum workers don't check the shared
vacuum delay parameters at all, which is still fine since I disabled
vacuum delays.

Overall, the results show no noticeable overhead from the polling approach.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
