| From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
|---|---|
| To: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
| Cc: | SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, Daniil Davydov <3danissimo(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Sami Imseih <samimseih(at)gmail(dot)com>, Matheus Alcantara <matheusssilv97(at)gmail(dot)com>, Maxim Orlov <orlovmg(at)gmail(dot)com>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: POC: Parallel processing of indexes in autovacuum |
| Date: | 2026-04-02 23:30:42 |
| Message-ID: | CAD21AoACRVCT-ub+LTAtDaEZjxmwFcC7ON9_jfqpYegPdeXXOA@mail.gmail.com |
| Lists: | pgsql-hackers |
On Thu, Apr 2, 2026 at 4:02 AM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
>
> Hi!
>
> On Wed, Apr 1, 2026 at 9:55 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Mon, Mar 30, 2026 at 5:14 PM SATYANARAYANA NARLAPURAM
> > <satyanarlapuram(at)gmail(dot)com> wrote:
> > >
> > > Hi
> > >
> > > On Mon, Mar 30, 2026 at 1:44 AM Daniil Davydov <3danissimo(at)gmail(dot)com> wrote:
> > >>
> > >> Hi,
> > >>
> > >> On Mon, Mar 30, 2026 at 7:17 AM SATYANARAYANA NARLAPURAM
> > >> <satyanarlapuram(at)gmail(dot)com> wrote:
> > >> >
> > >> > Thank you for working on this, very useful feature. Sharing a few thoughts:
> > >> >
> > >> > 1. Shouldn't we also cap by max_parallel_workers to avoid wasting DSM resources in parallel_vacuum_compute_workers?
> > >>
> > >> Actually, autovacuum_max_parallel_workers is already limited by
> > >> max_parallel_workers. It is not clear to me why we allow setting this GUC
> > >> higher than max_parallel_workers, but if this happens, I think it is a
> > >> user's misconfiguration.
> > >>
> > >> > 2. Is it intentional that other autovacuum workers do not yield cost limits to the parallel autovacuum workers? Cost limits are first distributed equally among the autovacuum workers,
> > >> > and then they share that. Therefore, parallel workers will be heavily throttled. IIUC, this problem doesn't exist with manual vacuum.
> > >> > If we don't fix this, at least we should document it.
> > >>
> > >> Parallel a/v workers inherit cost-based parameters (including
> > >> vacuum_cost_limit) from the leader worker. Do you mean that this can be
> > >> too low a value for a parallel operation? If so, the user can manually
> > >> increase the vacuum_cost_limit reloption for those tables where parallel
> > >> a/v sleeps too much (due to cost delay).
> > >>
> > >> BTW, the cost limit propagation to parallel a/v workers is worth
> > >> mentioning in the documentation. I'll add it in the next patch version.
> > >>
> > >> > 3. Additionally, is there a point where, based on the cost limits, launching additional workers becomes counterproductive compared to running fewer workers, and should we prevent that?
> > >>
> > >> I don't think we can find a universal limit that will be appropriate
> > >> for all possible configurations. For now we are using a pretty simple
> > >> formula for the parallel degree calculation. Since the user has several
> > >> ways to affect this formula, I guess there will be no problems with it
> > >> (except my concerns about the opt-out style).
> > >>
> > >> > 4. Would it make sense to add a table level override to disable parallelism or set parallel worker count?
> > >>
> > >> We already have the "autovacuum_parallel_workers" reloption that is used as
> > >> an additional limit on the number of parallel workers. In particular, this
> > >> reloption can be used to disable parallelism entirely.
> > >>
> > >> >
> > >> > I ran some perf tests to show the improvements with parallel vacuum and shared below.
> > >>
> > >> Thank you very much!
> > >>
> > >> > Observations:
> > >> >
> > >> > 1. Parallel autovacuum provides consistent speedup. With cost_limit=200 and
> > >> > 7 workers, vacuum completes 1.41x faster (71s -> 50s). With cost_limit=60,
> > >> > the speedup is 1.25x (194s -> 154s).
> > >> > 2. I see the benefit comes from parallelizing index vacuum. With 8 indexes totaling
> > >> > ~530 MB, parallel workers scan indexes concurrently instead of the leader
> > >> > scanning them one by one. The leader's CPU user time drops from ~3s to
> > >> > ~0.8s as index work is offloaded
> > >> >
> > >>
> > >> A 1.41x speedup with 7 parallel workers may not seem like a great win,
> > >> but it covers the whole autovacuum operation (not only index
> > >> bulkdel/cleanup) with pretty small indexes.
> > >>
> > >> May I ask you to run the same test with a much larger table (several
> > >> dozen gigabytes)? I think the results will be more "expressive".
> > >
> > >
> > > I ran it with a billion rows in a table with 8 indexes. The improvement with 7 workers is 1.8x.
> > > Please note that there is a fixed overhead in the other vacuum steps, for example the heap scan.
> > > In environments where cost-based delay is used (the default), the benefits will be modest
> > > unless vacuum_cost_delay is set to a sufficiently large value.
> > >
> > > Hardware:
> > > CPU: Intel Xeon Platinum 8573C, 1 socket × 8 cores × 2 threads = 16 vCPUs
> > > RAM: 128 GB (131,900 MB)
> > > Swap: None
> > >
> > > Workload Description
> > >
> > > Table Schema:
> > > CREATE TABLE avtest (
> > > id bigint PRIMARY KEY,
> > > col1 int, -- random()*1e9
> > > col2 int, -- random()*1e9
> > > col3 int, -- random()*1e9
> > > col4 int, -- random()*1e9
> > > col5 int, -- random()*1e9
> > > col6 text, -- 'text_' || random()*1e6 (short text ~10 chars)
> > > col7 timestamp, -- now() - random()*365 days
> > > padding text -- repeat('x', 50)
> > > ) WITH (fillfactor = 90);
> > >
> > > Indexes (8 total):
> > > avtest_pkey — btree on (id) bigint
> > > idx_av_col1 — btree on (col1) int
> > > idx_av_col2 — btree on (col2) int
> > > idx_av_col3 — btree on (col3) int
> > > idx_av_col4 — btree on (col4) int
> > > idx_av_col5 — btree on (col5) int
> > > idx_av_col6 — btree on (col6) text
> > > idx_av_col7 — btree on (col7) timestamp
> > >
> > > Dead Tuple Generation:
> > > DELETE FROM avtest WHERE id % 5 IN (1, 2);
> > > This deletes exactly 40% of rows, uniformly distributed across all pages.
> > >
> > > Vacuum Trigger:
> > > Autovacuum is triggered naturally by lowering the threshold to 0 and setting
> > > scale_factor to a value that causes immediate launch after the DELETE.
> > >
> > > Worker Configurations Tested:
> > > 0 workers — leader-only vacuum (baseline, no parallelism)
> > > 2 workers — leader + 2 parallel workers (3 processes total)
> > > 4 workers — leader + 4 parallel workers (5 processes total)
> > > 7 workers — leader + 7 parallel workers (8 processes total, 1 per index)
> > >
> > > Dataset:
> > > Rows: 1,000,000,000
> > > Heap size: 139 GB
> > > Total size: 279 GB (heap + 8 indexes)
> > > Dead tuples: 400,000,000 (40%)
> > >
> > > Index Sizes:
> > > avtest_pkey 21 GB (bigint)
> > > idx_av_col7 21 GB (timestamp)
> > > idx_av_col1 18 GB (int)
> > > idx_av_col2 18 GB (int)
> > > idx_av_col3 18 GB (int)
> > > idx_av_col4 18 GB (int)
> > > idx_av_col5 18 GB (int)
> > > idx_av_col6 7 GB (text — shorter keys, smaller index)
> > > Total indexes: 139 GB
> > >
> > > Server Settings:
> > > shared_buffers = 96GB
> > > maintenance_work_mem = 1GB
> > > max_wal_size = 100GB
> > > checkpoint_timeout = 1h
> > > autovacuum_vacuum_cost_delay = 0ms (NO throttling)
> > > autovacuum_vacuum_cost_limit = 1000
> > >
> > >
> > > Summary:
> > >
> > > Workers Avg(s) Min(s) Max(s) Speedup Time Saved
> > > ------- ------ ------ ------ ------- ----------
> > > 0 1645.93 1645.01 1646.84 1.00x —
> > > 2 1276.35 1275.64 1277.05 1.29x 369.58s (6.2 min)
> > > 4 1052.62 1048.92 1056.32 1.56x 593.31s (9.9 min)
> > > 7 892.23 886.59 897.86 1.84x 753.70s (12.6 min)
> > >
> >
> > Thank you for sharing the performance test results!
> >
> > While the benchmark results look good to me, have you compared the
> > performance differences between parallel vacuum in the VACUUM command
> > (with the PARALLEL option) and parallel vacuum in autovacuum? Since
> > parallel autovacuum introduces some logic to check for delay parameter
> > updates, I thought it was worth verifying if this adds any overhead.
> >
> > BTW, in my view, the most challenging part of this patch is the
> > propagation logic for vacuum delay parameters. This propagation is
> > necessary because, unlike manual VACUUM, autovacuum workers can reload
> > their configuration during operation. We must ensure that parallel
> > workers stay synchronized with these updated parameters.
> >
> > The current patch implements this in vacuumparallel.c: the leader
> > shares delay parameters in DSM and updates them (if any vacuum delay
> > parameters are updated) after a config reload, while workers poll for
> > updates at every vacuum_delay_point() call to refresh their local
> > variables.
> >
> > Another possible approach would be an event-driven model where the
> > leader notifies workers after updating shared parameters; for example,
> > by adding a shm_mq between the leader (as the sender) and each worker
> > (as the receiver).
> >
> > I've compared these two ideas and opted for the former (polling).
> > While a polling approach could theoretically be costly, the current
> > implementation is self-contained within the parallel vacuum logic and
> > does not touch the core parallel query infrastructure. The
> > notification approach might look more elegant, but I'm concerned it
> > adds unnecessary complexity just for the autovacuum case. Since the
> > polling is essentially just checking an atomic variable, the overhead
> > should be negligible.
> >
> > To verify this, I conducted benchmarks comparing the whole execution
> > time and index vacuuming duration.
> >
> > Setup:
> >
> > - Disabled (auto) vacuum delays and buffer usage limits.
> > - Parallel autovacuum with 1 worker on a table with 2 indexes (approx.
> > 4 GB each).
> > - 5 runs.
> >
> > Case 1: The latest patch (with polling)
> >
> > Average: 3.95s (Index: 1.54s)
> > Median: 3.62s (Index: 1.37s)
> >
> > Case 2: The latest patch without polling
> >
> > Average: 3.98s (Index: 1.56s)
> > Median: 3.70s (Index: 1.40s)
> >
> > Note that in order to simulate the code that doesn't have the polling,
> > I reverted the following change:
> >
> > - if (InterruptPending ||
> > - (!VacuumCostActive && !ConfigReloadPending))
> > + if (InterruptPending)
> > + return;
> > +
> > + if (IsParallelWorker())
> > + {
> > + /*
> > + * Update cost-based vacuum delay parameters for a parallel autovacuum
> > + * worker if any changes are detected.
> > + */
> > + parallel_vacuum_update_shared_delay_params();
> > + }
> > +
> > + if (!VacuumCostActive && !ConfigReloadPending)
> >
> > The parallel vacuum workers don't check the shared vacuum delay
> > parameters at all, which is still fine since I disabled vacuum delays.
> >
> > Overall, the results show no noticeable overhead from the polling approach.
>
> I would say this polling approach is very cheap. When there are no
> updates, it only has to check a single 32-bit value from shared
> memory. And that value doesn't get updated frequently; it's good for
> caching. No wonder we see no measurable overhead.
Thank you for the comments!
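Right. To make the cost of that check concrete, the worker-side polling
boils down to something like the following standalone sketch. All
identifiers here are mine for illustration; the actual patch keeps the
parameters in DSM and uses pg_atomic_uint32 rather than C11 atomics, and
I'm glossing over the synchronization details around the parameter copy:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared state the leader places in shared memory. */
typedef struct SharedDelayParams
{
    _Atomic uint32_t generation;    /* bumped by the leader on each update */
    double cost_delay_ms;
    int    cost_limit;
} SharedDelayParams;

/* Each worker's process-local copy of the parameters. */
typedef struct LocalDelayCache
{
    uint32_t cached_generation;
    double   cost_delay_ms;
    int      cost_limit;
} LocalDelayCache;

/* Leader: publish updated parameters after a config reload.  Writing the
 * parameters before bumping the generation makes them visible to any
 * worker that observes the new generation value. */
static void
leader_publish(SharedDelayParams *shared, double delay_ms, int limit)
{
    shared->cost_delay_ms = delay_ms;
    shared->cost_limit = limit;
    atomic_fetch_add(&shared->generation, 1);
}

/* Worker: called at each delay point.  The common case (no update) is a
 * single atomic read that finds the generation unchanged. */
static bool
worker_maybe_refresh(SharedDelayParams *shared, LocalDelayCache *cache)
{
    uint32_t gen = atomic_load(&shared->generation);

    if (gen == cache->cached_generation)
        return false;           /* nothing changed; no shmem copy needed */

    cache->cost_delay_ms = shared->cost_delay_ms;
    cache->cost_limit = shared->cost_limit;
    cache->cached_generation = gen;
    return true;
}
```

As you say, the unchanged-generation path is a single 32-bit read of a
rarely-written cache line, so it's hard to beat.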
>
> Regarding the event-driven approach, given that the parallel worker
> process is busy with other jobs (doing actual vacuuming), it would
> anyway have to poll for new events from time to time. Thus, I don't
> think it's possible to organize polling for new events any cheaper
> than the current approach of polling for updates in shmem.
What do you think about the idea of using proc signals like the patch
I've sent recently[1]? With that approach, workers only have to check a
local variable. It seems slightly cheaper and can reuse the existing
logic.
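To illustrate the idea, here is a minimal sketch. The names are
hypothetical and it uses plain C signals rather than PostgreSQL's
ProcSignal infrastructure, but the shape is the same: the leader signals
each worker after updating the shared parameters, the handler only sets a
process-local flag, and the worker tests that flag at each
vacuum_delay_point():

```c
#include <signal.h>
#include <stdbool.h>

/* Process-local flag; sig_atomic_t so the handler can set it safely. */
static volatile sig_atomic_t delay_params_changed = false;

/* Signal handler: just record that an update is pending; the actual
 * re-read of shared parameters happens later, outside the handler. */
static void
handle_delay_param_signal(int signo)
{
    (void) signo;
    delay_params_changed = true;
}

/* Called at each delay point; returns true when the worker should
 * re-read the shared parameters from DSM. */
static bool
check_delay_params(void)
{
    if (!delay_params_changed)
        return false;           /* common case: one local read, no shmem access */

    delay_params_changed = false;
    return true;
}
```

The common-case check never touches shared memory at all, which is why it
seems slightly cheaper than polling a shared counter.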
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com