Re: POC: Parallel processing of indexes in autovacuum

From: Sami Imseih <samimseih(at)gmail(dot)com>
To: Daniil Davydov <3danissimo(at)gmail(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Matheus Alcantara <matheusssilv97(at)gmail(dot)com>, Maxim Orlov <orlovmg(at)gmail(dot)com>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: POC: Parallel processing of indexes in autovacuum
Date: 2025-05-22 17:48:07
Message-ID: CAA5RZ0twipMFOv0ag9Hx4z1APoo5mRu7T1t+OebAMtJmhttaig@mail.gmail.com
Lists: pgsql-hackers

I started looking at the patch, but I have some high-level thoughts I would
like to share before looking further.

> > I find that the name "autovacuum_reserved_workers_num" is generic. It
> > would be better to have a more specific name for parallel vacuum such
> > as autovacuum_max_parallel_workers. This parameter is related to
> > neither autovacuum_worker_slots nor autovacuum_max_workers, which
> > seems fine to me. Also, max_parallel_maintenance_workers doesn't
> > affect this parameter.
> > .......
> > I've also considered some alternative names. If we were to use
> > parallel_maintenance_workers, it sounds like it controls the parallel
> > degree for all operations using max_parallel_maintenance_workers,
> > including CREATE INDEX. Similarly, vacuum_parallel_workers could be
> > interpreted as affecting both autovacuum and manual VACUUM commands,
> > suggesting that when users run "VACUUM (PARALLEL) t", the system would
> > use their specified value for the parallel degree. I prefer
> > autovacuum_parallel_workers or vacuum_parallel_workers.
> >
>
> This was my headache when I created names for variables. Autovacuum
> initially implies parallelism, because we have several parallel a/v
> workers. So I think that parameter like
> `autovacuum_max_parallel_workers` will confuse somebody.
> If we want to have a more specific name, I would prefer
> `max_parallel_index_autovacuum_workers`.

I don't think we should have a separate pool of workers dedicated to
parallel autovacuum. At the end of the day, these are parallel workers and
they should be capped by max_parallel_workers. I think it will be confusing
if we claim these are parallel workers but they come from a different pool.

I envision another GUC such as "max_parallel_autovacuum_workers" (which I
think is a better name) that matches the behavior of
"max_parallel_maintenance_workers". That means the autovacuum workers keep
their existing behavior (launching a worker per table), and if they do need
to vacuum in parallel, they can draw from a pool of parallel workers.

With that said, I think the reloption should be a number of parallel workers
rather than a boolean. Take the example of a user with 3 tables that they
want (auto)vacuum to be able to process in parallel, with each table
autovacuumed by 4 parallel workers when available. However, so as not to
overload the system, they cap 'max_parallel_maintenance_workers' at
something like 8. If all 3 tables happen to be autovacuumed at the same
time, there may not be enough parallel workers, so one table will be the
loser and be vacuumed serially. That is acceptable, and a/v logging (and
perhaps other stat views) should display this behavior: workers planned vs.
workers launched.
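
To make the scenario concrete, here is a rough sketch in Python (for
brevity, not PostgreSQL code). Everything in it is an assumption made for
illustration: the WorkerPool class, the autovacuum_tables function, the
table names, and the simplification that the cap of 8 acts as one shared
pool and that workers are handed out first come, first served. It only
shows how "planned vs. launched" would look when 3 tables each ask for 4
workers.

```python
from dataclasses import dataclass


@dataclass
class WorkerPool:
    """Hypothetical shared pool of parallel workers (illustration only)."""
    capacity: int
    in_use: int = 0

    def acquire(self, planned: int) -> int:
        """Grant up to `planned` workers; may grant fewer, or none at all."""
        launched = max(0, min(planned, self.capacity - self.in_use))
        self.in_use += launched
        return launched

    def release(self, count: int) -> None:
        """Return workers to the pool once a table has been vacuumed."""
        self.in_use = max(0, self.in_use - count)


def autovacuum_tables(pool, requests):
    """Simulate autovacuum workers starting at the same time, one per table.

    `requests` maps a table name to the number of parallel workers asked
    for through the (proposed) per-table reloption.
    Returns a list of (table, planned, launched) tuples.
    """
    report = []
    for table, planned in requests.items():
        launched = pool.acquire(planned)
        report.append((table, planned, launched))
    return report


if __name__ == "__main__":
    pool = WorkerPool(capacity=8)              # the cap from the example
    requests = {"t1": 4, "t2": 4, "t3": 4}     # 3 tables, 4 workers each
    for table, planned, launched in autovacuum_tables(pool, requests):
        mode = "parallel" if launched > 0 else "serial"
        print(f"{table}: workers planned={planned} launched={launched} ({mode})")
```

Under these assumptions t1 and t2 each get 4 workers and t3 gets none, so
t3 is vacuumed serially. The planned/launched pair per table is the kind of
information I would expect the a/v log to carry.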

Thoughts?

--
Sami Imseih
Amazon Web Services (AWS)
