Re: another autovacuum scheduling thread

From: Sami Imseih <samimseih(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: David Rowley <dgrowleyml(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: another autovacuum scheduling thread
Date: 2025-11-24 15:19:12
Message-ID: CAA5RZ0t8f0u7x2p8y+7UTbSOLZZ4_WPgdt3TBeGkxgt6vn5ASA@mail.gmail.com
Lists: pgsql-hackers

> > What I have not been able to prove from my tests is that the processing
> > order of tables by autovacuum will actually make things any better or any
> > worse. My tests have been short 30 minute tests that count how many
> > vacuum cycles tables with various DML activity and sizes received.
> > I have not found much difference. I am also not sure how valuable
> > these short-duration tests are either.
>
> Yeah, I'm not sure that would be the right way to look for a benefit
> from something like this. I think that a better test scenario might
> involve figuring out how fast we can recover from a bad situation. As
> we've discussed before, if VACUUM is chronically unable to keep up
> with the workload, then the system is going to get into a very bad
> state and there's not really any help for it. But if we start to get
> into a bad situation due to some outside interference and then someone
> removes the interference, we might hope that this patch would help us
> get back on our feet more quickly.
>
> For instance, suppose that we have a database with a stale replication
> slot, so the oldest-XID value for the cluster keeps getting older and
> older. autovacuum is probably running but it can't clean anything up.
> Then at some point, the DBA realizes that bad things are happening and
> drops the replication slot. You might hope that, with the patch,
> autovacuum would do a better job getting the system back to a working
> state. If you set up some kind of test scenario, you could ask
> questions like "what is the largest age(relfrozenxid) that we observe
> in the database at any point during the test?" or "from the time the
> replication slot is dropped, how much time passes before
> age(datfrozenxid) drops to normal?" or "what is the maximum observed
> amount of bloat during the test?".

In my experience, in these situations you need to run manual vacuums
to supplement autovacuum and get bloat under control as quickly as
possible. If the tables are small and vacuum quickly, the order of
prioritization doesn't matter much, even with the extra bloat or high XID
age. However, if your slow-to-vacuum tables have the most bloat or the
oldest XID age, prioritizing those tables means that your smaller,
faster-to-vacuum tables will almost certainly not get vacuum cycles
quickly enough once whatever was blocking vacuum (such as long-running
transactions or stale replication slots) has been resolved. In the current
system, these smaller tables might still get vacuumed, but often by pure
chance due to the pg_class ordering.
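
To make the ordering concern concrete, here is a toy sketch, not a
measurement. It assumes a single worker that vacuums tables strictly one
after another; the table names and vacuum durations are made up, and the
list order merely stands in for pg_class ordering.

```python
# Toy model: one autovacuum worker processes tables serially.
# Table names and durations (minutes) are hypothetical.
from itertools import accumulate

def finish_times(order):
    """Return {table: minutes until its vacuum completes} for a serial worker."""
    names = [name for name, _ in order]
    ends = accumulate(minutes for _, minutes in order)
    return dict(zip(names, ends))

tables = [("small_a", 1), ("big_x", 120), ("small_b", 2), ("big_y", 90)]

# Stand-in for pg_class ordering: the order the list happens to be in.
catalog_order = finish_times(tables)

# Stand-in for "worst first", assuming the slow tables score highest.
priority_order = finish_times(
    sorted(tables, key=lambda t: t[1], reverse=True)
)

for name in ("small_a", "small_b"):
    print(name, catalog_order[name], priority_order[name])
```

In this made-up case the small tables finish within about two hours in
list order, but only after every large table when the large ones are
taken first.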

Speeding up recovery (removing bloat and freezing rows) will require
enabling more autovacuum workers (which are now dynamic) or running
manual vacuums. I don't think prioritization will improve these
situations much.
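
A similarly rough sketch of the worker-count point, under loudly stated
assumptions: each free worker simply takes the next table in the list,
the durations are hypothetical, and the model ignores that the vacuum
cost limit is balanced across running workers, so real gains would be
smaller.

```python
# Toy model: time until every table has been vacuumed once,
# for a given processing order and number of workers.
import heapq

def makespan(durations, workers):
    """Minutes until all tables are done; a free worker takes the next table."""
    free_at = [0] * workers          # time at which each worker becomes free
    heapq.heapify(free_at)
    end = 0
    for minutes in durations:
        start = heapq.heappop(free_at)   # earliest available worker
        finish = start + minutes
        heapq.heappush(free_at, finish)
        end = max(end, finish)
    return end

durations = [1, 120, 2, 90]              # hypothetical vacuum times, minutes
worst_first = sorted(durations, reverse=True)

for workers in (1, 2):
    print(workers, makespan(durations, workers), makespan(worst_first, workers))
```

With these numbers, reordering leaves the total recovery time unchanged
for a given worker count, while going from one worker to two shortens it.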

--
Sami Imseih
Amazon Web Services (AWS)
