Re: Add autovacuum_warning to surface concurrent vacuum collisions

From: Hüseyin Demir <huseyin(dot)d3r(at)gmail(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>
Subject: Re: Add autovacuum_warning to surface concurrent vacuum collisions
Date: 2026-06-26 14:28:10
Message-ID: 178248409074.991.5212872724899470915.pgcf@coridan.postgresql.org
Lists: pgsql-hackers

Hi,

This is an area that genuinely lacks observability. However, I have some concerns about this patch that I think we should discuss.
The condition being logged is correct, intentional behavior. The skip mechanism is designed exactly for this case:
Worker B backs off, moves to the next table, and the system makes progress. Logging correct behavior as if it were a warning conflates a healthy scheduler decision with a fault condition.
On any busy OLTP system with autovacuum_max_workers > 1, workers will skip tables held by other workers in every vacuum cycle. That is not a transient edge case; it is the steady state of a loaded database.

This means the GUC has two operating modes: on a quiet system it never fires (no value), and on a busy system it always fires (pure noise).
The checkpoint_warning analogy does not hold up. checkpoint_warning fires when the system deviates from healthy behavior (checkpoints too frequent). This GUC fires during expected behavior. Furthermore, checkpoint_warning is an integer (seconds) with built-in rate limiting via an elapsed-time comparison; a bare boolean offers none of that, so on a loaded system it would emit one log line per skipped table per vacuum cycle per worker.
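To make the rate-limiting point concrete, here is a rough standalone sketch of what an integer (seconds) setting buys over a boolean. This is not PostgreSQL code: skip_warning_secs, last_skip_log_time and skip_log_allowed() are made-up names, and the logic is only my assumption of how an elapsed-time guard could look for this message.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for an integer GUC in seconds; 0 disables the message. */
static int skip_warning_secs = 30;

/* Time of the last emitted message; 0 means nothing has been emitted yet. */
static time_t last_skip_log_time = 0;

/*
 * Return true if a skip message may be emitted at time "now", and remember
 * that time. At most one message is allowed per skip_warning_secs interval,
 * no matter how many tables were skipped in between.
 */
static bool
skip_log_allowed(time_t now)
{
    if (skip_warning_secs <= 0)
        return false;           /* feature disabled */

    if (last_skip_log_time != 0 &&
        now - last_skip_log_time < skip_warning_secs)
        return false;           /* too soon since the previous message */

    last_skip_log_time = now;
    return true;
}
```

With a plain boolean there is nothing to compare against, so every skip would produce a line.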
If the goal is to detect genuine autovacuum saturation, I think a better example would be a worker that completes an entire vacuum cycle having done no work at all because every candidate table was already held by another worker. That condition is already tracked, fires once per wasted cycle rather than once per table, and is a strong signal that a worker slot was completely wasted. That is worth a single LOG message.
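A minimal sketch of the "wasted cycle" check, again as standalone C with hypothetical names (WorkerCycleStats, record_table, cycle_was_wasted); it does not reflect the actual autovacuum.c data structures, only the shape of the condition I have in mind.

```c
#include <stdbool.h>

/* Hypothetical per-worker counters for one pass over the candidate tables. */
typedef struct WorkerCycleStats
{
    int         candidates;     /* tables this worker considered */
    int         skipped;        /* tables skipped: held by another worker */
    int         processed;      /* tables this worker actually worked on */
} WorkerCycleStats;

/* Account for one candidate table during the cycle. */
static void
record_table(WorkerCycleStats *stats, bool held_by_other_worker)
{
    stats->candidates++;
    if (held_by_other_worker)
        stats->skipped++;
    else
        stats->processed++;
}

/*
 * True if the worker had candidates but processed none of them, i.e. every
 * candidate was skipped. Evaluated once at the end of the cycle, so it can
 * produce at most one log line per cycle.
 */
static bool
cycle_was_wasted(const WorkerCycleStats *stats)
{
    return stats->candidates > 0 && stats->processed == 0;
}
```

A worker that skips two tables but vacuums a third is not reported; only the all-skipped case is.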

Also, regarding the name of the GUC, shouldn't we follow the log_autovacuum_* pattern?

Regards,
Demir.
