Re: Report oldest xmin source when autovacuum cannot remove tuples

From: Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>
To: Sami Imseih <samimseih(at)gmail(dot)com>
Cc: Japin Li <japinli(at)hotmail(dot)com>, wenhui qiu <qiuwenhuifx(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Report oldest xmin source when autovacuum cannot remove tuples
Date: 2026-06-26 09:07:48
Message-ID: CAOzEurS4WCPOzCm4j1fa2uJkJEw=4X_w8nyn+UBeWnANMsxWeQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thank you for the review!

On Tue, Jun 9, 2026 at 1:36 AM Sami Imseih <samimseih(at)gmail(dot)com> wrote:

> 1/ Reporting the GID of a prepared transaction.
>
> This seems unnecessary. The logged xid is sufficient since a DBA
> can look up the GID via pg_prepared_xacts using the "transaction"
> column. I think we can remove GetPreparedTransactionGid() and
> just report the xid.

I'd prefer to keep the GID. pg_prepared_xacts is volatile, so by the
time someone analyzes the VACUUM log, the transaction is usually gone.
Logging the GID makes retrospective investigation much easier for
DBAs. Given the minimal overhead, I've kept it in v6.

> 2/ Reporting the application_name of the standby.
>
> This could be valuable assuming the user sets application_name in the
> primary_conninfo string to make the standby distinguishable. The
> documentation [1] says this "should be set" for physical standbys,
> and in practice most deployments do so since
> synchronous_standby_names matches on it.
>
> However, instead of GetStandbyAppname(), can we use
> pgstat_get_beentry_by_proc_number() to get the application name and
> set it in dst->name while holding the shared ProcArrayLock? This is
> safe if we call pgstat_fetch_stat_numbackends() beforehand to prime
> the local backend status cache. The subsequent lookup is just a
> bsearch on local memory, so it adds negligible time under the lock.

Adopted, thanks. This is much cleaner than the hand-rolled scan. I
record a proc_number during the scan — for both the ProcArray case and
the slot-based standby case — and look the application_name up from
the walsender's backend status entry.

One difference from your sketch is that the lookup runs after all
locks are released, not while holding ProcArrayLock, so there is no
need to prime the snapshot beforehand and no pgstat work happens under
the lock. Keying on proc_number instead of pid makes it robust against
pid reuse, and I kept an st_procpid check as a sanity guard.
GetStandbyAppname() is removed.

> 3/ Over-allocation in GetXidHorizonBlockers().
>
> + /*
> + * Allocate enough space for every PGPROC plus all replication slots. This
> + * is a generous upper bound (typically only 0-2 entries are returned),
> + * but keeps the logic simple for a diagnostic function that runs
> + * infrequently.
> + */
> + max_blockers = arrayP->maxProcs + max_replication_slots;
> + result = palloc0_array(XidHorizonBlocker, max_blockers);
>
> This looks wasteful, especially with large max_connections. In
> practice only 1-3 entries are returned. Can we start with a small
> allocation (say 8) and repalloc when needed?

The other ProcArray scanners (GetCurrentVirtualXIDs,
GetConflictingVirtualXIDs) pre-size their result array to maxProcs and
allocate before taking ProcArrayLock, precisely so the scan never
allocates under the lock. Growing on demand would repalloc under
ProcArrayLock, so I kept the pre-sized allocation to match them.

I think the real issue was the per-entry size rather than the entry
count. The scan now collects into a lean candidate struct that does
not carry the inline name[Max(GIDSIZE, NAMEDATALEN)], so each entry is
~20 bytes, on par with the sibling scanners (VirtualTransactionId is
8). The array is still pre-sized to maxProcs and allocated before
ProcArrayLock, exactly as
GetCurrentVirtualXIDs()/GetConflictingVirtualXIDs() do, so we keep
avoiding any allocation under the lock and avoid repalloc-on-demand
under it.

The name (prepared-xact GID, standby application_name or slot name) is
resolved only for the one blocker that is reported, in a new
FillXidHorizonBlocker() that runs after the scan. A nice side effect
is that the per-blocker enrichment — GetPreparedTransactionGid() under
TwoPhaseStateLock, and the standby application_name lookup via
pgstat_get_beentry_by_proc_number() — now happens once for the
selected blocker instead of for every candidate. The slot-name lookup
reuses the existing ReplicationSlotName(index, &name).

This also keeps the infrastructure friendly to the future SQL-callable
view you mentioned: FillXidHorizonBlocker() is the reusable per-row
enrichment step, so a view that lists every blocker would just call it
per row, while the VACUUM log path calls it once.

> 4/ Simplify the loop body by removing candidate_* temporaries.
>
> The values we set in dst don't need to be tracked in separate
> variables. If dst gets populated for the proc, we can just do:
>
> ```
> if (dst)
> {
> dst->pid = proc->pid;
> dst->xid = horizon;
> dst->proc_number = pgprocno;
> }
> ```
>
> at the end of each iteration.

Done.

> 5/ GetXidHorizonBlocker() priority scan.
>
> +/*
> + * Get the highest-priority blocker holding back the xid horizon.
> + *
> + * Returns true and stores the blocker in *blocker if any are found.
> + */
>
> The priority scan makes sense and addresses the concern I raised
> earlier [2] about reporting the wrong blocker due to ProcArray
> ordering. Without it we report whichever blocker happens to come
> first in ProcArray order. If an xmin-holder sits at a lower index
> than the xid-holder, we would report the less actionable one. The
> xid-holder is more useful to surface because killing that single
> transaction advances the horizon, whereas xmin-holders only matter
> once all of them release.
>
> One small optimization: exit the loop early once we find an xid-match
> type, since nothing can beat it:
>
> for (int i = 0; i < nblockers; i++)
> {
> if (best == NULL || blockers[i].type < best->type)
> {
> best = &blockers[i];
>
> /* xid-match types can't be beaten, stop early */
> if (best->type <= XHB_PREPARED_TRANSACTION)
> break;
> }
> }

Added. The scan now breaks as soon as best->type <= XHB_PREPARED_TRANSACTION.

> Finally, the infrastructure added here (GetXidHorizonBlockers,
> XidHorizonBlocker struct, the enum-based priority) will also underpin
> the future SQL-callable view for real-time blocker reporting, so I
> think what is being done here is good for that case also. I would
> also think that GetXidHorizonBlockers() and friends should be a
> separate patch, and the vacuum logging and future SQL view work can
> build off of it.

Agreed. v6 is split into two patches:

0001 - GetXidHorizonBlockers()/GetXidHorizonBlocker(), the
XidHorizonBlocker struct and the priority enum, which is the reusable
infrastructure.
0002 - the VACUUM and autovacuum logging that consumes it.

On Thu, May 28, 2026 at 2:34 PM Shinya Kato <shinya11(dot)kato(at)gmail(dot)com> wrote:
> When hot_standby_feedback=on, the location where the standby's xmin is
> recorded on the primary depends on whether physical replication slot
> is used:
>
> - Without a replication slot: the xmin is held on the walsender's PGPROC
> - With a replication slot: the xmin is held on the replication slot itself
>
> I summarized this behavior in my blog:
> https://dev.to/shinyakato_/4-causes-of-table-bloat-in-postgresql-and-how-to-address-them-3ec9
>
> The current patch reports the latter case as XHB_REPLICATION_SLOT with
> the message "logical replication slot", which is misleading because it
> is actually a physical slot used by a standby with
> hot_standby_feedback=on. I will fix this and post an updated patch
> later.

These patches accomplish this.

--
Best regards,
Shinya Kato
NTT OSS Center

Attachment Content-Type Size
v6-0001-Add-infrastructure-to-identify-what-holds-back-th.patch application/x-patch 18.6 KB
v6-0002-Report-oldest-xmin-blocker-when-VACUUM-cannot-rem.patch application/x-patch 17.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeevan Chalke 2026-06-26 09:11:33 ON EMPTY clause for aggregate and window functions
Previous Message Chao Li 2026-06-26 08:58:50 Clear base backup progress reporting on error