Re: should vacuum's first heap pass be read-only?

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: should vacuum's first heap pass be read-only?
Date: 2022-04-05 20:29:38
Message-ID: CAH2-Wz=igvvGSPzXvhdgy3v69X02MK7Yg7To_YF4Oj2hrv3ZvA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 5, 2022 at 1:10 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I had assumed that this would not be the case, because if the page is
> being accessed by the workload, it can be pruned - and probably frozen
> too, if we wanted to write code for that and spend the cycles on it -
> and if it isn't, pruning and freezing probably aren't needed.

VACUUM has a top-down structure, and so seems to me to be the natural
place to think about the high level needs of the table as a whole,
especially over time.

I don't think we actually need to scan the pages in which we left
LP_DEAD items during previous VACUUM operations. It seems possible to
freeze newly appended pages quite often, without needlessly revisiting
the pages from previous batches (even those with LP_DEAD items left
behind). Maybe we need to rethink the definition of "VACUUM operation"
a little to do that, but it seems relatively tractable.

As I said upthread recently, I am excited about the potential of
"locking in" a set of scanned_pages using a local/private version of
the visibility map (a copy from just after OldestXmin is initially
established), that VACUUM can completely work off of. Especially if
combined with the conveyor belt, which could make VACUUM operations
suspendable and resumable.
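
To make that slightly more concrete, here is a rough sketch of the
kind of thing I have in mind. (The VMSnapshot struct and
vm_snapshot_create function are invented names, purely for
illustration -- nothing like this exists in vacuumlazy.c today.)

#include "postgres.h"

#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

typedef struct VMSnapshot
{
    BlockNumber rel_pages;  /* rel_pages "locked in" at snapshot time */
    uint8      *status;     /* per-page VM bits, copied exactly once */
} VMSnapshot;

static VMSnapshot *
vm_snapshot_create(Relation rel)
{
    VMSnapshot *snap = palloc0(sizeof(VMSnapshot));
    Buffer      vmbuffer = InvalidBuffer;

    snap->rel_pages = RelationGetNumberOfBlocks(rel);
    snap->status = palloc0(snap->rel_pages * sizeof(uint8));

    /* Copy the VM bits just after OldestXmin is established */
    for (BlockNumber blkno = 0; blkno < snap->rel_pages; blkno++)
        snap->status[blkno] = visibilitymap_get_status(rel, blkno, &vmbuffer);

    if (BufferIsValid(vmbuffer))
        ReleaseBuffer(vmbuffer);

    /*
     * scanned_pages is now fully determined by snap->status: every page
     * whose VISIBILITYMAP_ALL_VISIBLE bit was clear at snapshot time.
     * Persisting this (on the conveyor belt, say) is what would make the
     * VACUUM operation suspendable and resumable.
     */
    return snap;
}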

I don't see any reason why it wouldn't be possible to "lock in" an
initial scanned_pages, and then use that data structure (which could
be persisted) to avoid revisiting the pages that we know we already
visited (and left LP_DEAD items in). We could "resume the VACUUM
operation that was suspended earlier" a bit later, rather than having
several technically unrelated VACUUM operations in close succession.
The later rounds of processing could even use new cutoffs for both
pruning and freezing, despite being from "the same VACUUM operation".
They could have an "expanded rel_pages" that covers the newly appended
pages that we want to quickly freeze tuples on.

AFAICT the only thing that we need to do to make this safe is to carry
forward our original vacrel->NewRelfrozenXid (which can never be later
than our original vacrel->OldestXmin). Under this architecture, we
don't really "skip index vacuuming" at all. Rather, we redefine VACUUM
operations in a way that makes the final rel_pages provisional, at
least when run in autovacuum.
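
Something like this is the invariant I'm thinking of
(SuspendedVacuumState and resume_suspended_vacuum are invented names;
the only real point is the NewRelfrozenXid/OldestXmin relationship):

#include "postgres.h"

#include "access/transam.h"

typedef struct SuspendedVacuumState
{
    TransactionId NewRelfrozenXid;  /* carried forward from the original round */
    TransactionId OldestXmin;       /* cutoff used by the original round */
} SuspendedVacuumState;

static void
resume_suspended_vacuum(SuspendedVacuumState *saved)
{
    /*
     * The final relfrozenxid we set must never be later than the
     * OldestXmin that was in effect when the already-scanned pages were
     * processed.  Carrying the original NewRelfrozenXid forward is what
     * preserves that, even if later rounds use newer cutoffs for the
     * newly appended pages they scan.
     */
    Assert(TransactionIdPrecedesOrEquals(saved->NewRelfrozenXid,
                                         saved->OldestXmin));

    /*
     * ... continue from the persisted scanned_pages, skipping the pages
     * that the earlier round already visited (including those with
     * LP_DEAD items left behind) ...
     */
}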

VACUUM itself can notice that it might be a good idea to "expand
rel_pages" and expand the scope of the work it ultimately does, based
on the observed characteristics of the table. No heap pages get repeat
processing per "VACUUM operation" (relative to the current definition
of the term). Some indexes will get "extra, earlier index vacuuming",
which we've already said is the right way to think about all this (we
should think of it as extra index vacuuming, not less index
vacuuming).

> > But, these same LP_DEAD-heavy tables *also* have a very decent
> > chance of benefiting from a better index vacuuming strategy, something
> > *also* enabled by the conveyor belt design. So overall, in either scenario,
> > VACUUM concentrates on problems that are particular to a given table
> > and workload, without being hindered by implementation-level
> > restrictions.
>
> Well this is what I'm not sure about. We need to demonstrate that
> there are at least some workloads where retiring the LP_DEAD line
> pointers doesn't become the dominant concern.

It will eventually become the dominant concern. But that could take a
while, compared to the growth in indexes.

An LP_DEAD line pointer stub in a heap page is 4 bytes. The smallest
possible B-Tree index tuple is 20 bytes on mainstream platforms (16
bytes + 4 byte line pointer). Granted, deduplication makes this less
true, but it's far from guaranteed to help. Also, many tables have
way more than one index.
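
Spelling out the arithmetic (illustrative only -- this assumes a
single int4 key column and the usual 8 byte MAXALIGN):

#include "postgres.h"

#include "access/itup.h"        /* IndexTupleData */
#include "storage/itemid.h"     /* ItemIdData */

static void
show_bloat_arithmetic(void)
{
    /* A dead heap tuple can be truncated to just its line pointer stub */
    Size    heap_lp_dead = sizeof(ItemIdData);  /* 4 bytes */

    /* Smallest plausible btree entry: 8 byte header + int4 key, MAXALIGN'd */
    Size    index_tuple = MAXALIGN(sizeof(IndexTupleData) + sizeof(int32));  /* 16 */
    Size    index_total = index_tuple + sizeof(ItemIdData); /* 20, per index */

    elog(DEBUG1, "heap stub: %zu bytes, minimal btree entry: %zu bytes",
         heap_lp_dead, index_total);
}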

Of course it isn't nearly as simple as comparing the bytes of bloat in
each case. More generally, I don't claim that it's easy to
characterize which factor is more important, even in the abstract,
even under ideal conditions -- it's very hard. But I'm sure that there
are routinely very large differences among indexes and the heap
structure.

--
Peter Geoghegan
