Re: decoupling table and index vacuum

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: decoupling table and index vacuum
Date: 2021-09-25 01:17:21
Message-ID: CAH2-Wz=G-D9yZtcTFKPB9ymxTQ2Q4gy_8-3_XkD9Zikti=RJSw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 23, 2021 at 10:42 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> On Thu, Sep 16, 2021 at 7:09 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > Enabling index-only scans is a good enough reason to pursue this
> > project, even on its own.
>
> +1

I was hoping that you might be able to work on opportunistically
freezing whole pages for Postgres 15. I think that it would make sense
to opportunistically make a page that is about to become all_visible
during VACUUM become all_frozen instead. Our goal is to make most
pages skip all_visible and go straight to all_frozen. Often
the page won't need to be dirtied again, ever.

Right now freezing is something that we mostly just think about as
occurring at the level of tuples, which doesn't seem ideal. This seems
related to Robert's project because both projects are connected to the
question of how autovacuum scheduling works in general. We will
probably need to rethink things like the vacuum_freeze_min_age GUC. (I
also think that we might need to reconsider how
aggressive/anti-wraparound VACUUMs work, but that's another story.)

Obviously this is a case of performing work eagerly; a form of
speculation that tries to lower costs in the aggregate, over time.
Heuristics that work well on average seem possible, but even excellent
heuristics could be wrong -- in the end we're trying to predict the
future, which is inherently impossible to do reliably for all
workloads. I think that that will be okay, provided that the cost of
being wrong is kept low and *fixed* (the exact meaning of "fixed"
will need to be pinned down, but the basic idea is that any regression
is once per page, not once per page per VACUUM or something).
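
To make the "fixed" bound concrete, here is a rough sketch in Python.
The function name, the cost unit, and the numbers are all made up for
illustration; nothing here reflects an actual patch or measurement. It
only contrasts a regression that is paid once per page with one that
recurs on every VACUUM.

```python
def worst_case_wasted_cost(n_pages, n_vacuums, cost_per_page, once_per_page=True):
    """Upper bound on wasted freezing work if every eager guess is wrong.

    cost_per_page is an arbitrary, hypothetical unit (e.g. WAL bytes).
    """
    if once_per_page:
        # "Fixed" regression: each page can be frozen wastefully at most once.
        return n_pages * cost_per_page
    # Unbounded regression: the same page pays again on every VACUUM.
    return n_pages * n_vacuums * cost_per_page


# Hypothetical table: 1,000 pages, vacuumed 50 times.
fixed = worst_case_wasted_cost(1_000, 50, cost_per_page=1)
recurring = worst_case_wasted_cost(1_000, 50, cost_per_page=1, once_per_page=False)
print(fixed, recurring)
```

In the first case the worst-case cost does not grow with the number of
VACUUMs, which is the property the heuristic would need to preserve.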

Once it's cheap enough to freeze a whole page early (i.e. the
headers of all of its tuples), then the implementation can be wrong
95%+ of the time, and maybe we'll still win by a lot. That may sound
bad, until you realize that it's 95% *per VACUUM* -- the entire
situation is much better once you think about the picture for the
entire table over time and across many different VACUUM operations,
and once you think about FPIs in the WAL stream. We'll be paying the
cost of freezing in smaller and more predictable increments, too,
which can make the whole system more robust. Having many pages all go
from all_visible to all_frozen at the same time (just because they
crossed some usually-meaningless XID-based threshold) is actually
quite risky (this is why I mentioned aggressive VACUUMs in passing).

The hard part is getting the cost way down. lazy_scan_prune() uses
xl_heap_freeze_tuple records for each tuple it freezes. These
obviously have a lot of redundancy across tuples from the same page in
practice. And the WAL overhead is much larger just because these are
per-tuple records, not per-page records. Getting the cost down is hard
because of issues with MultiXacts, freezing xmin but not freezing xmax
at the same time, etc.
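
One way to picture the redundancy mentioned above: if the freeze
entries for a page differ only in their item offset, they could share
a single description plus a list of offsets. The sketch below is
Python, not PostgreSQL code; FreezeEntry, dedup_freeze_entries and the
flag values are hypothetical stand-ins, and it ignores the MultiXact
and xmin/xmax complications just noted.

```python
from collections import defaultdict
from typing import NamedTuple


class FreezeEntry(NamedTuple):
    """Hypothetical per-tuple freeze description (placeholder fields)."""
    offset: int      # item offset of the tuple within the page
    xmax: int
    infomask: int
    infomask2: int
    frzflags: int


def dedup_freeze_entries(entries):
    """Group entries that are identical apart from their offset."""
    plans = defaultdict(list)
    for e in entries:
        key = (e.xmax, e.infomask, e.infomask2, e.frzflags)
        plans[key].append(e.offset)
    # One shared description per distinct key, with its sorted offsets.
    return {key: sorted(offsets) for key, offsets in plans.items()}


# Made-up page: 20 tuples frozen identically, plus one that differs.
page = [FreezeEntry(off, 0, 0x0B00, 0x0003, 0x02) for off in range(1, 21)]
page.append(FreezeEntry(21, 0, 0x0900, 0x0003, 0x02))

plans = dedup_freeze_entries(page)
print(len(page), "per-tuple entries ->", len(plans), "shared descriptions")
```

How much this would save in practice depends on how uniform the tuples
on a page really are, which the sketch does not attempt to measure.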

> Logging how vacuum uses and sets VM bits seems a good idea.

> I think that we will end up doubly counting the page as scanned_pages
> and allfrozen_pages due to the newly added latter change. This seems
> wrong to me because we calculate as follows:

I agree that that's buggy. Oops.

It was just a prototype that I wrote for my own work. I do think that
we should have a patch that has some of this, for users, but I am not
sure about the details just yet. This is probably too much information
for users, but I think it will take me more time to decide what really
does matter to users.

--
Peter Geoghegan
