Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Date: 2022-02-20 20:27:25
Message-ID: CAH2-WzkD9gX4MAGhmEX=_MbTjDXGr=1zCDSnC=BUdRPywJy64A@mail.gmail.com
Lists: pgsql-hackers

On Sun, Feb 20, 2022 at 7:30 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Right, so we at least need to add a similar comment to what I proposed
> for MXIDs, and maybe other changes are needed, too.

Agreed.

> > Maybe the way that FreezeLimit/cutoff_xid is overloaded can be fixed
> > here, to make all of this less confusing. I only now fully realized
> > how confusing all of this stuff is -- very.
>
> Right. I think I understand all of this, or at least most of it -- but
> not from the comment. The question is how the comment can be more
> clear. My general suggestion is that function header comments should
> have more to do with the behavior of the function than how it fits
> into the bigger picture. If it's clear to the reader what conditions
> must hold before calling the function and which must hold on return,
> it helps a lot. IMHO, it's the job of the comments in the calling
> function to clarify why we then choose to call that function at the
> place and in the way that we do.

You've given me a lot of high quality feedback on all of this, which
I'll work through soon. It's hard to get the balance right here, but
it's made much easier by this kind of feedback.

> I think that the idea has potential, but I don't think that I
> understand yet what the *exact* algorithm is.

The algorithm seems to exploit a natural tendency that Andres once
described in a blog post about his snapshot scalability work [1]. To a
surprising extent, we can usefully bucket all tuples/pages into two
simple categories:

1. Very, very old ("infinitely old" for all practical purposes).

2. Very, very new.

There doesn't seem to be much need for a third "in-between" category
in practice. This seems to be at least approximately true all of the
time.

Perhaps Andres wouldn't agree with this very general statement -- he
actually said something more specific. I for one believe that the
point he made generalizes surprisingly well, though. I have my own
theories about why this appears to be true. (Executive summary: power
laws are weird, and it seems as if the sparsity-of-effects principle
makes it easy to bucket things at the highest level, in a way that
generalizes well across disparate workloads.)
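
To make the two-bucket intuition concrete, here is a toy standalone
sketch (nothing from the patch itself -- the TransactionId handling
is deliberately simplified, and classify_page() is a made-up helper):

#include <stdint.h>

typedef uint32_t TransactionId;

typedef enum PageAgeBucket
{
    PAGE_VERY_OLD,  /* "infinitely old": cheap to freeze wholesale */
    PAGE_VERY_NEW   /* still being actively modified: leave alone */
} PageAgeBucket;

/*
 * Classify a heap page against VACUUM's OldestXmin cutoff.  A real
 * version would need wraparound-aware comparisons, i.e.
 * TransactionIdPrecedes(); this toy version ignores all that.
 */
static PageAgeBucket
classify_page(TransactionId newest_xid_on_page, TransactionId oldest_xmin)
{
    return (newest_xid_on_page < oldest_xmin) ? PAGE_VERY_OLD
                                              : PAGE_VERY_NEW;
}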

> Maybe I need to read the
> code, when I have some time for that. I can't form an intelligent
> opinion at this stage about whether this is likely to be a net
> positive.

The code in the v8-0002 patch is a bit sloppy right now. I didn't
quite get around to cleaning it up -- I was focussed on performance
validation of the algorithm itself. So bear that in mind if you do
look at v8-0002 (might want to wait for v9-0002 before looking).

I believe that the only essential thing about the algorithm itself is
that it freezes all the tuples on a page when it anticipates setting
the page all-visible, or (barring edge cases) freezes none at all.
(Note that setting the page all-visible/all-frozen may happen just
after lazy_scan_prune returns, or in the second pass over the heap,
after LP_DEAD items are set to LP_UNUSED -- lazy_scan_prune doesn't
care which way it will happen.)
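
To show roughly what I mean in code, here is a simplified standalone
sketch of that all-or-nothing rule (again, not the v8-0002 code --
PruneStateStub, HeapPageStub, and freeze_tuple() are all hypothetical
stand-ins):

#include <stdbool.h>
#include <stddef.h>

typedef struct PruneStateStub
{
    bool    all_visible;     /* expect to set the page all-visible? */
    bool    all_frozen;      /* did we freeze everything eligible? */
} PruneStateStub;

typedef struct HeapPageStub
{
    size_t  ntuples;
    bool   *freeze_eligible; /* one flag per tuple */
} HeapPageStub;

static void
freeze_tuple(HeapPageStub *page, size_t i)
{
    /* stub -- the real work is heap_prepare_freeze_tuple() and friends */
    (void) page;
    (void) i;
}

static void
decide_page_freezing(HeapPageStub *page, PruneStateStub *ps,
                     bool xid_age_forces_freezing)
{
    if (ps->all_visible || xid_age_forces_freezing)
    {
        /*
         * We anticipate setting this page all-visible (just after
         * lazy_scan_prune returns, or in the second heap pass), so
         * freeze every eligible tuple while we're dirtying the page
         * anyway.
         */
        for (size_t i = 0; i < page->ntuples; i++)
        {
            if (page->freeze_eligible[i])
                freeze_tuple(page, i);
        }
        ps->all_frozen = true;
    }

    /*
     * Otherwise freeze nothing at all; a future VACUUM can freeze the
     * whole page cheaply once its contents settle.
     */
}

The important property is that the unit of freezing becomes the page,
not the tuple -- there is no per-tuple middle ground, barring the
forcing edge case discussed below.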

There are one or two other design choices that we need to make, like
what exact tuples we freeze in the edge case where FreezeLimit/XID age
forces us to freeze in lazy_scan_prune. These other design choices
don't seem relevant to the issue of central importance, which is
whether or not we come out ahead overall with this new algorithm.
FreezeLimit will seldom affect our choice to freeze or not freeze now,
and so AFAICT the exact way that FreezeLimit affects which precise
freezing-eligible tuples we freeze doesn't complicate performance
validation.
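
For reference, the forcing condition itself is simple enough to sketch
in the same toy style (in PostgreSQL proper, FreezeLimit comes out of
vacuum_set_xid_limits(), derived from OldestXmin and
vacuum_freeze_min_age; everything below is simplified and
wraparound-unaware):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;     /* same toy stand-in as before */

static TransactionId
compute_freeze_limit(TransactionId oldest_xmin, uint32_t freeze_min_age)
{
    /*
     * Real code handles XID wraparound and clamps at
     * FirstNormalTransactionId; this toy version doesn't.
     */
    return oldest_xmin - freeze_min_age;
}

/*
 * A tuple whose xmin precedes FreezeLimit must be frozen regardless of
 * the page-level decision -- the forcing edge case discussed above.
 */
static bool
xid_forces_freeze(TransactionId tuple_xmin, TransactionId freeze_limit)
{
    return tuple_xmin < freeze_limit;   /* wraparound-unaware */
}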

Remember when I got excited about how my big TPC-C benchmark run
showed a predictable, tick/tock style pattern across VACUUM operations
against the orders and order lines tables [2]? It seemed very
significant to me that the OldestXmin of VACUUM operation n
consistently went on to become the new relfrozenxid for the same table
in VACUUM operation n + 1. It wasn't exactly the same XID, but very
close to it (within the range of noise). This pattern was clearly
present, even though VACUUM operation n + 1 might happen as long as 4
or 5 hours after VACUUM operation n (this was a big table).

This pattern was encouraging to me because it showed (at least for the
workload and tables in question) that the amount of unnecessary extra
freezing can't have been too bad -- the fact that we can always
advance relfrozenxid in the same way is evidence of that. Note that
the vacuum_freeze_min_age setting can't have affected our choice of
what to freeze (given what we see in the logs), and yet there is a
clear pattern where the pages (it's really pages, not tuples) that the
new algorithm doesn't freeze in VACUUM operation n will reliably get
frozen in VACUUM operation n + 1 instead.

And so this pattern seems to lend support to the general idea of
letting the workload itself be the primary driver of what pages we
freeze (not FreezeLimit, and not anything based on XIDs). That's
really the underlying principle behind the new algorithm -- freezing
is driven by workload characteristics (or page/block characteristics,
if you prefer). ISTM that vacuum_freeze_min_age is almost impossible
to tune -- XID age is just too squishy a concept for that to ever
work.

[1] https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/improving-postgres-connection-scalability-snapshots/ba-p/1806462#interlude-removing-the-need-for-recentglobalxminhorizon
[2] https://postgr.es/m/CAH2-Wz=iLnf+0CsaB37efXCGMRJO1DyJw5HMzm7tp1AxG1NR2g@mail.gmail.com
-- scroll down to "TPC-C", which has the relevant autovacuum log
output for the orders table, covering a 24-hour period

--
Peter Geoghegan
