Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Date: 2021-11-23 01:07:46
Message-ID: CAH2-WzmWk8zJGfjwyczjohX6CS=8aGoFpivTWg4aF28xCQ0VnA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 22, 2021 at 11:29 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> Hm. I'm a bit doubtful that there's all that many cases where it's worth not
> pruning during vacuum. However, it seems much more common for opportunistic
> pruning during non-write accesses.

Fair enough. I just wanted to suggest an exploratory conversation
about pruning (among several other things). I'm mostly saying: hey,
pruning during VACUUM isn't actually that special, at least not with
this refactoring patch in place. So maybe it makes sense to go
further, in light of that general observation about pruning in VACUUM.

Maybe it wasn't useful to even mention this aspect now. I would rather
focus on freezing optimizations for now -- that's much more promising.

> Perhaps checking whether we'd log an FPW would be a better criteria for
> deciding whether to prune or not compared to whether we're dirtying the page?
> IME the WAL volume impact of FPWs is a considerably bigger deal than
> unnecessarily dirtying a page that has previously been dirtied in the same
> checkpoint "cycle".

Agreed. (I tend to say the former when I really mean the latter, which
I should try to avoid.)
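For what it's worth, the FPW criterion is cheap to evaluate. Here's a standalone sketch (simplified types, not the real code) of the underlying rule, which is along the lines of what XLogCheckBufferNeedsBackup() tests: WAL-logging a change emits a full-page image iff the page hasn't already been WAL-logged since the current checkpoint's redo pointer:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/*
 * Sketch: a change to a page forces a full-page image iff the page's
 * LSN is at or before the current checkpoint's redo pointer (i.e. the
 * page hasn't been dirtied and logged since the checkpoint began).
 * An opportunistic-prune heuristic could decline to prune when this
 * predicate is true.
 */
static bool
prune_would_emit_fpw(XLogRecPtr page_lsn, XLogRecPtr redo_ptr)
{
    return page_lsn <= redo_ptr;
}
```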

> IDK, the potential of not having usable space on an overly fragmented page
> doesn't seem that low. We can't just mark such pages as all-visible because
> then we'll potentially never reclaim that space.

Don't get me started on this - because I'll never stop.

It makes zero sense that we don't think about free space holistically,
using the whole context of what changed in the recent past. As I think
you know already, a higher level concept (like open and closed pages)
seems like the right direction to me -- because it isn't sensible to
treat X bytes of free space in one heap page as essentially
interchangeable with any other space on any other heap page. That
misses an enormous number of things that matter. The all-visible
status of a page is just one such thing.

> IMO the big difference between aggressive / non-aggressive isn't whether we
> wait for a cleanup lock, but that we don't skip all-visible pages...

I know what you mean by that, of course. But FWIW that definition
seems too focused on what actually happens today, rather than what is
essential given the invariants we have for VACUUM. And so I personally
prefer to define it as "a VACUUM that *reliably* advances
relfrozenxid". This looser definition will probably "age" well (ahem).

> > This new relfrozenxid optimization might not be all that valuable on its
> > own, but it may still facilitate future work that makes non-aggressive
> > VACUUMs more conscious of the benefit of advancing relfrozenxid sooner
> > rather than later. In general it would be useful for non-aggressive
> > VACUUMs to be "more aggressive" opportunistically (e.g., by waiting for
> > a cleanup lock once or twice if needed).
>
> What do you mean by "waiting once or twice"? A single wait may simply never
> end on a busy page that's constantly pinned by a lot of backends...

I was speculating about future work again. I think that you've taken
my words too literally. This is just a draft commit message, just a
way of framing what I'm really trying to do.

Sure, it wouldn't be okay to wait *indefinitely* for any one pin in a
non-aggressive VACUUM -- so "at least waiting for one or two pins
during non-aggressive VACUUM" might not have been the best way of
expressing the idea that I wanted to express. The important point is
that _we can make a choice_ about stuff like this dynamically, based
on the observed characteristics of the table, and some general ideas
about the costs and benefits (of waiting or not waiting, or of how
long we want to wait in total, whatever might be important). This
probably just means adding some heuristics that are pretty sensitive
to any reason to not do more work in a non-aggressive VACUUM, without
*completely* balking at doing even a tiny bit more work.

For example, we can definitely afford to wait a few more milliseconds
to get a cleanup lock just once, especially if we're already pretty
sure that that's all the extra work that it would take to ultimately
be able to advance relfrozenxid in the ongoing (non-aggressive) VACUUM
-- it's easy to make that case. Once you agree that it makes sense
under these favorable circumstances, you've already made
"aggressiveness" a continuous thing conceptually, at a high level.
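To make that concrete, here's a purely hypothetical sketch (all names invented, nothing from the patch) of what "continuous aggressiveness" might look like for cleanup-lock waits: a small wait budget for the whole VACUUM, spent only while advancing relfrozenxid still looks attainable:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical policy (names invented): instead of the binary
 * aggressive/non-aggressive choice, give a non-aggressive VACUUM a
 * small total budget of milliseconds it may spend waiting for cleanup
 * locks, and only spend it while relfrozenxid advancement remains
 * attainable (i.e. no page has been skipped that blocks it).
 */
typedef struct WaitPolicy
{
    int  budget_ms;          /* total ms we're willing to wait, whole VACUUM */
    bool advance_attainable; /* relfrozenxid advancement still possible? */
} WaitPolicy;

/* How long may we wait for a cleanup lock on this page? */
static int
cleanup_wait_allowance(const WaitPolicy *p, int per_page_cap_ms)
{
    if (!p->advance_attainable || p->budget_ms <= 0)
        return 0;               /* don't wait at all */
    return per_page_cap_ms < p->budget_ms ? per_page_cap_ms : p->budget_ms;
}

/* Deduct the time actually spent waiting from the budget. */
static void
charge_wait(WaitPolicy *p, int waited_ms)
{
    p->budget_ms -= waited_ms;
}
```

The point of the sketch is only that the decision becomes dynamic and bounded, never an indefinite wait.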

The current binary definition of "aggressive" is needlessly
restrictive -- that much seems clear to me. I'm much less sure of what
specific alternative should replace it.

I've already prototyped advancing relfrozenxid using a dynamically
determined value, so that our final relfrozenxid is just about the
most recent safe value (not the original FreezeLimit). That's been
interesting. Consider this log output from an autovacuum with the
prototype patch (also uses my new instrumentation), based on standard
pgbench (just tuned heap fill factor a bit):

LOG: automatic vacuum of table "regression.public.pgbench_accounts":
index scans: 0
pages: 0 removed, 909091 remain, 33559 skipped using visibility map
(3.69% of total)
tuples: 297113 removed, 50090880 remain, 90880 are dead but not yet removable
removal cutoff: oldest xmin was 29296744, which is now 203341 xact IDs behind
index scan not needed: 0 pages from table (0.00% of total) had 0 dead
item identifiers removed
I/O timings: read: 55.574 ms, write: 0.000 ms
avg read rate: 17.805 MB/s, avg write rate: 4.389 MB/s
buffer usage: 1728273 hits, 23150 misses, 5706 dirtied
WAL usage: 594211 records, 0 full page images, 35065032 bytes
system usage: CPU: user: 6.85 s, system: 0.08 s, elapsed: 10.15 s

All of the autovacuums against the accounts table look similar to this
one -- you don't see anything about relfrozenxid being advanced
(because it isn't). Whereas for the smaller pgbench tables, every
single VACUUM successfully advances relfrozenxid to a fairly recent
XID (without there ever being an aggressive VACUUM) -- just because
VACUUM needs to visit every page for the smaller tables. The accounts
table, by contrast, doesn't generally need 100% of its pages touched
by VACUUM -- it's more like 95% there. Does that really make sense,
though?

I'm pretty sure that less aggressive VACUUMing (e.g. higher
scale_factor setting) would lead to more aggressive setting of
relfrozenxid here. I'm always suspicious when I see insignificant
differences that lead to significant behavioral differences. Am I
worried over nothing here? Perhaps -- we don't really need to advance
relfrozenxid early with this table/workload anyway. But I'm not so
sure.

Again, my point is that there is a good chance that redefining
aggressiveness in some way will be helpful. A more creative, flexible
definition might be just what we need. The details are very much up in
the air, though.

> > It would also be generally useful if aggressive VACUUMs were "less
> > aggressive" opportunistically (e.g. by being responsive to query
> > cancellations when the risk of wraparound failure is still very low).
>
> Being cancelable is already a different concept than anti-wraparound
> vacuums. We start aggressive autovacuums at vacuum_freeze_table_age, but
> anti-wrap only at autovacuum_freeze_max_age.

You know what I meant. Also, did *you* mean "being cancelable is
already a different concept to *aggressive* vacuums"? :-)

> The problem is that the
> autovacuum scheduling is way too naive for that to be a significant benefit -
> nothing tries to schedule autovacuums so that they have a chance to complete
> before anti-wrap autovacuums kick in. All that vacuum_freeze_table_age does is
> to promote an otherwise-scheduled (auto-)vacuum to an aggressive vacuum.

Not sure what you mean about scheduling. vacuum_freeze_table_age is
mostly in place so that scripted off-hours VACUUMs (run overnight,
during low activity) freeze tuples before any autovacuum worker gets
the chance, since the latter may run at a much less convenient time.
Sure, vacuum_freeze_table_age might also force a regular autovacuum
worker to do an aggressive VACUUM -- but I think it's mostly intended
for a manual overnight VACUUM. Not usually very helpful, but also not
harmful.

Oh, wait. I think that you're talking about how autovacuum workers in
particular tend to be affected by this. We launch an av worker that
wants to clean up bloat, but it ends up being aggressive (and maybe
taking way longer), perhaps quite randomly, only due to
vacuum_freeze_table_age (not due to autovacuum_freeze_max_age). Is
that it?

> This is one of the most embarrassing issues around the whole anti-wrap
> topic. We kind of define it as an emergency that there's an anti-wraparound
> vacuum. But we have *absolutely no mechanism* to prevent them from occurring.

What do you mean? Only an autovacuum worker can do an anti-wraparound
VACUUM (which is not quite the same thing as an aggressive VACUUM).

I agree that anti-wraparound autovacuum is way too unfriendly, though.

> > We now also collect LP_DEAD items in the dead_tuples array in the case
> > where we cannot immediately get a cleanup lock on the buffer. We cannot
> > prune without a cleanup lock, but opportunistic pruning may well have
> > left some LP_DEAD items behind in the past -- no reason to miss those.
>
> This has become *much* more important with the changes around deciding when to
> index vacuum. It's not just that opportunistic pruning could have left LP_DEAD
> items, it's that a previous vacuum is quite likely to have left them there,
> because the previous vacuum decided not to perform index cleanup.

I haven't seen any evidence of that myself (with the optimization
added to Postgres 14 by commit 5100010ee4). I still don't understand
why you doubted that work so much. I'm not saying that you're wrong
to; I'm saying that I don't think that I understand your perspective
on it.

What I have seen in my own tests (particularly with BenchmarkSQL) is
that most individual tables either never apply the optimization even
once (because the table reliably has heap pages with many more LP_DEAD
items than the 2%-of-relpages threshold), or will never need to
(because there are precisely zero LP_DEAD items anyway). Remaining
tables that *might* use the optimization tend to not go very long
without actually getting a round of index vacuuming. It's just too
easy for updates (and even aborted xact inserts) to introduce new
LP_DEAD items for us to go long without doing index vacuuming.
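The threshold itself (BYPASS_THRESHOLD_PAGES, 0.02 in vacuumlazy.c) is easy to restate. A simplified standalone sketch of the commit 5100010ee4 test -- the real code also caps the dead-item array at what fits in 32MB of TIDs, which is just a parameter here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the Postgres 14 index-vacuuming bypass test, simplified:
 * skip index vacuuming when pages containing LP_DEAD items are under
 * 2% of rel_pages AND the total dead-item count is small.  The real
 * cap corresponds to a 32MB dead-TID array; parameterized here.
 */
#define BYPASS_THRESHOLD_PAGES 0.02 /* same constant as vacuumlazy.c */

static bool
bypass_index_vacuum(int64_t rel_pages, int64_t lpdead_item_pages,
                    int64_t lpdead_items, int64_t dead_items_cap)
{
    int64_t threshold = (int64_t) (BYPASS_THRESHOLD_PAGES * rel_pages);

    return lpdead_item_pages < threshold && lpdead_items < dead_items_cap;
}
```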

If you can be more concrete about a problem you've seen, then I might
be able to help. It's not like there are no options in this already. I
already thought about introducing a small degree of randomness into
the process of deciding to skip or to not skip (in the
consider_bypass_optimization path of lazy_vacuum() on Postgres 14).
The optimization is mostly valuable because it allows us to do more
useful work in VACUUM -- not because it allows us to do less useless
work in VACUUM. In particular, it allows us to tune
autovacuum_vacuum_insert_scale_factor very aggressively with an
append-only table, without useless index vacuuming making it all but
impossible for autovacuum to get to the useful work.

> > Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic
> > technique is independently capable of cleaning up line pointer bloat),
>
> One thing we could do around this, btw, would be to aggressively replace
> LP_REDIRECT items with their target item. We can't do that in all situations
> (somebody might be following a ctid chain), but I think we have all the
> information needed to do so. Probably would require a new HTSV RECENTLY_LIVE
> state or something like that.

Another idea is to truncate the line pointer during pruning (including
opportunistic pruning). Matthias van de Meent has a patch for that.

I am not aware of a specific workload where the patch helps, but that
doesn't mean that there isn't one, or that it doesn't matter. It's
subtle enough that I might have just missed something. I *expect* the
true damage over time to be very hard to model or understand -- I
imagine the potential for weird feedback loops is there.

> I think that'd be quite a win - we right now often "migrate" to other pages
> for modifications not because we're out of space on a page, but because we run
> out of itemids (for debatable reasons MaxHeapTuplesPerPage constrains the
> number of line pointers, not just the number of actual tuples). Effectively
> doubling the number of available line items in common cases in a number of
> realistic / common scenarios would be quite the win.

I believe Masahiko is working on this in the current cycle. It would
be easier if we had a better sense of how increasing
MaxHeapTuplesPerPage will affect tidbitmap.c. But the idea of
increasing that seems sound to me.
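The arithmetic behind the constraint, for a stock 8kB-block build: MaxHeapTuplesPerPage charges every slot a full MAXALIGN'd tuple header, even though a line pointer is only 4 bytes. A standalone restatement with the relevant constants inlined:

```c
#include <assert.h>

/*
 * Why line pointers run out before space does: MaxHeapTuplesPerPage
 * assumes every tuple has a full (MAXALIGN'd) header, even though a
 * page full of pruned HOT chains holds only 4-byte line pointers.
 * Constants below match a stock 8kB-block build.
 */
#define BLCKSZ                 8192
#define SIZE_OF_PAGE_HEADER      24 /* SizeOfPageHeaderData */
#define SIZEOF_HEAP_TUPLE_HDR    23 /* SizeofHeapTupleHeader */
#define SIZEOF_ITEM_ID            4 /* sizeof(ItemIdData) */
#define MAXALIGN8(x)           (((x) + 7) & ~7)

static int
max_heap_tuples_per_page(void)
{
    return (BLCKSZ - SIZE_OF_PAGE_HEADER) /
           (MAXALIGN8(SIZEOF_HEAP_TUPLE_HDR) + SIZEOF_ITEM_ID);
}
```

That works out to 291, while a page of nothing but line pointers could physically hold (8192 - 24) / 4, i.e. roughly 2000 -- which is why relaxing the cap can effectively double (or better) the usable line pointers.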

> > Note that we no longer report on "pin skipped pages" in VACUUM VERBOSE,
> > since there is barely any real practical sense in which we actually
> > miss doing useful work for these pages. Besides, this information
> > always seemed to have little practical value, even to Postgres hackers.
>
> -0.5. I think it provides some value, and I don't see why the removal of the
> information should be tied to this change. It's hard to diagnose why some dead
> tuples aren't cleaned up - a common cause for that on smaller tables is that
> nearly all pages are pinned nearly all the time.

Is that still true, though? If it turns out that we need to leave it
in, then I can do that. But I'd prefer to wait until we have more
information before making a final decision. Remember, the high level
idea of this whole patch is that we do as much work as possible for
any scanned_pages, which now includes pages that we never successfully
acquired a cleanup lock on. And so we're justified in assuming that
they're exactly equivalent to pages that we did get a cleanup lock on --
that's now the working assumption. I know that that's not literally
true, but that doesn't mean it's not a useful fiction -- it should be
very close to the truth.

Also, I would like to put more information (much more useful
information) in the same log output. Perhaps that will be less
controversial if I take something useless away first.

> I wonder if we could have a more restrained version of heap_page_prune() that
> doesn't require a cleanup lock? Obviously we couldn't defragment the page, but
> it's not immediately obvious that we need it if we constrain ourselves to only
> modify tuple versions that cannot be visible to anybody.
>
> Random note: I really dislike that we talk about cleanup locks in some parts
> of the code, and super-exclusive locks in others :(.

Somebody should normalize that.

> > + /*
> > + * Aggressive VACUUM (which is the same thing as anti-wraparound
> > + * autovacuum for most practical purposes) exists so that we'll reliably
> > + * advance relfrozenxid and relminmxid sooner or later. But we can often
> > + * opportunistically advance them even in a non-aggressive VACUUM.
> > + * Consider if that's possible now.
>
> I don't agree with the "most practical purposes" bit. There's a huge
> difference because manual VACUUMs end up aggressive but not anti-wrap once
> older than vacuum_freeze_table_age.

Okay.

> > + * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
> > + * the rel_pages used by lazy_scan_prune, from before a possible relation
> > + * truncation took place. (vacrel->rel_pages is now new_rel_pages.)
> > + */
>
> I think it should be doable to add an isolation test for this path. There have
> been quite a few bugs around the wider topic...

I would argue that we already have one -- vacuum-reltuples.spec. I had
to update its expected output in the patch. I would argue that the
behavioral change (count tuples on a pinned-by-cursor heap page) that
necessitated updating the expected output for the test is an
improvement overall.

> > + {
> > + /* Can safely advance relfrozen and relminmxid, too */
> > + Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
> > + orig_rel_pages);
> > + vac_update_relstats(rel, new_rel_pages, new_live_tuples,
> > + new_rel_allvisible, vacrel->nindexes > 0,
> > + FreezeLimit, MultiXactCutoff, false);
> > + }
>
> I wonder if this whole logic wouldn't become easier and less fragile if we
> just went for maintaining the "actually observed" horizon while scanning the
> relation. If we skip a page via VM set the horizon to invalid. Otherwise we
> can keep track of the accurate horizon and use that. No need to count pages
> and stuff.

There is no question that that makes sense as an optimization -- my
prototype convinced me of that already. But I don't think that it can
simplify anything (not even the call to vac_update_relstats itself, to
actually update relfrozenxid at the end). Fundamentally, this will
only work if we decide to only skip all-frozen pages, which (by
definition) only happens within aggressive VACUUMs. Isn't it that
simple?
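For reference, the bookkeeping itself is simple. A standalone sketch of the "observed horizon" idea (plain integers, no wraparound-aware XID arithmetic) -- note how skipping any page that isn't all-frozen immediately gives up the ability to advance relfrozenxid:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Sketch: track the oldest unfrozen XID actually observed while
 * scanning.  Skipping an all-frozen page is fine -- by definition it
 * contains no unfrozen XIDs.  Skipping an all-visible-but-not-frozen
 * page invalidates the horizon, since that page may hold arbitrarily
 * old unfrozen XIDs.
 */
typedef struct ObservedHorizon
{
    bool          valid;
    TransactionId oldest_xid;   /* oldest unfrozen XID seen so far */
} ObservedHorizon;

static void
observe_page_xid(ObservedHorizon *h, TransactionId page_oldest)
{
    if (h->valid && page_oldest < h->oldest_xid)
        h->oldest_xid = page_oldest;
}

static void
skip_page(ObservedHorizon *h, bool all_frozen)
{
    if (!all_frozen)
        h->valid = false;       /* can no longer advance relfrozenxid */
}

static TransactionId
final_relfrozenxid(const ObservedHorizon *h, TransactionId old_value)
{
    return h->valid ? h->oldest_xid : old_value;
}
```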

You recently said (on the heap-pruning-14-bug thread) that you don't
think it would be practical to always set a page all-frozen when we
see that we're going to set it all-visible -- apparently you feel that
we could never opportunistically freeze early such that all-visible
but not all-frozen pages practically cease to exist. I'm still not
sure why you believe that (though you may be right, or I might have
misunderstood, since it's complicated). It would certainly benefit
this dynamic relfrozenxid business if it was possible, though. If we
could somehow make that work, then almost every VACUUM would be able
to advance relfrozenxid, independently of aggressive-ness -- because
we wouldn't have any all-visible-but-not-all-frozen pages to skip
(that important detail wouldn't be left to chance).

> > - if (skipping_blocks && !FORCE_CHECK_PAGE())
> > + if (skipping_blocks &&
> > + !(blkno == nblocks - 1 && should_attempt_truncation(vacrel)))
> > {
> > /*
> > * Tricky, tricky. If this is in aggressive vacuum, the page
>
> I find the FORCE_CHECK_PAGE macro decidedly unhelpful. But I don't like
> mixing such changes within a larger change doing many other things.

I got rid of FORCE_CHECK_PAGE() itself in this patch (not a later
patch) because the patch also removes the only other
FORCE_CHECK_PAGE() call -- and the latter change is very much in scope
for the big patch (can't be broken down into smaller changes, I
think). And so this felt natural to me. But if you prefer, I can break
it out into a separate commit.

> > @@ -1204,156 +1214,52 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
> >
> > buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno,
> > RBM_NORMAL, vacrel->bstrategy);
> > + page = BufferGetPage(buf);
> > + vacrel->scanned_pages++;
>
> I don't particularly like doing BufferGetPage() before holding a lock on the
> page. Perhaps I'm too influenced by rust etc, but ISTM that at some point it'd
> be good to have a crosscheck that BufferGetPage() is only allowed when holding
> a page level lock.

I have occasionally wondered if the whole idea of reading heap pages
with only a pin (and having cleanup locks in VACUUM) is really worth
it -- alternative designs seem possible. Obviously that's a BIG
discussion, and not one to have right now. But it seems kind of
relevant.

Since it is often legit to read a heap page without a buffer lock
(only a pin), I can't see why BufferGetPage() without a buffer lock
shouldn't also be okay -- if anything it seems safer. I think that I
would agree with you if it wasn't for that inconsistency (which is
rather a big "if", to be sure -- even for me).

> > + /* Check for new or empty pages before lazy_scan_noprune call */
> > + if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true,
> > + vmbuffer))
> > {
> > - ReleaseBuffer(buf);
> > - vacrel->pinskipped_pages++;
> > + /* Lock and pin released for us */
> > + continue;
> > + }
>
> Why isn't this done in lazy_scan_noprune()?

No reason, really -- could be done that way (we'd then also give
lazy_scan_prune the same treatment). I thought that it made a certain
amount of sense to keep some of this in the main loop, but I can
change it if you want.

> > + if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup))
> > + {
> > + /* No need to wait for cleanup lock for this page */
> > + UnlockReleaseBuffer(buf);
> > + if (hastup)
> > + vacrel->nonempty_pages = blkno + 1;
> > continue;
> > }
>
> Do we really need all of buf, blkno, page for both of these functions? Quite
> possible that yes, if so, could we add an assertion that
> BufferGetBlockNumber(buf) == blkno?

This just matches the existing lazy_scan_prune function (which doesn't
mean all that much, since it was only added in Postgres 14). Will add
the assertion to both.

> > + /* Check for new or empty pages before lazy_scan_prune call */
> > + if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, false, vmbuffer))
> > {
>
> Maybe worth a note mentioning that we need to redo this even in the aggressive
> case, because we didn't continually hold a lock on the page?

Isn't that obvious? Either way it isn't the kind of thing that I'd try
to optimize away. It's such a narrow issue.

> > +/*
> > + * Empty pages are not really a special case -- they're just heap pages that
> > + * have no allocated tuples (including even LP_UNUSED items). You might
> > + * wonder why we need to handle them here all the same. It's only necessary
> > + * because of a rare corner-case involving a hard crash during heap relation
> > + * extension. If we ever make relation-extension crash safe, then it should
> > + * no longer be necessary to deal with empty pages here (or new pages, for
> > + * that matter).
>
> I don't think it's actually that rare - the window for this is huge.

I can just remove the comment, though it still makes sense to me.

> I don't really see that as a realistic thing to ever reliably avoid, FWIW. I
> think the overhead would be prohibitive. We'd need to do synchronous WAL
> logging while holding the extension lock I think. Um, not fun.

My long term goal for the FSM (the lease based design I talked about
earlier this year) includes soft ownership of free space from
preallocated pages by individual xacts -- the smgr layer itself
becomes transactional and crash safe (at least to a limited degree).
This includes bulk extension of relations, to make up for the new
overhead implied by crash safe rel extension. I don't think that we
should require VACUUM (or anything else) to be cool with random
uninitialized pages -- to me that just seems backwards.

We can't do true bulk extension right now (just an inferior version
that doesn't give specific pages to specific backends) because the
risk of losing a bunch of empty pages for way too long is not
acceptable. But that doesn't seem fundamental to me -- that's one of
the things we'd be fixing at the same time (through what I call soft
ownership semantics). I think we'd come out ahead on performance, and
*also* have a more robust approach to relation extension.

> > + * Caller can either hold a buffer cleanup lock on the buffer, or a simple
> > + * shared lock.
> > + */
>
> Kinda sounds like it'd be incorrect to call this with an exclusive lock, which
> made me wonder why that could be true. Perhaps just say that it needs to be
> called with at least a shared lock?

Okay.

> > +static bool
> > +lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
> > + Page page, bool sharelock, Buffer vmbuffer)
>
> It'd be good to document the return value - for me it's not a case where it's
> so obvious that it's not worth it.

Okay.

> > +/*
> > + * lazy_scan_noprune() -- lazy_scan_prune() variant without pruning
> > + *
> > + * Caller need only hold a pin and share lock on the buffer, unlike
> > + * lazy_scan_prune, which requires a full cleanup lock.
>
> I'd add somethign like "returns whether a cleanup lock is required". Having to
> read multiple paragraphs to understand the basic meaning of the return value
> isn't great.

Will fix.

> > + if (ItemIdIsRedirected(itemid))
> > + {
> > + *hastup = true; /* page won't be truncatable */
> > + continue;
> > + }
>
> It's not really new, but this comment is now a bit confusing, because it can
> be understood to be about PageTruncateLinePointerArray().

I didn't think of that. Will address it in the next version.

> Why does it make sense to track DEAD tuples this way? Isn't that going to lead
> to counting them over-and-over again? I think it's quite misleading to include
> them in "dead but not yet removable".

Compared to what? Do we really want to invent a new kind of DEAD tuple
(e.g., to report on), just to handle this rare case?

I accept that this code is lying about the tuples being RECENTLY_DEAD,
kind of. But isn't it still strictly closer to the truth, compared to
HEAD? Counting it as RECENTLY_DEAD is far closer to the truth than not
counting it at all.

Note that we don't remember LP_DEAD items here, either (not here, in
lazy_scan_noprune, and not in lazy_scan_prune on HEAD). Because we
pretty much interpret LP_DEAD items as "future LP_UNUSED items"
instead -- we make a soft assumption that we're going to go on to mark
the same items LP_UNUSED during a second pass over the heap. My point
is that there is no natural way to count "fully DEAD tuple that
autovacuum didn't deal with" -- and so I picked RECENTLY_DEAD.

> > + /*
> > + * Now save details of the LP_DEAD items from the page in the dead_tuples
> > + * array iff VACUUM uses two-pass strategy case
> > + */
>
> Do we really need to have separate code for this in lazy_scan_prune() and
> lazy_scan_noprune()?

There is hardly any repetition, though.

> > + }
> > + else
> > + {
> > + /*
> > + * We opt to skip FSM processing for the page on the grounds that it
> > + * is probably being modified by concurrent DML operations. Seems
> > + * best to assume that the space is best left behind for future
> > + * updates of existing tuples. This matches what opportunistic
> > + * pruning does.
>
> Why can we assume that there concurrent DML rather than concurrent read-only
> operations? IME it's much more common for read-only operations to block
> cleanup locks than read-write ones (partially because the frequency makes it
> easier, partially because cursors allow long-held pins, partially because the
> EXCLUSIVE lock of a r/w operation wouldn't let us get here)

I actually agree. It still probably isn't worth dealing with the FSM
here, though. It's just too much mechanism for too little benefit in a
very rare case. What do you think?

--
Peter Geoghegan
