Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Date: 2022-01-30 04:42:36
Message-ID: CAH2-Wz=iLnf+0CsaB37efXCGMRJO1DyJw5HMzm7tp1AxG1NR2g@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 20, 2022 at 2:00 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> I do see some value in that, too. Though it's not going to be a way of
> turning off the early freezing stuff, which seems unnecessary (though
> I do still have work to do on getting the overhead for that down).

Attached is v7, a revision that overhauls the algorithm that decides
what to freeze. I'm now calling it block-driven freezing in the commit
message. Also included is a new patch that makes VACUUM record zero
free space in the FSM for an all-visible page, unless the total amount
of free space happens to be greater than one half of BLCKSZ.

The fact that I am now including this new FSM patch (v7-0006-*patch)
may seem like a case of expanding the scope of something that could
well do without it. But hear me out! It's true that the new FSM patch
isn't essential. I'm including it now because it seems relevant to the
approach taken with block-driven freezing -- it may even make my
general approach easier to understand. The new approach to freezing is
to freeze every tuple on a block that is about to be set all-visible
(and thus set it all-frozen too), or to not freeze anything on the
page at all (at least until one XID gets really old, which should be
rare). This approach has all the benefits that I described upthread,
and a new benefit: it effectively encourages the application to allow
pages to "become settled".

The main difference in how we freeze here (relative to v6 of the
patch) is that I'm *not* freezing a page just because it was
dirtied/pruned. I now think about freezing as an essentially
page-level thing, barring edge cases where we have to freeze
individual tuples simply because their XIDs really are getting old
(having to freeze some tuples but not others, due to a mix of new and
old XIDs on one page, is now an edge case precisely because it's
something we specifically set out to avoid).
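
To make that rule concrete, here's a minimal sketch of the per-tuple
decision. To be clear, this is just an illustration of the policy as
described above, not the patch code -- the function name and
parameters are made up, and FreezeLimit stands in for whatever
"really old" XID cutoff ends up applying:

    #include "postgres.h"
    #include "access/transam.h"

    /*
     * Sketch only: freeze every tuple on a page that is about to be set
     * all-visible (so the page can be set all-frozen in the same pass).
     * Otherwise freeze a tuple only when its XID has crossed the "really
     * old" cutoff, and leave the rest of the page alone for now.
     */
    static bool
    should_freeze_tuple(bool page_will_become_all_visible,
                        TransactionId tuple_xid,
                        TransactionId FreezeLimit)
    {
        if (page_will_become_all_visible)
            return true;    /* block-driven: freeze the whole page together */

        if (TransactionIdPrecedes(tuple_xid, FreezeLimit))
            return true;    /* edge case: this XID really is getting old */

        return false;       /* don't freeze anything on the page yet */
    }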

Freezing whole pages
====================

When VACUUM sees that all remaining/unpruned tuples on a page are
all-visible, that isn't just important for cost control reasons. It's
deeper than that. It's also treated as a tentative signal from the
application itself, about the data itself. Which is: this page looks
"settled" -- it may never be updated again, and if there is an update
it likely won't change too much about the whole page. Also, if the
page is ever updated in the future, that will likely happen at a much
later time than you should expect for *other* nearby pages that
*don't* appear to be settled. And so VACUUM infers that the page is
*qualitatively* different from these other nearby pages. VACUUM
therefore makes it hard (though not impossible) for future inserts or
updates to disturb these settled pages, via this FSM behavior -- it is
short-sighted to just see the space remaining on the page as free
space, equivalent to any other. This holistic approach seems to work
well for TPC-C/BenchmarkSQL, and perhaps even in general. More on
TPC-C below.

This is not unlike the approach taken by other DB systems, where free
space management is baked into concurrency control, and the concept of
physical data independence as we know it from Postgres never really
existed. My approach also seems related to the concept of a "tenured
generation", which is key to generational garbage collection. The
whole basis of generational garbage collection is the generational
hypothesis: "most objects die young". This is an empirical observation
about applications written in GC'd programming languages actually
behave, not a rigorous principle, and yet in practice it appears to
always hold. Intuitively, it seems to me like the hypothesis must work
in practice because if it didn't then a counterexample nemesis
application's behavior would be totally chaotic, in every way.
Theoretically possible, but of no real concern, since the program
makes zero practical sense *as an actual program*. A Java program must
make sense to *somebody* (at least the person that wrote it), which,
it turns out, helpfully constrains the space of possibilities that any
industrial strength GC implementation needs to handle well.

The same principles seem to apply here, with VACUUM. Grouping logical
rows into pages that become their "permanent home until further
notice" may be somewhat arbitrary, at first, but that doesn't mean it
won't end up sticking. It's just like generational garbage collection,
where the application isn't expected to instruct the GC about its
plans for the memory it allocates, and yet that memory can
nevertheless be usefully organized into distinct generations through
an adaptive process.

Second order effects
====================

Relating the FSM to page freezing/all-visible setting makes much more
sense if you consider the second order effects.

There is bound to be competition for free space among backends that
access the free space map. By *not* freezing a page during VACUUM
because it looks unsettled, we make its free space available in the
traditional way instead. It follows that unsettled pages (in tables
with lots of updates) are now the only place that backends that need
more free space from the FSM can look -- unsettled pages therefore
become a hot commodity, freespace-wise. A page that initially appeared
"unsettled", but that went on to become settled in this newly
competitive environment, might have done so by pure chance -- but
probably not. It *could* happen by chance, of course -- in which case
the page will get dirtied again, and the cycle continues, for now.
There will be
further opportunities to figure it out, and freezing the tuples on the
page "prematurely" still has plenty of benefits.

Locality matters a lot, obviously. The goal with the FSM stuff is
merely to make it *possible* for pages to settle naturally, to the
extent that we can. We really just want to avoid hindering a naturally
occurring process -- we want to avoid destroying naturally occurring
locality. We must be willing to accept some cost for that. Even if it
takes a few attempts for certain pages, constraining the application's
choice of where to get free space from (can't be a page marked
all-visible) allows pages to *systematically* become settled over
time.

The application is in charge, really -- not VACUUM. This is already
the case, whether we like it or not. VACUUM needs to learn to live in
that reality, rather than fighting it. When VACUUM considers a page
settled, and the physical page still has a relatively large amount of
free space (say 45% of BLCKSZ, a borderline case in the new FSM
patch), "losing" so much free space certainly is unappealing. We set
the free space to 0 in the free space map all the same, because we're
cutting our losses at that point. While the exact threshold I've
proposed is tentative, the underlying theory seems pretty sound to me.
The BLCKSZ/2 cutoff (and the way that it extends the general rules for
whole-page freezing) is intended to catch pages that are qualitatively
different, as well as quantitatively different. It is a balancing act,
between not wasting space, and the risk of systemic problems involving
excessive amounts of non-HOT updates that must move a successor
version to another page.

It's possible that a higher cutoff (for example a cutoff of 80% of
BLCKSZ, not 50%) will actually lead to *worse* space utilization, in
addition to the downsides from fragmentation -- it's far from a simple
trade-off. (Not that you should believe that 50% is special, it's just
a starting point for me.)
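
For illustration, here's roughly what that FSM heuristic amounts to
when VACUUM records a heap page's free space. This is a sketch under
my own naming (record_page_freespace, set_all_visible), not the
v7-0006 patch verbatim, using the tentative BLCKSZ/2 cutoff from
above:

    #include "postgres.h"
    #include "storage/bufpage.h"
    #include "storage/freespace.h"

    /*
     * Sketch only: record a heap page's free space during VACUUM's
     * per-page processing.  A page that is being set all-visible and has
     * no more than BLCKSZ/2 of free space is treated as "settled", and
     * advertises zero free space so that new tuples get steered elsewhere.
     */
    static void
    record_page_freespace(Relation rel, BlockNumber blkno, Page page,
                          bool set_all_visible)
    {
        Size        freespace = PageGetHeapFreeSpace(page);

        if (set_all_visible && freespace <= BLCKSZ / 2)
            RecordPageWithFreeSpace(rel, blkno, 0);     /* hide from the FSM */
        else
            RecordPageWithFreeSpace(rel, blkno, freespace);
    }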

TPC-C
=====

I'm going to talk about a benchmark that ran throughout the week,
starting on Monday. Each run lasted 24 hours, and there were 2 runs in
total for both the patch and master/baseline. So this benchmark lasted
4 days, not including the initial bulk loading, with databases that
were over 450GB in size by the time I was done (that's 450GB+ for both
the patch and master). Benchmarking for days at a time is pretty
inconvenient, but it seems necessary to see certain effects in play.
We need to wait until the baseline/master case starts to have
anti-wraparound VACUUMs with default, realistic settings, which just
takes days and days.

I've made all of my data for the benchmark in question available,
which is way more information than anybody is likely to want -- I dump
anything that might even be useful from the system views in an
automated way. There are html reports for all four 24-hour-long runs.
Google drive link:

https://drive.google.com/drive/folders/1A1g0YGLzluaIpv-d_4o4thgmWbVx3LuR?usp=sharing

While the patch did well overall, and I will get to the particulars
towards the end of the email, I want to start with what I consider to
be the important part: the user/admin experience with VACUUM, and
VACUUM's performance stability. This is about making VACUUM less
scary.

As I've said several times now, with an append-only table like
pgbench_history we see a consistent pattern where relfrozenxid is set
to a value very close to the same VACUUM's OldestXmin value (even
precisely equal to OldestXmin) during each VACUUM operation, again and
again, forever -- that case is easy to understand and appreciate, and
has already been discussed. Now (with v7's new approach to freezing),
a related pattern can be seen in the case of the two big, troublesome
TPC-C tables, the orders and order lines tables.

To recap, these tables are somewhat like the history table, in that
new orders are inserted into both tables, again and again, forever.
But they also have one huge difference from simple append-only tables,
which is the source of most of our problems with TPC-C. The difference is:
there are also delayed, correlated updates of each row from each
table. Exactly one such update per row for both tables, which takes
place hours after each order's insert, when the earlier order is
processed by TPC-C's delivery transaction. In the long run we need the
data to age out and not get re-dirtied, as the table grows and grows
indefinitely, much like with a simple append-only table. At the same
time, we don't want to have poor free space management for these
deferred updates. It's adversarial, sort of, but in a way that is
grounded in reality.

With the orders and order lines tables, relfrozenxid tends to be
advanced up to the OldestXmin used by the *previous* VACUUM operation
-- an unmistakable pattern. I'll show you all of the autovacuum log
output for the orders table during the second 24 hour long benchmark
run:

2022-01-27 01:46:27 PST LOG: automatic vacuum of table
"regression.public.bmsql_oorder": index scans: 1
pages: 0 removed, 1205349 remain, 887225 skipped using visibility map
(73.61% of total)
tuples: 253872 removed, 134182902 remain (26482225 newly frozen),
27193 are dead but not yet removable
removable cutoff: 243783407, older by 728844 xids when operation ended
new relfrozenxid: 215400514, which is 26840669 xids ahead of previous value
...
2022-01-27 05:54:39 PST LOG: automatic vacuum of table
"regression.public.bmsql_oorder": index scans: 1
pages: 0 removed, 1345302 remain, 993924 skipped using visibility map
(73.88% of total)
tuples: 261656 removed, 150022816 remain (29757570 newly frozen),
29216 are dead but not yet removable
removable cutoff: 276319403, older by 826850 xids when operation ended
new relfrozenxid: 243838706, which is 28438192 xids ahead of previous value
...
2022-01-27 10:37:24 PST LOG: automatic vacuum of table
"regression.public.bmsql_oorder": index scans: 1
pages: 0 removed, 1504707 remain, 1110002 skipped using visibility map
(73.77% of total)
tuples: 316086 removed, 167990124 remain (33754949 newly frozen),
33326 are dead but not yet removable
removable cutoff: 313328445, older by 987732 xids when operation ended
new relfrozenxid: 276309397, which is 32470691 xids ahead of previous value
...
2022-01-27 15:49:51 PST LOG: automatic vacuum of table
"regression.public.bmsql_oorder": index scans: 1
pages: 0 removed, 1680649 remain, 1250525 skipped using visibility map
(74.41% of total)
tuples: 343946 removed, 187739072 remain (37346315 newly frozen),
38037 are dead but not yet removable
removable cutoff: 354149019, older by 1222160 xids when operation ended
new relfrozenxid: 313332249, which is 37022852 xids ahead of previous value
...
2022-01-27 21:55:34 PST LOG: automatic vacuum of table
"regression.public.bmsql_oorder": index scans: 1
pages: 0 removed, 1886336 remain, 1403800 skipped using visibility map
(74.42% of total)
tuples: 389748 removed, 210899148 remain (43453900 newly frozen),
45802 are dead but not yet removable
removable cutoff: 401955979, older by 1458514 xids when operation ended
new relfrozenxid: 354134615, which is 40802366 xids ahead of previous value

This mostly speaks for itself, I think. (Anybody that's interested can
drill down to the logs for order lines, which look similar.)

The effect we see with the orders/order lines tables isn't perfectly
reliable. Actually, it depends on how you define it. It's possible
that we'll fail to acquire a cleanup lock on the wrong page at the
wrong time, and as a result fail to advance relfrozenxid by the usual
amount, just that once. But that effect appears to be both rare and of no
real consequence. One could reasonably argue that we never fell
behind, because we still did 99.9%+ of the required freezing -- we
just didn't immediately get to advance relfrozenxid, because of a
temporary hiccup on one page. We will still advance relfrozenxid by a
small amount. Sometimes it'll be by only hundreds of XIDs when
millions or tens of millions of XIDs were expected. Once we advance it
by some amount, we can reasonably suppose that the issue was just a
hiccup.

On the master branch, the first 24 hour period has no anti-wraparound
VACUUMs, and so looking at that first 24 hour period gives you some
idea of how much worse off we are in the short term -- the freezing
stuff won't really start to pay for itself until the second 24 hour
run with these mostly-default freeze related settings. The second 24
hour run on master almost exclusively has anti-wraparound VACUUMs for
all the largest tables, though -- all at the same time. And not just
the first time, either! This causes big spikes that the patch totally
avoids, simply by avoiding anti-wraparound VACUUMs. With the patch,
there are no anti-wraparound VACUUMs at all, barring tables that will
never be vacuumed for any other reason, where anti-wraparound VACUUMs
remain inevitable -- in practice just the stock and customers tables.

It was a mistake for me to emphasize "no anti-wraparound VACUUMs
outside pathological cases" before now. I stand by those statements as
accurate, but anti-wraparound VACUUMs should not have been given so
much emphasis. Let's assume that somehow we really were to get an
anti-wraparound VACUUM against one of the tables where that's just not
expected, like this orders table -- let's suppose that I got that part
wrong, in some way. It would hardly matter at all! We'd still have
avoided the freezing cliff during this anti-wraparound VACUUM, which
is the real benefit. Chances are good that we needed to VACUUM anyway,
just to clean up any very old garbage tuples -- relfrozenxid is now
predictive of the age of the oldest garbage tuples, which might well
have been reason enough to VACUUM on its own. The stampede of
anti-wraparound VACUUMs against multiple tables seems like it would
still be fixed, since relfrozenxid now actually tells us something
about the table (as opposed to telling us only about what the user set
vacuum_freeze_min_age to). The only concerns that this leaves for me
are all usability related, and not of primary importance (e.g. do we
really need to make anti-wraparound VACUUMs non-cancelable now?).

TPC-C raw numbers
=================

The single most important number for the patch might be the decrease
in both buffer misses and buffer hits, which I believe is caused by
the patch being able to use index-only scans much more effectively
(with modifications to BenchmarkSQL to improve the indexing strategy
[1]). This is quite clear from pg_stat_database state at the end.

Patch:

xact_commit | 440,515,133
xact_rollback | 1,871,142
blks_read | 3,754,614,188
blks_hit | 174,551,067,731
tup_returned | 341,222,714,073
tup_fetched | 124,797,772,450
tup_inserted | 2,900,197,655
tup_updated | 4,549,948,092
tup_deleted | 165,222,130

Here is the same pg_stat_database info for master:

xact_commit | 440,402,505
xact_rollback | 1,871,536
blks_read | 4,002,682,052
blks_hit | 283,015,966,386
tup_returned | 346,448,070,798
tup_fetched | 237,052,965,901
tup_inserted | 2,899,735,420
tup_updated | 4,547,220,642
tup_deleted | 165,103,426

The blks_read is x0.938 of master/baseline for the patch -- not bad.
More importantly, blks_hit is x0.616 for the patch -- quite a
significant reduction in a key cost. Note that we start to get this
particular benefit for individual read queries pretty early on --
avoiding unsetting visibility map bits like this matters right from
the start. In TPC-C terms, the ORDER_STATUS transaction will have much
lower latency, particularly tail latency, since it uses index-only
scans to good effect. There are 5 distinct transaction types from the
benchmark, and an improvement to one particular transaction type isn't
unusual -- so you often have to drill down, and look at the full html
report. The latency situation is improved across the board with the
patch, by quite a bit, especially after the second run. This server
can sustain much more throughput than the TPC-C spec formally permits,
even after I increased the benchmark's TPM rate to 10x the spec-legal
limit, so query latency is the main TPC-C metric of interest here.
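
(Spelling those ratios out from the tables above: blks_read is
3,754,614,188 / 4,002,682,052, or roughly 0.94 of master, while
blks_hit is 174,551,067,731 / 283,015,966,386, or roughly 0.62 of
master.)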

WAL
===

Then there's the WAL overhead. As with practically any workload, WAL
consumption for this workload is dominated by FPIs, despite the fact
that I've tuned checkpoints reasonably well. The patch *does* write
more WAL in the first set of runs -- it writes a total of ~3.991 TiB,
versus ~3.834 TiB for master. In other words, during the first 24 hour
run (before the trouble with the anti-wraparound freeze cliff even
begins for the master branch), the patch writes x1.040 as much WAL in
total. The good news is that the patch comes out ahead by the end,
after the second set of 24 hour runs. By the time the second run
finishes, it's 8.332 TiB of WAL total for the patch, versus 8.409 TiB
for master, putting the patch at x0.990 in the end -- a small
improvement. I believe that most of the WAL doesn't get generated by
VACUUM here anyway -- opportunistic pruning works well for this
workload.
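
(Again spelling the ratios out: 3.991 / 3.834 is roughly 1.04 for the
first set of runs, while 8.332 / 8.409 is roughly 0.99 by the end of
the second set.)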

I expect to be able to commit the first 2 patches in a couple of
weeks, since that won't need to block on making the case for the final
3 or 4 patches from the patch series. The early stuff is mostly just
refactoring work that removes needless differences between aggressive
and non-aggressive VACUUM operations. It makes a lot of sense on its
own.

[1] https://github.com/pgsql-io/benchmarksql/pull/16
--
Peter Geoghegan

Attachment Content-Type Size
v7-0004-Loosen-coupling-between-relfrozenxid-and-tuple-fr.patch application/octet-stream 30.1 KB
v7-0005-Make-block-level-characteristics-drive-freezing.patch application/octet-stream 13.1 KB
v7-0006-Add-all-visible-FSM-heuristic.patch application/octet-stream 2.5 KB
v7-0003-Consolidate-VACUUM-xid-cutoff-logic.patch application/octet-stream 11.3 KB
v7-0002-Add-VACUUM-instrumentation-for-scanned-pages-relf.patch application/octet-stream 14.6 KB
v7-0001-Simplify-lazy_scan_heap-s-handling-of-scanned-pag.patch application/octet-stream 46.4 KB
