Re: 64-bit XIDs in deleted nbtree pages

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: 64-bit XIDs in deleted nbtree pages
Date: 2021-02-14 06:47:13
Message-ID: CAH2-Wzk76_P=67iUscb1UN44-gyZL-KgpsXbSxq_bdcMa7Q+wQ@mail.gmail.com

On Fri, Feb 12, 2021 at 9:04 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Fri, Feb 12, 2021 at 8:38 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > I agree that there already are huge problems in that case. But I think
> > we need to consider an append-only case as well; after bulk deletion
> > on an append-only table, vacuum deletes heap tuples and index tuples,
> > marking some index pages as dead and setting an XID into btpo.xact.
> > Since we trigger autovacuums even by insertions based on
> > autovacuum_vacuum_insert_scale_factor/threshold autovacuum will run on
> > the table again. But if there is a long-running query a "wasted"
> > cleanup scan could happen many times depending on the values of
> > autovacuum_vacuum_insert_scale_factor/threshold and
> > vacuum_cleanup_index_scale_factor. This should not happen in the old
> > code. I agree this is a DBA problem but it also means this could bring
> > another new problem in a long-running query case.
>
> I see your point.

My guess is that this concern of yours is somehow related to how we do
deletion and recycling *in general*. Currently (and even in v3 of the
patch), we assume that recycling the pages that a VACUUM operation
deletes will happen "eventually". That kind of makes sense when you
have "typical vacuuming" -- a steady stream of deletes/updates, no big
bursts, only rare bulk deletes, etc.

But when you do have a mixture of different triggering conditions,
which is quite possible, it is difficult to understand what
"eventually" actually means...

> BTW, I am thinking about making recycling take place for pages that
> were deleted during the same VACUUM. We can just use a
> work_mem-limited array to remember a list of blocks that are deleted
> but not yet recyclable (plus the XID found in the block).

...which brings me back to this idea.
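To make the idea concrete, here is a minimal standalone sketch of the
bookkeeping I have in mind. All of the names are hypothetical, and the
usual PostgreSQL typedefs are stubbed out locally so that it compiles
on its own:

#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the usual PostgreSQL typedefs */
typedef uint32_t BlockNumber;
typedef uint32_t TransactionId;

/* One page this VACUUM deleted but could not recycle right away */
typedef struct BTPendingPage
{
    BlockNumber blkno;      /* block number of the deleted page */
    TransactionId xact;     /* XID stamped in btpo.xact at deletion */
} BTPendingPage;

/* work_mem-limited array of such pages, filled during btvacuumscan() */
typedef struct BTPendingList
{
    BTPendingPage *pages;
    int         npages;
    int         maxpages;   /* cap derived from work_mem */
} BTPendingList;

/*
 * Remember a page we just deleted.  Returns false once the memory
 * budget is exhausted, in which case the page is simply left for a
 * future VACUUM -- the old behavior, now only a fallback.
 */
static bool
pending_list_add(BTPendingList *list, BlockNumber blkno, TransactionId xact)
{
    if (list->npages >= list->maxpages)
        return false;
    list->pages[list->npages].blkno = blkno;
    list->pages[list->npages].xact = xact;
    list->npages++;
    return true;
}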

I've prototyped this. It works really well. In most cases the
prototype makes VACUUM operations with nbtree index page deletions
also recycle the pages that were deleted, at the end of the
btvacuumscan(). We do very little or no "indefinite deferring" work
here. This has obvious advantages, of course, but it also has a
non-obvious advantage: the awkward question of what "eventually"
actually means with mixed triggering conditions over time mostly goes
away. So perhaps this actually addresses your concern,
Masahiko.
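
Continuing the sketch from above, the pass at the end of
btvacuumscan() could look something like this. (xid_precedes() is a
simplified stand-in for TransactionIdPrecedes() that ignores
wraparound, and fsm_release() stands in for RecordFreeIndexPage();
this is just the shape of the thing, not the prototype itself.)

/* Simplified stand-in for TransactionIdPrecedes() (no wraparound) */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return a < b;
}

/*
 * At the end of btvacuumscan(): try to place the pages that this same
 * VACUUM deleted into the FSM.  Anything still unsafe is left for a
 * future VACUUM, exactly as before -- but now that should be rare.
 */
static void
recycle_own_deletions(BTPendingList *list,
                      TransactionId oldest_running_xid,
                      void (*fsm_release) (BlockNumber blkno))
{
    for (int i = 0; i < list->npages; i++)
    {
        /* Safe once no running xact could still hold a stale link */
        if (xid_precedes(list->pages[i].xact, oldest_running_xid))
            fsm_release(list->pages[i].blkno);
    }
}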

I've been testing this with BenchmarkSQL [1], which has several
indexes that regularly need page deletions. There is also a realistic
"life cycle" to the data in these indexes. I added custom
instrumentation to display information about what's going on with page
deletion when the benchmark is run. I wrote a quick-and-dirty patch
that makes log_autovacuum show the same information that you see about
index page deletion when VACUUM VERBOSE is run (including the new
pages_newly_deleted field from my patch). With this particular
TPC-C/BenchmarkSQL workload, VACUUM seems to consistently manage to
place every page that it deletes in the FSM without leaving anything
for the next VACUUM. There are a very small number of
exceptions where we "only" manage to recycle maybe 95% of the pages
that were deleted.

The race condition that nbtree avoids by deferring recycling was
always a narrow one, outside of the extremes -- the way we defer has
always been overkill. It's almost always unnecessary to delay placing
deleted pages in the FSM until the *next* VACUUM. We only have to
delay it until the end of the *same* VACUUM -- why wait until the next
VACUUM if we don't have to? In general this business of deferring
recycling has nothing to do with MVCC/GC/whatever, and yet the code
seems to suggest that it does. While it is convenient to use an XID
for page deletion and recycling as a way of implementing what Lanin &
Shasha call "the drain technique" [2], all we have to do is prevent
certain race conditions. This is all about the index itself, the data
structure, how it is maintained -- nothing more. It almost seems
obvious to me.
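
To be clear about how little "the drain technique" amounts to in the
code today: the whole test is essentially just this (quoting the
Postgres 13 version of _bt_page_recyclable(); on master the
RecentGlobalXmin comparison has since become a
GlobalVisCheckRemovableXid() call):

bool
_bt_page_recyclable(Page page)
{
    BTPageOpaque opaque;

    /* An all-zeroes page can be reclaimed immediately */
    if (PageIsNew(page))
        return true;

    /*
     * Otherwise, recycle if deleted and too old to have any
     * processes interested in it
     */
    opaque = (BTPageOpaque) PageGetSpecialPointer(page);
    if (P_ISDELETED(opaque) &&
        TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin))
        return true;
    return false;
}

An XID comparison happens to be a convenient way to spell "no scan can
still hold a stale link to this page" -- that's all.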

It's still possible to imagine extremes. Extremes that even the "try
to recycle pages we ourselves deleted when we reach the end of
btvacuumscan()" version of my patch cannot deal with. Maybe it really
is true that it's inherently impossible to recycle a deleted page even
at the end of a VACUUM -- maybe a long-running transaction (that could
in principle have a stale link to our deleted page) starts before we
VACUUM, and is still running after VACUUM finishes. So it's just not
safe. When
that happens, we're back to having the original problem: we're relying
on some *future* VACUUM operation to do that for us at some indefinite
point in the future. It's fair to wonder: What are the implications of
that? Are we not back to square one? Don't we have the same "what does
'eventually' really mean" problem once again?

I think that that's okay, because this remaining case is a *truly*
extreme case (especially with a large index, where index vacuuming
will naturally take a long time).

It will be rare. But more importantly, the fact that this scenario is now
an extreme case justifies treating it as an extreme case. We can teach
_bt_vacuum_needs_cleanup() to recognize it as an extreme case, too. In
particular, I think that it will now be okay to increase the threshold
applied when considering deleted pages inside
_bt_vacuum_needs_cleanup(). It was 2.5% of the index size in v3 of the
patch. But in v4, which has the new recycling enhancement, I think
that it would be sensible to make it 5%, or maybe even 10%. This
naturally makes Masahiko's problem scenario unlikely to actually
result in a truly wasted call to btvacuumscan(). The number of pages
that the metapage indicates are "deleted but not yet placed in the
FSM" will be close to the theoretical minimum, because we're no longer
naively throwing away information about which specific pages will be
recyclable soon -- which is what the current approach does, really.
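
A sketch of what the v4 threshold test might look like (hypothetical
names -- the only point here is the fraction):

/* Proposed for v4: up from 0.025 in v3 of the patch */
#define BTREE_CLEANUP_DELETED_FRACTION 0.05

/*
 * Hypothetical helper for _bt_vacuum_needs_cleanup(): force a
 * cleanup-only scan only when the number of deleted-but-not-yet-
 * recycled pages is a sizable fraction of the whole index.
 */
static bool
deleted_pages_need_cleanup(BlockNumber num_index_pages,
                           BlockNumber num_deleted_pages)
{
    return num_deleted_pages >
        (BlockNumber) (num_index_pages * BTREE_CLEANUP_DELETED_FRACTION);
}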

[1] https://github.com/wieck/benchmarksql
[2] https://archive.org/stream/symmetricconcurr00lani#page/8/mode/2up
-- see "2.5 Freeing Empty Nodes"
--
Peter Geoghegan
