Re: 64-bit XIDs in deleted nbtree pages

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: 64-bit XIDs in deleted nbtree pages
Date: 2021-02-16 03:26:03
Message-ID: CAH2-WzkP=9CUYmbCi7Yzopsx380kaPHi686Sy22Skhwv4nCfOg@mail.gmail.com

On Mon, Feb 15, 2021 at 3:15 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> Yes. I think this would simplify the problem by resolving almost all
> problems related to indefinitely deferring page recycling.
>
> We will be able to recycle almost all just-deleted pages in practice,
> especially when btvacuumscan() takes a long time. And there would not
> be a noticeable downside, I think.

Great!

> BTW if btree index starts to use maintenance_work_mem for this purpose,
> we also need to set amusemaintenanceworkmem to true, which is
> considered during parallel vacuum.

I was just going to use work_mem. This should be okay. Note that
CREATE INDEX uses an additional work_mem allocation when building a
unique index, for the second spool/tuplesort. That seems like a
precedent that I can follow here.

Right now the BTPendingRecycle structs that the patch uses to store
information about pages that the current VACUUM deleted (and may yet be
able to place in the FSM) are each 16 bytes (including alignment
overhead). I could probably make them smaller with a little work, but
even now that's quite small. Even with the default 4MiB work_mem
setting we can fit information about 262144 pages at once. That's 2GiB
worth of deleted index pages, which is generally much more than we'll
need.
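
To make the arithmetic concrete, here is a tiny standalone sketch. The
struct layout and field names are made up for illustration (the real
BTPendingRecycle in the patch may differ), but the sizes work out the
same way on a typical 64-bit platform:

#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical BTPendingRecycle layout: a 64-bit XID plus a block
 * number.  The 8-byte alignment of the XID pads the struct to 16 bytes.
 */
typedef struct BTPendingRecycleSketch
{
	uint64_t	safexid;	/* 64-bit XID from the deleted page */
	uint32_t	target;		/* block number of the deleted page */
} BTPendingRecycleSketch;

int
main(void)
{
	size_t		work_mem_bytes = 4UL * 1024 * 1024;	/* default 4MiB work_mem */
	size_t		nentries = work_mem_bytes / sizeof(BTPendingRecycleSketch);

	printf("entry size: %zu bytes\n", sizeof(BTPendingRecycleSketch));	/* 16 */
	printf("entries in 4MiB: %zu\n", nentries);		/* 262144 */
	printf("deleted pages covered: %zu GiB\n",
		   nentries * 8192 / (1024UL * 1024 * 1024));	/* 2, with 8KiB pages */
	return 0;
}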

> Yeah, increasing the threshold would solve the problem in most cases.
> Given that nbtree index page deletion is unlikely to happen in
> practice, having the threshold at 5% or 10% seems to avoid the problem
> in nearly 100% of cases, I think.

Of course it all depends on workload/index characteristics, in the
end. It is very rare to delete a percentage of the index that exceeds
autovacuum_vacuum_scale_factor -- that's the important thing here IMV.

> Another idea I come up with (maybe on top of above your idea) is to
> change btm_oldest_btpo_xact to 64-bit XID and store the *newest*
> btpo.xact XID among all deleted pages when the total amount of deleted
> pages exceeds 2% of index. That way, we surely can recycle more than
> 2% of index when the XID becomes older than the global xmin.

You could make my basic approach to recycling deleted pages earlier
(ideally at the end of the same btvacuumscan() that deleted the pages
in the first place) more sophisticated in a variety of ways. These are
all subject to diminishing returns, though.

I've already managed to recycle close to 100% of all deleted B-Tree
pages during the same VACUUM with a very simple approach -- at least
if we assume that BenchmarkSQL is representative. It is hard to know how much
more effort can be justified. To be clear, I'm not saying that an
improved design cannot be justified now or in the future (BenchmarkSQL
is not like many workloads that people use Postgres for). I'm just
saying that I *don't know* where to draw the line. Any particular
place that we draw the line feels a little arbitrary to me. This
includes my own choice of the work_mem-limited BTPendingRecycle array.
My patch currently works that way because it's simple -- no other
reason.

Any scheme to further improve the "work_mem-limited BTPendingRecycle
array" design from my patch boils down to this: a new approach that
makes recycling of any remaining deleted pages take place "before too
long" -- after the end of the btvacuumscan() BTPendingRecycle array
processing (which presumably didn't work out in the cases where an
improved approach matters), but before the next VACUUM takes place
(since that will do the required recycling anyway, unless it's unable
to do any work at all, in which case it hardly matters). Here are two
ideas of my own in the same class as your idea:

1. Remember to do some of the BTPendingRecycle array FSM processing
stuff in btvacuumcleanup() -- defer some of the recycling of pages
recorded in BTPendingRecycle entries (pages deleted during
btbulkdelete() for the same VACUUM) until btvacuumcleanup() is called
(rough sketch below).

Right now btvacuumcleanup() will always do nothing when btbulkdelete()
was called earlier. But that's just a current nbtree convention, and no
reason not to do this (we don't have to scan the index again at all).
The advantage of teaching btvacuumcleanup() to do this is that it
delays the "BTPendingRecycle array FSM processing" stuff until the
last moment that it is still easy to use the in-memory array (because
we haven't freed it yet). In general, doing it later makes it more
likely that we'll successfully recycle the pages. Though it might not
make any difference in practice -- so we're still hoping that the
workload allows us to recycle everything we deleted, without making
the design much more complicated than what I posted already.

(BTW I see that you reviewed commit 4e514c61, so you must have thought
about the trade-off between doing deferred recycling in
amvacuumcleanup() vs ambulkdelete(), when to call
IndexFreeSpaceMapVacuum(), etc. But there is no reason why we cannot
implement this idea while calling IndexFreeSpaceMapVacuum() during
both btvacuumcleanup() and btbulkdelete(), so that we get the best of
both worlds -- fast recycling *and* more delayed processing that is
more likely to ultimately succeed.)
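
To be clear about what I mean by idea 1, here is a rough sketch. Only
RecordFreeIndexPage() and IndexFreeSpaceMapVacuum() are real functions
here -- the struct, the helper, and how the "safe limit" gets computed
are all made up for illustration:

#include "postgres.h"

#include "access/transam.h"
#include "storage/block.h"
#include "storage/indexfsm.h"
#include "utils/rel.h"

/* Made-up entry recorded for each page deleted by the current VACUUM */
typedef struct BTPendingRecycleSketch
{
	FullTransactionId safexid;	/* recycling is safe once this is old enough */
	BlockNumber target;			/* block number of the deleted page */
} BTPendingRecycleSketch;

/*
 * Hypothetical helper called from btvacuumcleanup(), assuming that
 * btbulkdelete() stashed its BTPendingRecycle array instead of freeing
 * it.  Rechecking at this later point makes it more likely that each
 * entry's safexid has become old enough.
 */
static void
_bt_pending_recycle_finalize(Relation rel, BTPendingRecycleSketch *pending,
							 int npending, FullTransactionId safelimit)
{
	bool		recorded = false;

	for (int i = 0; i < npending; i++)
	{
		/* Safe to recycle now?  Then tell the FSM about the page. */
		if (FullTransactionIdPrecedes(pending[i].safexid, safelimit))
		{
			RecordFreeIndexPage(rel, pending[i].target);
			recorded = true;
		}
	}

	/* Make the newly recorded free pages visible to _bt_getbuf() */
	if (recorded)
		IndexFreeSpaceMapVacuum(rel);
}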

2. Remember/serialize the BTPendingRecycle array when we realize that
we cannot yet put all of the deleted pages in the FSM at the end of the
current btvacuumscan(), and then use an autovacuum work item to
process them before too long -- a call to AutoVacuumRequestWork()
could even serialize the data on disk.

Idea 2 has the advantage of allowing retries -- eventually it will be
safe to recycle the pages, if we just wait long enough.
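
Something along these lines, perhaps. AutoVacuumRequestWork() is real
(BRIN already uses it for AVW_BRINSummarizeRange), but
AVW_BTRecycleDeletedPage is a made-up work item type, and the side
that would actually consume the work items isn't shown:

#include "postgres.h"

#include "postmaster/autovacuum.h"
#include "storage/block.h"
#include "utils/rel.h"

/*
 * Hypothetical: at the end of btvacuumscan(), request an autovacuum
 * work item for each deleted page that could not be placed in the FSM
 * yet.  An existing work item only carries (type, relation OID, block
 * number), so anything richer -- like the page's safexid -- would have
 * to be reread from the deleted page itself, or serialized separately
 * on disk.
 */
static void
_bt_request_deferred_recycle(Relation rel, BlockNumber *leftover,
							 int nleftover)
{
	for (int i = 0; i < nleftover; i++)
		AutoVacuumRequestWork(AVW_BTRecycleDeletedPage,		/* made up */
							  RelationGetRelid(rel), leftover[i]);
}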

Anyway, I'm probably not going to pursue either of these two ideas for
Postgres 14. I'm mentioning them now because the trade-offs show that
there is no perfect design for this deferred recycling business --
whatever we do, we should accept that.

Actually, there is one more reason why I bring up idea 1 now: I want
to hear your thoughts on the index AM API questions that it touches
on. Ideally all of the details around the index AM VACUUM APIs (i.e.
when and where the extra work happens -- btvacuumcleanup() vs
btbulkdelete()) won't need to change much in the future. I worry about
getting this index AM API stuff right, at least a little.

> Also, maybe we can record deleted pages in the FSM even without
> deferring, and check them when reusing. That is, when we get a free
> page from the FSM we check whether the page is really recyclable
> (maybe _bt_getbuf() already does this?). IOW, a deleted page can be
> recycled only when it's requested to be reused. If btpo.xact is a
> 64-bit XID we never need to worry about the case where a deleted page
> is never requested to be reused.

I've thought about that too (both now and in the past). You're right
about _bt_getbuf() -- it checks the XID, at least on the master
branch. I took that XID check out in v4 of the patch, but I am now
starting to have my doubts about that particular choice. (I'm probably
going to restore the XID check in _bt_getbuf in v5 of the patch.)
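
For reference, the master-branch check boils down to something like
this. This is paraphrased from memory rather than copied from
nbtpage.c, and the xmin-horizon test is passed in as a plain parameter
to keep the sketch simple:

#include "postgres.h"

#include "access/nbtree.h"
#include "access/transam.h"
#include "storage/bufpage.h"

/*
 * Paraphrase of the recyclability test that _bt_getbuf() applies to a
 * page it pulled from the FSM: all-zeroes pages are fair game, and
 * deleted pages are fair game once their btpo.xact is older than the
 * oldest XID that could still care about the page.
 */
static bool
bt_page_recyclable_sketch(Page page, TransactionId oldest_interesting_xid)
{
	BTPageOpaque opaque;

	/* An all-zeroes page (e.g. left behind by a crashed extension) is fine */
	if (PageIsNew(page))
		return true;

	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
	return P_ISDELETED(opaque) &&
		TransactionIdPrecedes(opaque->btpo.xact, oldest_interesting_xid);
}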

I took the XID-is-recyclable check out in v4 of the patch because it
might leak pages in rare cases -- which is not a new problem.
_bt_getbuf() currently has a remarkably relaxed attitude about leaking
pages from the FSM (it is more relaxed about it than I am, certainly)
-- but why should we just accept leaking pages like that? My new
doubts about it are non-specific, though. We know that the FSM isn't
crash safe -- but I think that that reduces to "practically speaking,
we can never 100% trust the FSM". Which makes me nervous. I worry that
the FSM can do something completely evil and crazy in rare cases.

It's not just crash safety. The FSM's fsm_search_avail() function
currently changes the fp_next_slot field with only a shared buffer
lock held. It's an int, which is supposed to "be atomic on most
platforms". But we should be using real atomic ops. So the FSM is
generally...kind of wonky.
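
Just to illustrate what I mean by real atomic ops, a sketch only --
fp_next_slot actually lives on the FSM page itself, so changing its
type would have on-disk implications that I'm ignoring here:

#include "postgres.h"

#include "port/atomics.h"

/* Hypothetical atomic replacement for the plain-int fp_next_slot hint */
typedef struct FSMSearchHintSketch
{
	pg_atomic_uint32 next_slot;
} FSMSearchHintSketch;

static int
fsm_get_search_hint(FSMSearchHintSketch *hint)
{
	/* Plain "return fsmpage->fp_next_slot" becomes an atomic read */
	return (int) pg_atomic_read_u32(&hint->next_slot);
}

static void
fsm_set_search_hint(FSMSearchHintSketch *hint, int slot)
{
	/* Plain "fsmpage->fp_next_slot = slot + 1" becomes an atomic write */
	pg_atomic_write_u32(&hint->next_slot, (uint32) (slot + 1));
}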

In an ideal world, nbtree page deletion + recycling would have crash
safety built in. I don't think that it makes sense for index AM free
space management to lack crash safety -- it's just not worth skipping
with whole-page units of free space (heapam is another story). A 100%
crash-safe design would naturally shift the problem of nbtree page
recycle safety from the producer/VACUUM side to the
consumer/_bt_getbuf() side, which I agree would be a real improvement.
But these long-standing FSM issues are not going to change for
Postgres 14. And so changing _bt_getbuf() to do clever things with
XIDs won't be possible for Postgres 14 IMV.

--
Peter Geoghegan
