Re: 64-bit XIDs in deleted nbtree pages

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: 64-bit XIDs in deleted nbtree pages
Date: 2021-02-18 11:13:01
Message-ID: CAD21AoA9g=+oE9xgzJVzS62oi8Rx8_zqyDQ3BkaTLt+Qd+_M1Q@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 16, 2021 at 12:26 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Mon, Feb 15, 2021 at 3:15 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > Yes. I think this would simplify the problem by resolving almost all
> > problems related to indefinitely deferring page recycling.
> >
> > We will be able to recycle almost all just-deleted pages in practice
> > especially when btvacuumscan() took a long time. And there would not
> > be a noticeable downside, I think.
>
> Great!
>
> > BTW if btree index starts to use maintenance_work_mem for this
> > purpose, we also need to set amusemaintenanceworkmem to true, which
> > is considered during parallel vacuum.
>
> I was just going to use work_mem. This should be okay. Note that
> CREATE INDEX uses an additional work_mem allocation when building a
> unique index, for the second spool/tuplesort. That seems like a
> precedent that I can follow here.
>
> Right now the BTPendingRecycle structs the patch uses to store
> information about pages that the current VACUUM deleted (and may yet
> be able to place in the FSM) are each 16 bytes (including alignment
> overhead). I could probably make them smaller with a little work, but
> even now that's quite small. Even with the default 4MiB work_mem
> setting we can fit information about 262144 pages all at once. That's
> 2GiB worth of deleted index pages, which is generally much more than
> we'll need.

Cool.
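
Just to check my understanding of the memory accounting, I imagine
each entry looks roughly like the following (my own sketch; the field
names are guesses, not necessarily what your patch uses):

    /*
     * Sketch of a per-deleted-page entry -- field names are my guesses.
     * BlockNumber comes from storage/block.h, FullTransactionId from
     * access/transam.h.
     */
    typedef struct BTPendingRecycle
    {
        BlockNumber       blkno;    /* deleted page's block number (4 bytes) */
        FullTransactionId safexid;  /* 64-bit XID that must be older than the
                                     * oldest running xact before the page can
                                     * be put back in the FSM (8 bytes) */
    } BTPendingRecycle;             /* 12 bytes of data, 16 after alignment */

With the default 4MB work_mem that is 4 * 1024 * 1024 / 16 = 262144
entries, and 262144 * 8KB = 2GB worth of deleted pages tracked in a
single pass, which matches your numbers.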

>
> > Yeah, increasing the threshold would solve the problem in most cases.
> > Given that nbtree index page deletion is unlikely to happen in
> > practice, setting the threshold to 5% or 10% seems to avoid the problem in
> > nearly 100% of cases, I think.
>
> Of course it all depends on workload/index characteristics, in the
> end. It is very rare to delete a percentage of the index that exceeds
> autovacuum_vacuum_scale_factor -- that's the important thing here IMV.
>
> > Another idea I came up with (maybe on top of your idea above) is to
> > change btm_oldest_btpo_xact to 64-bit XID and store the *newest*
> > btpo.xact XID among all deleted pages when the total amount of deleted
> > pages exceeds 2% of index. That way, we surely can recycle more than
> > 2% of index when the XID becomes older than the global xmin.
>
> You could make my basic approach to recycling deleted pages earlier
> (ideally at the end of the same btvacuumscan() that deleted the pages
> in the first place) more sophisticated in a variety of ways. These are
> all subject to diminishing returns, though.
>
> I've already managed to recycle close to 100% of all B-Tree pages
> during the same VACUUM with a very simple approach -- at least if we
> assume BenchmarkSQL is representative. It is hard to know how much
> more effort can be justified. To be clear, I'm not saying that an
> improved design cannot be justified now or in the future (BenchmarkSQL
> is not like many workloads that people use Postgres for). I'm just
> saying that I *don't know* where to draw the line. Any particular
> place that we draw the line feels a little arbitrary to me. This
> includes my own choice of the work_mem-limited BTPendingRecycle array.
> My patch currently works that way because it's simple -- no other
> reason.
>
> Any scheme to further improve the "work_mem-limited BTPendingRecycle
> array" design from my patch boils down to this: A new approach that
> makes recycling of any remaining deleted pages take place "before too
> long": After the end of the btvacuumscan() BTPendingRecycle array
> stuff (presumably that didn't work out in cases where an improved
> approach matters), but before the next VACUUM takes place (since that
> will do the required recycling anyway, unless it's unable to do any
> work at all, in which case it hardly matters).

I agree with this direction.

> Here are two ideas of
> my own in this same class as your idea:
>
> 1. Remember to do some of the BTPendingRecycle array FSM processing
> stuff in btvacuumcleanup() -- defer some of the recycling of pages
> recorded in BTPendingRecycle entries (pages deleted during
> btbulkdelete() for the same VACUUM) until btvacuumcleanup() is called.
>
> Right now btvacuumcleanup() will always do nothing when btbulkdelete()
> was called earlier. But that's just a current nbtree convention, and
> is no reason to not do this (we don't have to scan the index again at
> all). The advantage of teaching btvacuumcleanup() to do this is that
> it delays the "BTPendingRecycle array FSM processing" stuff until the
> last moment that it is still easy to use the in-memory array (because
> we haven't freed it yet). In general, doing it later makes it more
> likely that we'll successfully recycle the pages. Though in general it
> might not make any difference -- so we're still hoping that the
> workload allows us to recycle everything we deleted, without making
> the design much more complicated than what I posted already.
>
> (BTW I see that you reviewed commit 4e514c61, so you must have thought
> about the trade-off between doing deferred recycling in
> amvacuumcleanup() vs ambulkdelete(), when to call
> IndexFreeSpaceMapVacuum(), etc. But there is no reason why we cannot
> implement this idea while calling IndexFreeSpaceMapVacuum() during
> both btvacuumcleanup() and btbulkdelete(), so that we get the best of
> both worlds -- fast recycling *and* more delayed processing that is
> more likely to ultimately succeed.)

I think idea 1 also requires serializing the BTPendingRecycle array
somewhere so that it can be passed to a parallel vacuum worker in the
parallel vacuum case.

Delaying the "BTPendingRecycle array FSM processing" stuff until
btvacuumcleanup() is a good idea. But I think it's relatively rare in
practice for index vacuuming to run more than once per VACUUM (i.e.,
only when the dead TIDs use up maintenance_work_mem). So, considering
the development cost of serializing the BTPendingRecycle array and of
the index AM API changes, attempting to recycle the deleted pages at
the end of btvacuumscan() seems like a balanced strategy, roughly as
sketched below.
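
For reference, the kind of end-of-btvacuumscan() processing I am
picturing is roughly the following (just a sketch -- the helper name
is invented, and I am assuming the patch adds something like a
GlobalVis check that takes a FullTransactionId):

    /*
     * Sketch only: after the physical scan, try to hand every page this
     * VACUUM deleted over to the FSM.  Pages whose deletion XID is still
     * visible to some running transaction are simply skipped here.
     */
    static void
    _bt_pendingfsm_finalize(Relation rel, BTPendingRecycle *pending, int npending)
    {
        bool        recycled_any = false;

        for (int i = 0; i < npending; i++)
        {
            /* Safe to recycle only once no active snapshot can reach the page */
            if (GlobalVisCheckRemovableFullXid(NULL, pending[i].safexid))
            {
                RecordFreeIndexPage(rel, pending[i].blkno);
                recycled_any = true;
            }
        }

        /* Make the newly recorded free pages visible to _bt_getbuf() */
        if (recycled_any)
            IndexFreeSpaceMapVacuum(rel);
    }

In the common case where btvacuumscan() took a while, all (or almost
all) of the entries should pass the check by this point.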

>
> 2. Remember/serialize the BTPendingRecycle array when we realize that
> we cannot put all recyclable pages in the FSM at the end of the
> current btvacuumscan(), and then use an autovacuum work item to
> process them before too long -- a call to AutoVacuumRequestWork()
> could even serialize the data on disk.
>
> Idea 2 has the advantage of allowing retries -- eventually it will be
> safe to recycle the pages, if we just wait long enough.

This is a good idea too. Perhaps the autovacuum work item would need
to end with an error so that it gets retried later in the case where
it could not recycle all of the deleted pages.

I have also thought about storing pending-recycle pages somewhere so
that the XID-is-recyclable check doesn't require an index scan. My
idea was to store them in btree pages dedicated to this purpose,
linked from the meta page, but I prefer your idea.
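
To make idea 2 a bit more concrete, I am picturing something modeled
on the existing BRIN work item, e.g. (purely hypothetical --
AVW_BTRECYCLE_PAGE does not exist, and today each autovacuum work
item carries only a single block number):

    /*
     * Hypothetical: ask autovacuum to revisit the pages that were still
     * unsafe to recycle when this VACUUM finished.  "remaining" is the
     * subset of the BTPendingRecycle array that did not pass the
     * recyclability check above.  A real design would also need to decide
     * how the work item (or a serialized array on disk) carries the
     * 64-bit XID to re-check.
     */
    for (int i = 0; i < nremaining; i++)
        AutoVacuumRequestWork(AVW_BTRECYCLE_PAGE,
                              RelationGetRelid(rel),
                              remaining[i].blkno);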

>
> Anyway, I'm probably not going to pursue either of the 2 ideas for
> Postgres 14. I'm mentioning these ideas now because the trade-offs
> show that there is no perfect design for this deferring recycling
> stuff. Whatever we do, we should accept that there is no perfect
> design.
>
> Actually, there is one more reason why I bring up idea 1 now: I want
> to hear your thoughts on the index AM API questions now, which idea 1
> touches on. Ideally all of the details around the index AM VACUUM APIs
> (i.e. when and where the extra work happens -- btvacuumcleanup() vs
> btbulkdelete()) won't need to change much in the future. I worry about
> getting this index AM API stuff right, at least a little.

Since parallel vacuum was introduced, index AMs have not been able to
pass arbitrary information collected in ambulkdelete() to
amvacuumcleanup(), the way the old gist index code did. If there were
a good use case that needs to pass arbitrary information to
amvacuumcleanup(), I think it would be a good idea to add an index AM
API so that parallel vacuum can serialize it and hand it to another
parallel vacuum worker. But, as I mentioned above, given that VACUUM
calls ambulkdelete() only once in most cases, and that I think we'd
like to improve how TIDs are stored in the maintenance_work_mem space
(discussed a little on thread [1]), delaying "the BTPendingRecycle
array FSM processing stuff" until btvacuumcleanup() would not be a
good use case for such an API.
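
To be concrete about the kind of index AM API I mean, it would
probably have to look something like the following strawman additions
to IndexAmRoutine (nothing like this exists today; the names and
signatures are made up):

    /*
     * Strawman only: callbacks that would let parallel vacuum carry
     * AM-private ambulkdelete() state (e.g. a BTPendingRecycle array) into
     * the worker that later runs amvacuumcleanup().  The leader would size
     * the DSM area with amestimatebulkdeletestate(), the AM would copy its
     * state in with amserializebulkdeletestate(), and the cleanup worker
     * would reconstruct it with amrestorebulkdeletestate().
     */
    typedef Size (*amestimatebulkdeletestate_function) (Relation indexRelation);
    typedef void (*amserializebulkdeletestate_function) (IndexVacuumInfo *info,
                                                         IndexBulkDeleteResult *stats,
                                                         void *dsm_space);
    typedef void (*amrestorebulkdeletestate_function) (IndexVacuumInfo *info,
                                                       IndexBulkDeleteResult *stats,
                                                       void *dsm_space);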

>
> > Also, maybe we can record deleted pages to the FSM even without
> > deferring, and check it when re-using. That is, when we get a free
> > page from the FSM we check if the page is really recyclable (maybe
> > _bt_getbuf() already does this?). IOW, a deleted page is recycled
> > only when it's requested to be reused. If btpo.xact is a 64-bit XID
> > we never need to worry about the case where a deleted page is never
> > requested to be reused.
>
> I've thought about that too (both now and in the past). You're right
> about _bt_getbuf() -- it checks the XID, at least on the master
> branch. I took that XID check out in v4 of the patch, but I am now
> starting to have my doubts about that particular choice. (I'm probably
> going to restore the XID check in _bt_getbuf in v5 of the patch.)
>
> I took the XID-is-recyclable check out in v4 of the patch because it
> might leak pages in rare cases -- which is not a new problem.
> _bt_getbuf() currently has a remarkably relaxed attitude about leaking
> pages from the FSM (it is more relaxed about it than I am, certainly)
> -- but why should we just accept leaking pages like that? My new
> doubts about it are non-specific, though. We know that the FSM isn't
> crash safe -- but I think that that reduces to "practically speaking,
> we can never 100% trust the FSM". Which makes me nervous. I worry that
> the FSM can do something completely evil and crazy in rare cases.
>
> It's not just crash safety. The FSM's fsm_search_avail() function
> currently changes the fp_next_slot field with only a shared buffer
> lock held. It's an int, which is supposed to "be atomic on most
> platforms". But we should be using real atomic ops. So the FSM is
> generally...kind of wonky.
>
> In an ideal world, nbtree page deletion + recycling would have crash
> safety built in. I don't think that it makes sense to have free
> space management without crash safety in the case of index AMs,
> because it's just not worth it with whole-page units of free space
> (heapam is another story). A 100% crash-safe design would naturally
> shift the problem of nbtree page recycle safety from the
> producer/VACUUM side, to the consumer/_bt_getbuf() side, which I agree
> would be a real improvement. But these long standing FSM issues are
> not going to change for Postgres 14. And so changing _bt_getbuf() to
> do clever things with XIDs won't be possible for Postgres 14 IMV.

Agreed. Thanks for your explanation.
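
(Just for future reference, the kind of consumer-side check you would
presumably restore in v5 is essentially today's _bt_page_recyclable()
test done with the full 64-bit XID at the point where _bt_getbuf()
pulls a page from the FSM -- roughly the sketch below.
BTPageGetDeleteXid() is a name I made up for reading the stored
64-bit XID back out of a deleted page, and I am again assuming a
GlobalVis helper that takes a FullTransactionId.)

    /*
     * Sketch of the consumer-side safety check: a page taken from the FSM
     * is only reused if it is new, or if it is a deleted page whose
     * deletion XID is already "in the past" for every possible snapshot.
     * BTPageGetDeleteXid() is an invented accessor here.
     */
    static bool
    _bt_page_recyclable_full(Page page)
    {
        BTPageOpaque opaque;

        /* A freshly extended page has no content and is always usable */
        if (PageIsNew(page))
            return true;

        opaque = (BTPageOpaque) PageGetSpecialPointer(page);

        return P_ISDELETED(opaque) &&
            GlobalVisCheckRemovableFullXid(NULL, BTPageGetDeleteXid(page));
    }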

Regards,

[1] https://www.postgresql.org/message-id/flat/CA%2Bfd4k76j8jKzJzcx8UqEugvayaMSnQz0iLUt_XgBp-_-bd22A%40mail.gmail.com

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
