Re: 64-bit XIDs in deleted nbtree pages

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: 64-bit XIDs in deleted nbtree pages
Date: 2021-02-15 11:14:48
Message-ID: CAD21AoAaHg86bGm=k8cBtK9HeO46QGRMX4pxNt5gt_11ispFGA@mail.gmail.com
Lists: pgsql-hackers

On Sun, Feb 14, 2021 at 3:47 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Fri, Feb 12, 2021 at 9:04 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > On Fri, Feb 12, 2021 at 8:38 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > I agree that there already are huge problems in that case. But I think
> > > we need to consider an append-only case as well; after bulk deletion
> > > on an append-only table, vacuum deletes heap tuples and index tuples,
> > > marking some index pages as dead and setting an XID into btpo.xact.
> > > Since we trigger autovacuums even for insertions, based on
> > > autovacuum_vacuum_insert_scale_factor/threshold, autovacuum will run on
> > > the table again. But if there is a long-running query a "wasted"
> > > cleanup scan could happen many times depending on the values of
> > > autovacuum_vacuum_insert_scale_factor/threshold and
> > > vacuum_cleanup_index_scale_factor. This should not happen in the old
> > > code. I agree this is DBA problem but it also means this could bring
> > > another new problem in a long-running query case.
> >
> > I see your point.
>
> My guess is that this concern of yours is somehow related to how we do
> deletion and recycling *in general*. Currently (and even in v3 of the
> patch), we assume that recycling the pages that a VACUUM operation
> deletes will happen "eventually". This kind of makes sense when you
> have "typical vacuuming" -- deletes/updates, and no big bursts, rare
> bulk deletes, etc.
>
> But when you do have a mixture of different triggering conditions,
> which is quite possible, it is difficult to understand what
> "eventually" actually means...
>
> > BTW, I am thinking about making recycling take place for pages that
> > were deleted during the same VACUUM. We can just use a
> > work_mem-limited array to remember a list of blocks that are deleted
> > but not yet recyclable (plus the XID found in the block).
>
> ...which brings me back to this idea.
>
> I've prototyped this. It works really well. In most cases the
> prototype makes VACUUM operations with nbtree index page deletions
> also recycle the pages that were deleted, at the end of the
> btvacuumscan(). We do very little or no "indefinite deferring" work
> here. This has obvious advantages, of course, but it also has a
> non-obvious advantage: the awkward question concerning "what
> eventually actually means" with mixed triggering conditions over time
> mostly goes away. So perhaps this actually addresses your concern,
> Masahiko.

Yes. I think this would simplify the problem by resolving almost all of
the problems related to indefinitely deferring page recycling.

We will be able to recycle almost all just-deleted pages in practice,
especially when btvacuumscan() takes a long time. And there would not
be a noticeable downside, I think.
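
Just to check my understanding, I imagine the mechanism looking roughly
like this (only a sketch, assuming the 64-bit btpo.xact from this patch;
BTPendingRecycle, safexid, and _bt_recycle_pending are names I invented
here, not taken from your prototype):

typedef struct BTPendingRecycle
{
    BlockNumber       blkno;    /* page we deleted in this btvacuumscan() */
    FullTransactionId safexid;  /* XID stamped into the deleted page */
} BTPendingRecycle;

/*
 * At the end of btvacuumscan(), place every pending page whose XID is
 * already older than every possible stale link into the FSM.  Anything
 * still unsafe is simply left for a future VACUUM, as before.
 */
static void
_bt_recycle_pending(Relation rel, BTPendingRecycle *pending, int npending)
{
    for (int i = 0; i < npending; i++)
    {
        if (GlobalVisCheckRemovableFullXid(NULL, pending[i].safexid))
            RecordFreeIndexPage(rel, pending[i].blkno);
    }

    /* make the newly recorded free pages visible to other backends */
    IndexFreeSpaceMapVacuum(rel);
}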

BTW, if the btree index starts to use maintenance_work_mem for this
purpose, we also need to set amusemaintenanceworkmem to true, which is
taken into account during parallel vacuum.
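
For example, just a sketch of the one-line change I mean in bthandler()
(the other amroutine fields are elided):

Datum
bthandler(PG_FUNCTION_ARGS)
{
    IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);

    /* ... other amroutine fields as today ... */

    /*
     * Advertise that nbtree VACUUM now uses maintenance_work_mem, so
     * that the parallel vacuum leader divides the memory budget among
     * indexes correctly (GIN already sets this to true).
     */
    amroutine->amusemaintenanceworkmem = true;

    /* ... */

    PG_RETURN_POINTER(amroutine);
}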

>
> I've been testing this with BenchmarkSQL [1], which has several
> indexes that regularly need page deletions. There is also a realistic
> "life cycle" to the data in these indexes. I added custom
> instrumentation to display information about what's going on with page
> deletion when the benchmark is run. I wrote a quick-and-dirty patch
> that makes log_autovacuum show the same information that you see about
> index page deletion when VACUUM VERBOSE is run (including the new
> pages_newly_deleted field from my patch). With this particular
> TPC-C/BenchmarkSQL workload, VACUUM seems to consistently manage to go
> on to place every page that it deletes in the FSM without leaving
> anything to the next VACUUM. There are a very small number of
> exceptions where we "only" manage to recycle maybe 95% of the pages
> that were deleted.

Great!

>
> The race condition that nbtree avoids by deferring recycling was
> always a narrow one, outside of the extremes -- the way we defer has
> always been overkill. It's almost always unnecessary to delay placing
> deleted pages in the FSM until the *next* VACUUM. We only have to
> delay it until the end of the *same* VACUUM -- why wait until the next
> VACUUM if we don't have to? In general this deferring recycling
> business has nothing to do with MVCC/GC/whatever, and yet the code
> seems to suggest that it does. While it is convenient to use an XID
> for page deletion and recycling as a way of implementing what Lanin &
> Shasha call "the drain technique" [2], all we have to do is prevent
> certain race conditions. This is all about the index itself, the data
> structure, how it is maintained -- nothing more. It almost seems
> obvious to me.

Agreed.

>
> It's still possible to imagine extremes. Extremes that even the "try
> to recycle pages we ourselves deleted when we reach the end of
> btvacuumscan()" version of my patch cannot deal with. Maybe it really
> is true that it's inherently impossible to recycle a deleted page even
> at the end of a VACUUM -- maybe a long-running transaction (that could
> in principle have a stale link to our deleted page) starts before we
> VACUUM, and lasts after VACUUM finishes. So it's just not safe. When
> that happens, we're back to having the original problem: we're relying
> on some *future* VACUUM operation to do that for us at some indefinite
> point in the future. It's fair to wonder: What are the implications of
> that? Are we not back to square one? Don't we have the same "what does
> 'eventually' really mean" problem once again?
>
> I think that that's okay, because this remaining case is a *truly*
> extreme case (especially with a large index, where index vacuuming
> will naturally take a long time).

Right.

>
> It will be rare. But more importantly, the fact that scenario is now
> an extreme case justifies treating it as an extreme case. We can teach
> _bt_vacuum_needs_cleanup() to recognize it as an extreme case, too. In
> particular, I think that it will now be okay to increase the threshold
> applied when considering deleted pages inside
> _bt_vacuum_needs_cleanup(). It was 2.5% of the index size in v3 of the
> patch. But in v4, which has the new recycling enhancement, I think
> that it would be sensible to make it 5%, or maybe even 10%. This
> naturally makes Masahiko's problem scenario unlikely to actually
> result in a truly wasted call to btvacuumscan(). The number of pages
> that the metapage indicates are "deleted but not yet placed in the
> FSM" will be close to the theoretical minimum, because we're no longer
> naively throwing away information about which specific pages will be
> recyclable soon. Which is what the current approach does, really.
>

Yeah, increasing the threshold would solve the problem in most cases.
Given that nbtree index page deletion is unlikely to happen frequently
in practice, setting the threshold to 5% or 10% seems to avoid the
problem in nearly 100% of cases, I think.

Another idea I came up with (maybe on top of your idea above) is to
change btm_oldest_btpo_xact to a 64-bit XID and store the *newest*
btpo.xact XID among all deleted pages once the total number of deleted
pages exceeds 2% of the index. That way, we can be sure to recycle more
than 2% of the index once that XID becomes older than the global xmin.
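
Something like this, roughly (only a sketch to illustrate the idea;
btm_newest_btpo_xact is a field name I just made up, and I am passing
the metapage data in as a parameter to hand-wave over how it is read
and cached):

typedef struct BTMetaPageData
{
    /* ... existing fields ... */

    /*
     * Newest btpo.xact among all deleted-but-not-yet-recycled pages,
     * recorded only once such pages exceed ~2% of the index.  64-bit,
     * so it never needs wraparound protection.
     */
    FullTransactionId btm_newest_btpo_xact;
} BTMetaPageData;

static bool
_bt_vacuum_needs_cleanup(IndexVacuumInfo *info, BTMetaPageData *metad)
{
    /*
     * Trigger a cleanup-only scan only once the stored XID is older
     * than the global xmin, i.e. once we know that at least the
     * deferred ~2% of the index can actually be recycled.
     */
    if (FullTransactionIdIsValid(metad->btm_newest_btpo_xact) &&
        GlobalVisCheckRemovableFullXid(NULL, metad->btm_newest_btpo_xact))
        return true;

    /* ... fall back to the existing heap-tuples-based heuristics ... */
    return false;
}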

Also, maybe we can record deleted pages in the FSM even without
deferring, and check recyclability when reusing them. That is, when we
get a free page from the FSM, we check whether the page is really
recyclable (maybe _bt_getbuf() already does this?). IOW, a deleted page
would be recycled only when it is actually requested for reuse. If
btpo.xact is a 64-bit XID, we never need to worry about the case where
a deleted page is never requested for reuse.
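
(Checking the current code, _bt_getbuf() does seem to re-check a page
taken from the FSM before reusing it, via _bt_page_recyclable(), which
looks roughly like the following; I'm showing the pre-snapshot-
scalability RecentGlobalXmin form for brevity, and with a 64-bit
btpo.xact the comparison would simply become a FullTransactionId one.)

bool
_bt_page_recyclable(Page page)
{
    BTPageOpaque opaque;

    /* a never-initialized page is always safe to reuse */
    if (PageIsNew(page))
        return true;

    opaque = (BTPageOpaque) PageGetSpecialPointer(page);

    /*
     * Deleted, and deleted long enough ago that no scan that was in
     * flight at deletion time can still hold a link to this page.
     */
    if (P_ISDELETED(opaque) &&
        TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin))
        return true;

    return false;
}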

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
