Re: _bt_split(), and the risk of OOM before its critical section

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: _bt_split(), and the risk of OOM before its critical section
Date: 2019-05-06 22:29:11
Message-ID: 17768.1557181751@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Peter Geoghegan <pg(at)bowt(dot)ie> writes:
> Commit 8fa30f906be reduced the elevel of a number of "can't happen"
> errors from PANIC to ERROR. These were all critical-section-adjacent
> errors involved in nbtree page splits, and nbtree page deletion. It
> also established the following convention within _bt_split(), which
> allowed Tom to keep the length of the critical section just as short
> as it had always been:

> /*
> * origpage is the original page to be split. leftpage is a temporary
> * buffer that receives the left-sibling data, which will be copied back
> * into origpage on success. rightpage is the new page that receives the
> * right-sibling data. If we fail before reaching the critical section,
> * origpage hasn't been modified and leftpage is only workspace. In
> * principle we shouldn't need to worry about rightpage either, because it
> * hasn't been linked into the btree page structure; but to avoid leaving
> * possibly-confusing junk behind, we are careful to rewrite rightpage as
> * zeroes before throwing any error.
> */

> The INCLUDE indexes work looks like it subtly broke this, because it
> allocated memory after the initialization of the right page --
> allocating memory can always fail.

Yeah, as _bt_split is currently coded, _bt_truncate has to be a "no
errors" function, which it isn't. The pfree for its result is being
done in an ill-chosen place, too.

Another problem now that I look at it is that the _bt_getbuf for the right
sibling is probably not too safe. And the _bt_vacuum_cycleid() call seems
a bit dangerous from this standpoint as well.

> On the other hand, even when
> 8fa30f906be went in back in 2010 this "rule" was arguably broken,
> because we were already calling PageGetTempPage() after the right
> sibling page is initialized, which palloc()s a full BLCKSZ, which is
> far more than truncation is every likely to allocate.

I'm not really concerned about that one because at that point the
right page is still in a freshly-pageinit'd state. It's perhaps
not quite as nice as having it be zeroes, but it won't look like
it has any interesting data. (But, having said that, we could
certainly reorder the code to construct the temp page first.)

In any case, once we've started to fill the ropaque area, it would really
be better if we don't call anything that could throw errors.

Maybe we should bite the bullet and use two temp pages, so that none
of the data ends up in the shared buffer arena until we reach the
critical section? The extra copying is slightly annoying, but
it certainly seems like enforcing this invariant over such a
long stretch of code is not very maintainable.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2019-05-06 23:11:55 Re: _bt_split(), and the risk of OOM before its critical section
Previous Message David Fetter 2019-05-06 22:07:04 Re: range_agg