Re: _bt_split(), and the risk of OOM before its critical section

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: _bt_split(), and the risk of OOM before its critical section
Date: 2019-05-06 23:11:55
Message-ID: CAH2-Wz=r2ae=mLvD-o5hjzcbARBXJSXDZcL_v1J=FCzCb0Vqaw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, May 6, 2019 at 3:29 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Yeah, as _bt_split is currently coded, _bt_truncate has to be a "no
> errors" function, which it isn't. The pfree for its result is being
> done in an ill-chosen place, too.

I am tempted to move the call to _bt_truncate() out of _bt_split()
entirely on HEAD, possibly relocating it to
nbtsplitloc.c/_bt_findsplitloc(). That way, there is a clearer
separation between how split points are chosen, suffix truncation, and
the mechanical process of executing a legal page split.

> Another problem now that I look at it is that the _bt_getbuf for the right
> sibling is probably not too safe. And the _bt_vacuum_cycleid() call seems
> a bit dangerous from this standpoint as well.

Yeah, we can tighten those up without much difficulty.

> I'm not really concerned about that one because at that point the
> right page is still in a freshly-pageinit'd state. It's perhaps
> not quite as nice as having it be zeroes, but it won't look like
> it has any interesting data.

The important question is how VACUUM will recognize it. It's clearly
not as bad as something that causes "failed to re-find parent key"
errors, but I think that VACUUM might not be reclaiming it for the FSM
(haven't checked). Note that _bt_unlink_halfdead_page() is perfectly
happy to ignore the fact that the left sibling of a half-dead page has
a rightlink that doesn't point back to the target. Because, uh, there
might have been a concurrent page deletion, somehow.

We have heard a lot about "failed to re-find parent key" errors from
VACUUM before now because that is about the only strong cross-check
that it does. (Not that I'm arguing that we need more of that.)

> In any case, once we've started to fill the ropaque area, it would really
> be better if we don't call anything that could throw errors.
>
> Maybe we should bite the bullet and use two temp pages, so that none
> of the data ends up in the shared buffer arena until we reach the
> critical section? The extra copying is slightly annoying, but
> it certainly seems like enforcing this invariant over such a
> long stretch of code is not very maintainable.

While I think that the smarts we have around deciding a split point
will probably improve in future releases, and that we'll probably make
_bt_truncate() itself do more, the actual business of performing a
split has no reason to change that I can think of. I would like to
keep _bt_split() as simple as possible anyway -- it should only be
copying tuples using simple primitives like the bufpage.c routines.
Living with what we have now (not using a temp buffer for the right
page) seems fine.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Paul Jungwirth 2019-05-06 23:21:45 Re: range_agg
Previous Message Tom Lane 2019-05-06 22:29:11 Re: _bt_split(), and the risk of OOM before its critical section