_bt_split(), and the risk of OOM before its critical section

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: _bt_split(), and the risk of OOM before its critical section
Date: 2019-05-06 19:48:41
Message-ID: CAH2-WzkcWT_-NH7EeL=Az4efg0KCV+wArygW8zKB=+HoP=VWMw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Commit 8fa30f906be reduced the elevel of a number of "can't happen"
errors from PANIC to ERROR. These were all critical-section-adjacent
errors involved in nbtree page splits, and nbtree page deletion. It
also established the following convention within _bt_split(), which
allowed Tom to keep the length of the critical section just as short
as it had always been:

/*
* origpage is the original page to be split. leftpage is a temporary
* buffer that receives the left-sibling data, which will be copied back
* into origpage on success. rightpage is the new page that receives the
* right-sibling data. If we fail before reaching the critical section,
* origpage hasn't been modified and leftpage is only workspace. In
* principle we shouldn't need to worry about rightpage either, because it
* hasn't been linked into the btree page structure; but to avoid leaving
* possibly-confusing junk behind, we are careful to rewrite rightpage as
* zeroes before throwing any error.
*/

The INCLUDE indexes work looks like it subtly broke this, because it
allocated memory after the initialization of the right page --
allocating memory can always fail. On the other hand, even when
8fa30f906be went in back in 2010 this "rule" was arguably broken,
because we were already calling PageGetTempPage() after the right
sibling page is initialized, which palloc()s a full BLCKSZ, which is
far more than truncation is every likely to allocate.

On the other other hand, it seems to me that the PageGetTempPage()
thing might have been okay, because it happens before the high key is
inserted on the new right buffer page. The same cannot be said for the
way we generate a new high key for the left/old page via suffix
truncation, which happens to occur after the right buffer page is
first modified by inserted its high key (the original/left page's
original high key). I think that there may be a risk that VACUUM's
page deletion code will get confused by finding an errant right
sibling page from a failed page split when there is a high key. If so,
that would be a risk that was introduced in Postgres 11, and made much
more likely in practice in Postgres 12. (I haven't got as far as doing
an analysis of the risks to page deletion, though. The "fastpath"
rightmost page insertion optimization that was also added to Postgres
11 seems like it also might need to be considered here.)

--
Peter Geoghegan

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2019-05-06 19:58:22 Re: Attempt to consolidate reading of XLOG page
Previous Message Alvaro Herrera 2019-05-06 18:21:34 Re: Attempt to consolidate reading of XLOG page