Orphan page in _bt_split

From: Konstantin Knizhnik <knizhnik(at)garret(dot)ru>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Orphan page in _bt_split
Date: 2025-09-01 05:03:18
Message-ID: 566dacaf-5751-47e4-abc6-73de17a5d42a@garret.ru
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi hacker,

The function _bt_split creates new (right) page to split into.
The comment says:

    /*
     * Acquire a new right page to split into, now that left page has a new
     * high key.  From here on, it's not okay to throw an error without
     * zeroing rightpage first.  This coding rule ensures that we won't
     * confuse future VACUUM operations, which might otherwise try to
re-find
     * a downlink to a leftover junk page as the page undergoes deletion.
     *
     * It would be reasonable to start the critical section just after
the new
     * rightpage buffer is acquired instead; that would allow us to avoid
     * leftover junk pages without bothering to zero rightpage. We do
it this
     * way because it avoids an unnecessary PANIC when either origpage
or its
     * existing sibling page are corrupt.
     */

So we should zero this page in case of reporting error to "not confuse
vacuum.
It is done for all places where this function reports error before
entering critical section and wal-logging this page.

But before it there is call to _bt_getbuf:

    /*
     * We have to grab the original right sibling (if any) and update
its prev
     * link.  We are guaranteed that this is deadlock-free, since we couple
     * the locks in the standard order: left to right.
     */
    if (!isrightmost)
    {
        sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE);

_bp_getbuf just reads page in a shared buffer. But ReadBuffer actually
can invoke half of Postgres code (locating buffer, evicting victim,
writing to SMGR,..., So there are a lot of places where error can be
thrown (including even such error as statement timeout).
And in this case current transaction will be aborted without zeroing the
right page.
So vacuum will be confused.

It can be quite easily demonstrated by inserting error here:

    /*
     * We have to grab the original right sibling (if any) and update
its prev
     * link.  We are guaranteed that this is deadlock-free, since we couple
     * the locks in the standard order: left to right.
     */
    if (!isrightmost)
    {
        sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE);
        elog(ERROR, "@@@ Terminated");

And doing something like this:

postgres=# create table t3(pk integer primary key, sk integer, payload
integer);
CREATE TABLE
postgres=# create index on t3(sk);
CREATE INDEX
postgres=# insert into t3 values
(generate_series(1,1000000),random()*1000000,0);
ERROR:  @@@ Terminated
postgres=# vacuum t3;
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

log:

2025-09-01 07:46:37.459 EEST [89290] ERROR:  @@@ Terminated
2025-09-01 07:46:37.459 EEST [89290] STATEMENT:  insert into t3 values
(generate_series(1,1000000),random()*1000000,0);
2025-09-01 07:46:42.391 EEST [89406] LOG:  failed to re-find parent key
in index "t3_sk_idx" for deletion target page 4
2025-09-01 07:46:42.391 EEST [89406] CONTEXT:  while vacuuming index
"t3_sk_idx" of relation "public.t3"
TRAP: failed Assert("false"), File: "nbtpage.c", Line: 2849, PID: 89406
0   postgres                            0x0000000104bf2f64
ExceptionalCondition + 216
1   postgres                            0x00000001043ace94
_bt_lock_subtree_parent + 248
2   postgres                            0x00000001043aaf98
_bt_mark_page_halfdead + 508
3   postgres                            0x00000001043aaaec _bt_pagedel +
1068
4   postgres                            0x00000001043b56cc btvacuumpage
+ 2116
5   postgres                            0x00000001043b4dd4 btvacuumscan
+ 564

This code is not changed for quite long time so I wonder why nobody
noticed this error before?

Proposed patch is attached.
It creates copy of right page in the same way as for left (original) page.
And updates the page only in critical section.

Attachment Content-Type Size
bt_split.patch text/plain 4.7 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2025-09-01 05:05:56 Re: Update outdated references to SLRU ControlLock
Previous Message Zhijie Hou (Fujitsu) 2025-09-01 04:45:42 RE: Conflict detection for update_deleted in logical replication