Re: Orphan page in _bt_split

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Konstantin Knizhnik <knizhnik(at)garret(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Orphan page in _bt_split
Date: 2025-09-01 18:35:56
Message-ID: CAH2-Wz=RVy+uFn4SAmu3bNzkJfqOhhnAgamBbQZyMMRz09GeSw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Sep 1, 2025 at 1:03 AM Konstantin Knizhnik <knizhnik(at)garret(dot)ru> wrote:
> So we should zero this page in case of reporting error to "not confuse
> vacuum.
> It is done for all places where this function reports error before
> entering critical section and wal-logging this page.

Right.

> _bp_getbuf just reads page in a shared buffer. But ReadBuffer actually
> can invoke half of Postgres code (locating buffer, evicting victim,
> writing to SMGR,..., So there are a lot of places where error can be
> thrown (including even such error as statement timeout).
> And in this case current transaction will be aborted without zeroing the
> right page.
> So vacuum will be confused.

...

> 2025-09-01 07:46:37.459 EEST [89290] ERROR: @@@ Terminated
> 2025-09-01 07:46:37.459 EEST [89290] STATEMENT: insert into t3 values
> (generate_series(1,1000000),random()*1000000,0);
> 2025-09-01 07:46:42.391 EEST [89406] LOG: failed to re-find parent key
> in index "t3_sk_idx" for deletion target page 4
> 2025-09-01 07:46:42.391 EEST [89406] CONTEXT: while vacuuming index
> "t3_sk_idx" of relation "public.t3"
> TRAP: failed Assert("false"), File: "nbtpage.c", Line: 2849, PID: 89406

> This code is not changed for quite long time so I wonder why nobody
> noticed this error before?

The assertion failure + log output that you've shown is from
_bt_lock_subtree_parent. I'm aware that that error appears from time
to time in the field. When we hit this log message in a non-assert
build, VACUUM should still be able to finish successfully.

My assumption up until now has been that the corruption underlying
these "failed to re-find parent" LOG messages is generally due to
breaking changes to collations (page deletion uses an insertion scan
key to "re-find" a downlink to the page undergoing deletion), and
generic corruption. But it now seems possible that some of them were
due to the problem that you've highlighted.

The problem that you've highlighted happens late within _bt_split,
when the contents of the new right sibling page (everything except its
LSN) has already been written to the page. So I'm fairly confident
that any real world problems would show themselves as "failed to
re-find parent" LOG message from VACUUM (there are no downlinks or
sibling pointers that point to the new right page, so only VACUUM will
ever be able to read the page).

> Proposed patch is attached.
> It creates copy of right page in the same way as for left (original) page.
> And updates the page only in critical section.

I remember that when I worked on what became commit 9b42e71376 back in
2019 (which fixed a similar problem caused by the INCLUDE index
patch), Tom suggested that we do things this way defensively (without
being specifically aware of the _bt_getbuf hazard). That does seem
like the best approach.

I'm a little bit worried about the added overhead of allocating an
extra BLCKSZ buffer for this. I wonder if it would make sense to use
BLCKSZ arrays of char for both of the temp pages.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Lakhin 2025-09-01 19:00:00 Re: Improving tracking/processing of buildfarm test failures
Previous Message Florents Tselai 2025-09-01 15:44:01 Re: split func.sgml to separated individual sgml files