Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Andrey Borodin <amborodin86(at)gmail(dot)com>, Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE
Date: 2023-02-06 21:02:05
Message-ID: 20230206210205.uaaoe2l26fv256hs@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2023-02-04 11:10:55 -0800, Peter Geoghegan wrote:
> On Sat, Feb 4, 2023 at 2:57 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Is there a good way to make breakage in the page recycling mechanism
> > visible with gist? I guess to see corruption, I'd have to halt a scan
> > before a page is visited with gdb, then cause the page to be recycled
> > prematurely in another session, then unblock the first? Which'd then
> > visit that page, thinking it to be in a different part of the tree than
> > it actually is?
>
> Yes. This bug is similar to an ancient nbtree bug fixed back in 2012,
> by commit d3abbbeb.
>
> > which clearly doesn't seem right.
> >
> > I just can't quite judge how bad that is.
>
> It's really hard to judge, even if you're an expert. We're talking
> about a fairly chaotic scenario. My guess is that there is a very
> small chance of a very unpleasant scenario if you have a GiST index
> that has regular page deletions, and if you use
> vacuum_defer_cleanup_age. It's likely that most GiST indexes never
> have any page deletions due to the workload characteristics.

Thanks.

Sounds like the problem here is too hard to repro. I mostly wanted to know how
to be more confident about a fix working correctly. There are no tests for the
whole page recycling behaviour, afaics, so it's a bit scary to change things
around.

I didn't quite feel confident pushing a fix for this just before a minor
release, so I'll push once the minor releases are tagged. The plan is a quite
minimal fix to GetFullRecentGlobalXmin() in 12-13 (returning
FirstNormalTransactionId if epoch == 0 and RecentGlobalXmin > nextxid_xid), and
the slightly larger fix in 14+.
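
To make the 12-13 guard concrete, here is a rough standalone sketch of the
epoch computation with the extra check. It is a simplified model, not the
actual patch: the types and the names xid32, full_xid, make_full_xid() and
full_recent_global_xmin() are made up for illustration, and only the condition
(epoch == 0 and RecentGlobalXmin > nextxid_xid) and the returned
FirstNormalTransactionId come from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for TransactionId / FullTransactionId (assumption). */
typedef uint32_t xid32;
typedef uint64_t full_xid;	/* epoch in the high 32 bits, xid in the low 32 */

#define FIRST_NORMAL_XID ((xid32) 3)	/* mirrors FirstNormalTransactionId */

static full_xid
make_full_xid(uint32_t epoch, xid32 xid)
{
	return ((full_xid) epoch << 32) | xid;
}

/*
 * Hypothetical model of GetFullRecentGlobalXmin() with the guard applied.
 * next_full plays the role of the next full XID, recent_global_xmin the
 * role of RecentGlobalXmin.
 */
static full_xid
full_recent_global_xmin(full_xid next_full, xid32 recent_global_xmin)
{
	uint32_t	next_epoch = (uint32_t) (next_full >> 32);
	xid32		next_xid = (xid32) next_full;

	assert(recent_global_xmin >= FIRST_NORMAL_XID);

	if (recent_global_xmin > next_xid)
	{
		/*
		 * The xmin looks like it belongs to the previous epoch. In epoch 0
		 * there is no previous epoch, so "epoch - 1" would wrap around;
		 * clamp to the first normal XID instead.
		 */
		if (next_epoch == 0)
			return make_full_xid(0, FIRST_NORMAL_XID);
		return make_full_xid(next_epoch - 1, recent_global_xmin);
	}

	return make_full_xid(next_epoch, recent_global_xmin);
}
```

With the next XID at 1000 in epoch 0 and an xmin that has wrapped to a very
large 32-bit value, the sketch returns epoch 0 / XID 3 rather than an XID in a
bogus epoch. The same xmin seen in epoch 2 still maps to epoch 1, as before.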

Greetings,

Andres Freund
