Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic
Date: 2021-06-16 20:21:58
Message-ID: CAH2-Wzk_L7Z7LREHTtg5vY08eeWdnHO70m98eWx4U1uwvW=0sA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jun 16, 2021 at 12:22 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> I think it's more complicated than that - "before" isn't a guarantee when the
> horizon can go backwards. Consider the case where a hot_standby_feedback=on
> replica without a slot connects - that can result in the xid horizon going
> backwards.

Oh yeah, I think that I get it now. Tell me if this sounds right to you:

It's not so much that HeapTupleSatisfiesVacuum() "disagrees" with
heap_prune_satisfies_vacuum() in a way that actually matters to
VACUUM. While there does seem to be a fairly mundane bug in
GetOldestNonRemovableTransactionId() that really is a matter of
disagreement between the two functions, the fundamental issue is
deeper than that. The fundamental issue is that it's not okay to
assume that the XID horizon won't go backwards. This probably matters
for lots of reasons. The most obvious reason is that in theory it
could cause lazy_scan_prune() to get stuck in about the same way as
Justin reported, with the GetOldestNonRemovableTransactionId() bug.

This isn't an issue in the backbranches because we're using the same
OldestXmin value directly when calling heap_page_prune(). We only ever
have one xid horizon cutoff like that per VACUUM (we only have
OldestXmin, no vistest), so clearly it's not a problem.

> I think a good way to address this might be to have GlobalVisUpdateApply()
> ensure that maybe_needed does not go backwards within one backend.
>
> This is *nearly* already guaranteed within vacuum, except for the case where a
> catalog access between vacuum_set_xid_limits() and GlobalVisTestFor() could
> lead to an attempt at pruning, which could move maybe_needed to go backwards
> theoretically if inbetween those two steps a replica connected that causes the
> horizon to go backwards.

This would at least be easy to test. I like the idea of adding invariants.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2021-06-16 21:13:16 Re: Python 3.10 breaks regression tests with traceback changes
Previous Message Zhihong Yu 2021-06-16 20:18:16 Re: A qsort template