Quick Links

Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From:	Noah Misch <noah(at)leadboat(dot)com>
To:	Peter Geoghegan <pg(at)bowt(dot)ie>
Cc:	Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, robertmhaas(at)gmail(dot)com, Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject:	Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
Date:	2024-01-06 22:44:48
Message-ID:	20240106224448.c7@rfd.leadboat.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-bugs

On Sat, Jan 06, 2024 at 01:30:40PM -0800, Peter Geoghegan wrote:
> On Sat, Jan 6, 2024 at 12:24 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> > On Sun, Dec 31, 2023 at 03:53:34PM -0800, Peter Geoghegan wrote:
> > > My guess is that there is a decent chance that backpatching 1ccc1e05ae
> > > would be okay, but that isn't much use. I really don't know either way
> > > right now. And I wouldn't like to speculate too much further before
> > > gaining a proper understanding of what's going on here.
> >
> > Fair enough. While I agree there's a decent chance back-patching would be
> > okay, I think there's also a decent chance that 1ccc1e05ae creates the problem
> > Matthias theorized. Something like: we update relfrozenxid based on
> > OldestXmin, even though GlobalVisState caused us to retain a tuple older than
> > OldestXmin. Then relfrozenxid disagrees with table contents.
>
> Either every relevant code path has the same OldestXmin to work off
> of, or the whole NewRelfrozenXid/relfrozenxid-tracking thing can't be
> expected to work as designed. I find it a bit odd that
> pruneheap.c/GlobalVisState has no direct understanding of this
> dependency (none that I can discern, at least). Wouldn't it at least
> be more natural if pruneheap.c could access OldestXmin when run inside
> VACUUM? (Could just be used by defensive hardening code.)

Tied to that decision is the choice of semantics when the xmin horizon moves
backward during one VACUUM, e.g. when a new walsender xmin does so. Options:

1. Continue to remove tuples based on the OldestXmin from VACUUM's start. We
could have already removed some of those tuples, so the walsender xmin
won't achieve a guarantee anyway. (VACUUM would want ratchet-like behavior
in GlobalVisState, possibly by sharing OldestXmin with pruneheap like you
say.)

2. Move OldestXmin backward, to reflect the latest xmin horizon. (Perhaps
VACUUM would just pass GlobalVisState to a function that returns the
compatible OldestXmin.)

Which way is better?

> We're also relying on vacuumlazy.c's call to vacuum_get_cutoffs()
> (which itself calls GetOldestNonRemovableTransactionId) taking place
> before vacuumlazy.c goes on to call GlobalVisTestFor() a few lines
> further down (I think). It seems like even the code in procarray.c
> might have something to say about the vacuumlazy.c dependency, too.
> But offhand it doesn't look like it does, either. Why shouldn't we
> expect random implementation details in code like ComputeXidHorizons()
> to break the assumption/dependency within vacuumlazy.c?

Makes sense.

On Sat, Jan 06, 2024 at 01:41:23PM -0800, Peter Geoghegan wrote:
> What do you think of the idea of adding a defensive "can't happen"
> error to lazy_scan_prune() that will catch DEAD or RECENTLY_DEAD
> tuples with storage that still contain an xmax < OldestXmin? This
> probably won't catch every possible problem, but I suspect it'll work
> well enough.

So before the "goto retry", ERROR if the tuple header suggests this mismatch
is happening? That, or limiting the retry count, could help.

In response to

Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() at 2024-01-06 21:41:23 from Peter Geoghegan

Responses

Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() at 2024-01-08 17:02:01 from Peter Geoghegan
relfrozenxid may disagree with row XIDs after 1ccc1e05ae at 2024-03-04 00:07:17 from Noah Misch

Browse pgsql-bugs by date

	From	Date	Subject
Next Message	Alexander Korotkov	2024-01-07 07:31:44	Re: BUG #16925: ERROR: invalid DSA memory alloc request size 1073741824 CONTEXT: parallel worker
Previous Message	PG Bug reporting form	2024-01-06 22:20:36	BUG #18274: Error 'invalid XML content'