Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Date: 2022-02-19 23:08:41
Message-ID: CAH2-WzkiB-qcsBmWrpzP0nxvrQExoUts1d7TYShg_DrkOHeg4Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Feb 18, 2022 at 5:00 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Another testing strategy occurs to me: we could stress-test the
> implementation by simulating an environment where the no-cleanup-lock
> path is hit an unusually large number of times, possibly a fixed
> percentage of the time (like 1%, 5%), say by making vacuumlazy.c's
> ConditionalLockBufferForCleanup() call return false randomly. Now that
> we have lazy_scan_noprune for the no-cleanup-lock path (which is as
> similar to the regular lazy_scan_prune path as possible), I wouldn't
> expect this ConditionalLockBufferForCleanup() testing gizmo to be too
> disruptive.

I tried this out, using the attached patch. It was quite interesting,
even when run against HEAD. I think that I might have found a bug on
HEAD, though I'm not really sure.

If you modify the patch to simulate conditions under which
ConditionalLockBufferForCleanup() fails about 2% of the time, you get
much better coverage of lazy_scan_noprune/heap_tuple_needs_freeze,
without it being so aggressive as to make "make check-world" fail --
which is exactly what I expected. If you are much more aggressive
about it, and make it 50% instead (which you can get just by using the
patch as written), then some tests will fail, mostly for reasons that
aren't surprising or interesting (e.g. plan changes). This is also
what I'd have guessed would happen.

However, it gets more interesting. One thing that I did not expect to
happen at all also happened (with the current 50% rate of simulated
ConditionalLockBufferForCleanup() failure from the patch): if I run
"make check" from the pg_surgery directory, then the Postgres backend
gets stuck in an infinite loop inside lazy_scan_prune, which has been
a symptom of several tricky bugs in the past year (not every time, but
usually). Specifically, the VACUUM statement launched by the SQL
command "vacuum freeze htab2;" from the file
contrib/pg_surgery/sql/heap_surgery.sql, at line 54 leads to this
misbehavior.

This is a temp table, which is a choice made by the tests specifically
because they need to "use a temp table so that vacuum behavior doesn't
depend on global xmin". This is convenient way of avoiding spurious
regression tests failures (e.g. from autoanalyze), and relies on the
GlobalVisTempRels behavior established by Andres' 2020 bugfix commit
94bc27b5.

It's quite possible that this is nothing more than a bug in my
adversarial gizmo patch -- since I don't think that
ConditionalLockBufferForCleanup() can ever fail with a temp buffer
(though even that's not completely clear right now). Even if the
behavior that I saw does not indicate a bug on HEAD, it still seems
informative. At the very least, it wouldn't hurt to Assert() that the
target table isn't a temp table inside lazy_scan_noprune, documenting
our assumptions around temp tables and
ConditionalLockBufferForCleanup().

I haven't actually tried to debug the issue just yet, so take all this
with a grain of salt.

--
Peter Geoghegan

Attachment Content-Type Size
0001-Add-adversarial-ConditionalLockBufferForCleanup-gizm.txt text/plain 2.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-02-19 23:35:18 Re: initdb / bootstrap design
Previous Message Andres Freund 2022-02-19 21:54:46 Re: killing perl2host