Re: pgsql: Test replay of regression tests, attempt II.

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <tmunro(at)postgresql(dot)org>, pgsql-committers <pgsql-committers(at)lists(dot)postgresql(dot)org>
Subject: Re: pgsql: Test replay of regression tests, attempt II.
Date: 2022-01-20 04:23:30
Message-ID: CA+hUKG+nHX+NNjm-ig0zWLxeMiivH8omey5Onfhnxzh6g524Cg@mail.gmail.com
Lists: pgsql-committers

On Wed, Jan 19, 2022 at 12:08 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2022-01-18 17:19:06 -0500, Tom Lane wrote:
> > Andres Freund <andres(at)anarazel(dot)de> writes:
> > > That's an extremely small shared_buffers for running the regression tests, it'd not
> > > be surprising if that provoked problems we don't otherwise see. Perhaps VACUUM
> > > ends up skipping over a page because of page contention?
> >
> > Hmm, good thought. I tried running the test with even smaller
> > shared_buffers, but could not make the reloptions test fall over for
> > me. But this theory implies a strong timing dependency, so it might
> > still only happen on particular machines. (If anyone else tries it:
> > below about 400kB, other tests start failing with "no free unpinned
> > buffers" and the like.)
>
> I ran the test in a loop for 200+ times now, without reproducing the
> problem. Rorqual runs on a shared machine though, so it's quite possible that
> IO will be slower, and thus triggering the issue.
>
> I was wondering whether we could use VACUUM VERBOSE for that specific VACUUM -
> that'd show information about the number of pages with tuples etc. But I don't
> currently see a way of that causing the regression tests to fail.
>
> Even if I set client_min_messages=error, the messages still get sent to the
> client, because elevel == INFO is special cased in
> should_output_to_client(). And I don't see a way of redirecting the output of
> common.c:NoticeProcessor() in psql either.

I hacked a branch thusly:

@@ -327,6 +327,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
     verbose = (params->options & VACOPT_VERBOSE) != 0;
     instrument = (verbose || (IsAutoVacuumWorkerProcess() &&
                               params->log_min_duration >= 0));
+    instrument = true;
     if (instrument)
     {
         pg_rusage_init(&ru0);

Having failed to reproduce this locally, I clicked on "re-run tests"
all afternoon on CI until eventually I captured a failure log[1]
there, with the smoking gun:

pages: 0 removed, 1 remain, 1 skipped due to pins, 0 skipped frozen

There are three places that skip and bump that counter, but two of
them were disabled when I added DISABLE_PAGE_SKIPPING, leaving this
one:

    LockBuffer(buf, BUFFER_LOCK_SHARE);
    if (!lazy_check_needs_freeze(buf, &hastup, vacrel))
    {
        UnlockReleaseBuffer(buf);
        vacrel->scanned_pages++;
        vacrel->pinskipped_pages++;
        if (hastup)
            vacrel->nonempty_pages = blkno + 1;
        continue;
    }

This block runs when we fail to conditionally acquire the cleanup lock:
since the page doesn't require wraparound vacuuming, we skip it rather
than wait for the lock.

[1] https://api.cirrus-ci.com/v1/artifact/task/5096848598761472/log/src/test/recovery/tmp_check/log/027_stream_regress_primary.log
