Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Date: 2022-03-31 02:00:18
Message-ID: 20220331020018.qht46hbua4gztmdf@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2022-03-30 17:50:42 -0700, Peter Geoghegan wrote:
> I tried several times to recreate this issue on CI. No luck with that,
> though -- can't get it to fail again after 4 attempts.

It's really annoying that we don't have Assert variants that show the compared
values, that might make it easier to intepret what's going on.

Something vaguely like EXPECT_EQ_U32 in regress.c. Maybe
AssertCmp(type, a, op, b),

Then the assertion could have been something like
AssertCmp(int32, diff, >, 0)

Does the line number in the failed run actually correspond to the xid, rather
than the mxid case? I didn't check.

You could try to increase the likelihood of reproducing the failure by
duplicating the invocation that lead to the crash a few times in the
.cirrus.yml file in your dev branch. That might allow hitting the problem more
quickly.

Maybe reduce autovacuum_naptime in src/tools/ci/pg_ci_base.conf?

Or locally - one thing that windows CI does different from the other platforms
is that it runs isolation, contrib and a bunch of other tests using the same
cluster. Which of course increases the likelihood of autovacuum having stuff
to do, *particularly* on shared relations - normally there's probably not
enough changes for that.

You can do something similar locally on linux with
make -Otarget -C contrib/ -j48 -s USE_MODULE_DB=1 installcheck prove_installcheck=true
(the prove_installcheck=true to prevent tap tests from running, we don't seem
to have another way for that)

I don't think windows uses USE_MODULE_DB=1, but it allows to cause a lot more
load concurrently than running tests serially...

> We know that this particular assertion did not fail during the same VACUUM:
>
> Assert(vacrel->NewRelfrozenXid == OldestXmin ||
> TransactionIdPrecedesOrEquals(vacrel->relfrozenxid,
> vacrel->NewRelfrozenXid));

The comment in your patch says "is either older or newer than FreezeLimit" - I
assume that's some rephrasing damage?

> So it's hard to see how this could be a bug in the patch -- the final
> new relfrozenxid is presumably equal to VACUUM's OldestXmin in the
> problem scenario seen on the CI Windows instance yesterday (that's why
> this earlier assertion didn't fail).

Perhaps it's worth commiting improved assertions on master? If this is indeed
a pre-existing bug, and we're just missing due to slightly less stringent
asserts, we could rectify that separately.

> The surprising part of the CI failure must have taken place just after
> this assertion, when VACUUM's call to vacuum_set_xid_limits() actually
> updates pg_class.relfrozenxid with vacrel->NewRelfrozenXid --
> presumably because the existing relfrozenxid appeared to be "in the
> future" when we examine it in pg_class again. We see evidence that
> this must have happened afterwards, when the closely related assertion
> (used only in instrumentation code) fails:

Hm. This triggers some vague memories. There's some oddities around shared
relations being vacuumed separately in all the databases and thus having
separate horizons.

After "remembering" that, I looked in the cirrus log for the failed run, and
the worker was processing shared a shared relation last:

2022-03-30 03:48:30.238 GMT [5984][autovacuum worker] LOG: automatic analyze of table "contrib_regression.pg_catalog.pg_authid"

Obviously that's not a guarantee that the next table processed also is a
shared catalog, but ...

Oh, the relid is actually in the stack trace. 0x4ee = 1262 =
pg_database. Which makes sense, the test ends up with a high percentage of
dead rows in pg_database, due to all the different contrib tests
creating/dropping a database.

> From my patch:
>
> > if (frozenxid_updated)
> > {
> > - diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
> > + diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
> > + Assert(diff > 0);
> > appendStringInfo(&buf,
> > _("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
> > - FreezeLimit, diff);
> > + vacrel->NewRelfrozenXid, diff);
> > }

Perhaps this ought to be an elog() instead of an Assert()? Something has gone
pear shaped if we get here... It's a bit annoying though, because it'd have to
be a PANIC to be visible on the bf / CI :(.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Julien Rouhaud 2022-03-31 02:01:44 Re: Commitfest Update
Previous Message Peter Geoghegan 2022-03-31 00:50:42 Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations