Re: Intermittent buildfarm failures on wrasse

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Noah Misch <noah(at)leadboat(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, David Rowley <dgrowleyml(at)gmail(dot)com>
Subject: Re: Intermittent buildfarm failures on wrasse
Date: 2022-04-15 16:16:40
Message-ID: 1648548.1650039400@sss.pgh.pa.us
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> Off for a bit, but I realized that we likely don't exclude the launcher because it's not database associated...

Yeah. I think this bit in ComputeXidHorizons needs rethinking:

    /*
     * Normally queries in other databases are ignored for anything but
     * the shared horizon. ...
     */
    if (in_recovery ||
        MyDatabaseId == InvalidOid || proc->databaseId == MyDatabaseId ||
        proc->databaseId == 0)      /* always include WalSender */
    {

The "proc->databaseId == 0" business is apparently meant to include only
walsender processes, but it's broken because that condition matches any
process that isn't attached to a database, such as the launcher.

At this point we have the following conclusions:

1. A slow transaction in the launcher's initial get_database_list()
call fully explains these failures. (I had been thinking that the
launcher's xact would have to persist as far as the create_index
script, but that's not so: it only has to last until test_setup
begins vacuuming tenk1. The CREATE INDEX steps are not doing any
visibility map changes of their own, but what they are doing is
updating relallvisible from the results of visibilitymap_count().
That's why they undid the effects of manually poking relallvisible,
without actually inserting any data better than what the initial
VACUUM computed.)

2. We can probably explain why only wrasse sees this as some quirk
of the Solaris scheduler. I'm satisfied to blame
it-happens-in-installcheck-but-not-check on that too.

3. It remains unclear why we suddenly started seeing this last week.
I suppose it has to be a side-effect of the pgstats changes, but
the mechanism is obscure. Probably not worth the effort to pin
down exactly why.

As for fixing it, what I think would be the preferable answer is to
fix the above-quoted logic so that it indeed includes only walsenders
and not random other background workers. (Why does it need to include
walsenders, anyway? The commentary sucks.) Alternatively, or perhaps
also, we could do what was discussed previously and make a hack to
allow delaying vacuum until the system is quiescent.

regards, tom lane
