Re: recent failures on lorikeet

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: recent failures on lorikeet
Date: 2021-06-14 16:33:18
Message-ID: ce94774f-0583-7be2-8ec3-2bb161b959fd@dunslane.net
Lists: pgsql-hackers


On 6/14/21 9:39 AM, Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>> I've been looking at the recent spate of intermittent failures on my
>> Cygwin animal lorikeet. Most of them look something like this, where
>> there's a "VACUUM FULL pg_class" and an almost simultaneous "CREATE TABLE"
>> which fails.
> Do you have any idea what "exit code 127" signifies on that platform?
> (BTW, not all of them look like that; many are reported as plain
> segfaults.) I hadn't spotted the association with a concurrent "VACUUM
> FULL pg_class" before, that does seem interesting.
>
>> Getting stack traces on this platform can be very difficult. I'm going
>> to try forcing complete serialization of the regression tests
>> (MAX_CONNECTIONS=1) to see if the problem goes away. Any other
>> suggestions might be useful. Note that we're not getting the same issue
>> on REL_13_STABLE, where the same group of tests run together (inherit,
>> typed_table, vacuum).
> If it does go away, that'd be interesting, but I don't see how it gets
> us any closer to a fix. Seems like a stack trace is a necessity to
> narrow it down.
>
>

Some of the failures have produced stack traces and some have not; I'm not
sure why. The one from June 13 has this:

---- backtrace ----
??
??:0
WaitOnLock
src/backend/storage/lmgr/lock.c:1831
LockAcquireExtended
src/backend/storage/lmgr/lock.c:1119
LockRelationOid
src/backend/storage/lmgr/lmgr.c:135
relation_open
src/backend/access/common/relation.c:59
table_open
src/backend/access/table/table.c:43
ScanPgRelation
src/backend/utils/cache/relcache.c:322
RelationBuildDesc
src/backend/utils/cache/relcache.c:1039
RelationIdGetRelation
src/backend/utils/cache/relcache.c:2045
relation_open
src/backend/access/common/relation.c:59
table_open
src/backend/access/table/table.c:43
ExecInitPartitionInfo
src/backend/executor/execPartition.c:510
ExecPrepareTupleRouting
src/backend/executor/nodeModifyTable.c:2311
ExecModifyTable
src/backend/executor/nodeModifyTable.c:2559
ExecutePlan
src/backend/executor/execMain.c:1557

The line in lock.c (in WaitOnLock) is where the process title gets changed
to "waiting". I recently stopped setting the process title on this animal on
REL_13_STABLE, and similar errors there have largely gone away. I can do the
same on HEAD. But it does make me wonder what the heck has changed to make
this code fragile.
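
For reference, a minimal sketch of turning off process title updates, assuming
it is done through the update_process_title setting in postgresql.conf (the
animal's actual configuration is not shown in this message):

```
# Assumption: placed in the postgresql.conf used by the test instance.
update_process_title = off
```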

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
