From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Recent failures in IsolationCheck deadlock-hard
Date: 2019-08-14 02:34:23
Message-ID: CA+hUKG+hHKG2eXrz6eW2Ccx0YV8gDMs-YPK_oS34vNHd+8BxGg@mail.gmail.com

On Tue, Aug 6, 2019 at 6:18 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> > On Sat, Aug 3, 2019 at 2:11 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> Also worth noting is that anole failed its first try at the new
> >> deadlock-parallel isolation test:
> >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2019-08-01%2015%3A48%3A16
>
> > And friarbird (also CLOBBER_CACHE_ALWAYS) fails every time.
>
> Yeah, there have been half a dozen failures since deadlock-parallel
> went in, mostly on critters that are slowed by CLOBBER_CACHE_ALWAYS
> or valgrind. I've tried repeatedly to reproduce that here, without
> success :-(. It's unclear whether the failures represent a real
> code bug or just a problem in the test case, so I don't really want
> to speculate about fixes till I can reproduce it.

I managed to reproduce a failure that looks a lot like lousyjack's
(note that there are two slightly different failure modes):

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2019-08-05%2011:33:02

I did that by changing the deadlock_timeout values for sessions d1 and
d2 to just a few milliseconds on my slowest computer, guessing that
this might be a race involving the deadlock timeout and the time it
takes for workers to fork and join a lock queue.  While deadlock.c
with DEBUG_DEADLOCK defined normally prints out something like this
during this test:

DeadLockCheck: lock 0x80a2812d0 queue 33087 33088 33089 33090 33091
rearranged to: lock 0x80a2812d0 queue 33091 33090 33089 33088 33087

... when it failed the same way lousyjack did, my run printed out:

DeadLockCheck: lock 0x80a2721f8 queue 33108 33114
rearranged to: lock 0x80a2721f8 queue 33114 33108

... and then it hung for a while, so I could inspect the lock table
and see that PID 33108 was e1l (not granted), and PID 33114 was gone
but was almost certainly the first worker for d2a1 (I can tell because
33110-33113 are the workers for d1a2 and are still waiting, and d2a1's
first worker should have had the next sequential PID on my OS).
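
In case it helps to see what those two lines are reporting, here is a
throwaway sketch in plain C (NOT the real deadlock.c code; the names,
the LOCK stand-in and the hard-coded swap are made up to mimic the
failing run above): the waiters on one lock are printed in their
current queue order, and then again after the checker has rewritten
the queue into an ordering it computed to be deadlock-free.

/*
 * Throwaway illustration, NOT the real deadlock.c code: the names, the
 * LOCK stand-in and the "rearrangement" are invented just to show what
 * the two DEBUG_DEADLOCK lines represent -- the waiters on one lock in
 * their current queue order, then again after the checker has rewritten
 * the queue into an order it computed to be deadlock-free.
 */
#include <stdio.h>

static void
print_queue(const char *label, void *lock, const int *pids, int npids)
{
	int			i;

	printf("%s lock %p queue", label, lock);
	for (i = 0; i < npids; i++)
		printf(" %d", pids[i]);
	printf("\n");
}

int
main(void)
{
	int			dummy_lock;			/* stands in for the shared LOCK */
	int			queue[] = {33108, 33114};	/* waiters found at check time */
	int			tmp;

	print_queue("DeadLockCheck:", &dummy_lock, queue, 2);

	/* pretend the computed safe ordering is the reverse, as in my run */
	tmp = queue[0];
	queue[0] = queue[1];
	queue[1] = tmp;

	print_queue("rearranged to:", &dummy_lock, queue, 2);
	return 0;
}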

Another thing I noticed is that all 4 times I managed to reproduce
this, the "rearranged to" queue had only two entries; I can understand
that d1's workers might not feature yet due to bad timing, but it's
not clear to me why there should always be only one d2a1 worker and
not more.  I don't have time to study this further today and I might
be way off, but my first guess is that, in theory, we need a way to
make sure the d1-e2 edge exists before d2's deadlock timer expires,
no?  That's pretty tricky though, so maybe we just need to crank the
timeouts up.
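
To make that timing guess concrete, here is a toy simulation (every
number in it is invented, not measured): a blocked backend's deadlock
check fires deadlock_timeout after it starts waiting, and only the
waiters that have managed to join the lock queue by then contribute
edges to the graph it examines, so a very short timeout on a slow
animal can easily run the check against a partial graph.

/*
 * Toy simulation of the suspected race; the timings below are invented,
 * not measured.  A blocked backend runs its deadlock check
 * deadlock_timeout after it blocks, and only waiters that have joined
 * the lock queue by then contribute edges to the graph it examines.
 */
#include <stdio.h>

int
main(void)
{
	/* ms after the leader blocks at which each waiter joins the queue */
	int			join_ms[] = {0, 40, 55, 70, 90};	/* leader + 4 workers */
	int			nwaiters = sizeof(join_ms) / sizeof(join_ms[0]);
	int			deadlock_timeout_ms = 5;	/* the short value I tested with */
	int			i,
				seen = 0;

	for (i = 0; i < nwaiters; i++)
		if (join_ms[i] <= deadlock_timeout_ms)
			seen++;

	printf("waiters visible when the timer fires: %d of %d\n",
		   seen, nwaiters);

	/*
	 * With a tiny timeout only the leader has joined, so the check runs
	 * on a graph missing the later edges; cranking deadlock_timeout up
	 * (or otherwise waiting for the d1-e2 edge to appear) avoids that.
	 */
	return 0;
}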

--
Thomas Munro
https://enterprisedb.com
