Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.
Date: 2016-02-11 15:42:23
Message-ID: 14581.1455205343@sss.pgh.pa.us
Lists: pgsql-committers pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> That would be great. Taking a look at what happened, I have a feeling
>> this may be a race condition of some kind in the isolation tester. It
>> seems to have failed to recognize that a1 started waiting, and that
>> caused the "deadlock detected" message to be reported differently. I'm
>> not immediately sure what to do about that.

> Yeah, so: try_complete_step() waits 10ms, and if it still hasn't
> gotten any data back from the server, then it uses a separate query to
> see whether the step in question is waiting on a lock. So what
> must've happened here is that it took more than 10ms for the process
> to show up as waiting in pg_stat_activity.

No, because the machines that are failing are showing a "<waiting ...>"
annotation that your reference output *doesn't* have. I think what is
actually happening is that these machines are seeing the process as
waiting and reporting it, whereas on your machine the backend detects
the deadlock and completes the query (with an error) before
isolationtester realizes that the process is waiting.

It would probably help if you didn't do this:

setup { BEGIN; SET deadlock_timeout = '10ms'; }

which pretty much guarantees that there is a race condition: you've set it
so that the deadlock detector will run at approximately the same time that
isolationtester is probing the state, since try_complete_step() also waits
10ms before checking. I'm surprised that it seemed to act consistently for
you. I would suggest setting deadlock_timeout to 100s in all the other
sessions and to ~5s in the one you want to fail. That will mean that the
"<waiting ...>" output should show up pretty reliably even on overloaded
buildfarm critters.
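
A minimal sketch of how the suggested timeouts might look in a spec file.
The table names, session names, step names, and lock statements here are
hypothetical and are not taken from the committed test; only the two
deadlock_timeout values come from the suggestion above.

```
# Hypothetical two-session deadlock; every name here is illustrative only.
setup
{
  CREATE TABLE a1 ();
  CREATE TABLE a2 ();
}

teardown
{
  DROP TABLE a1, a2;
}

# Session that should survive: its deadlock check is effectively disabled.
session "d1"
setup		{ BEGIN; SET deadlock_timeout = '100s'; }
step "d1a1"	{ LOCK TABLE a1 IN ACCESS EXCLUSIVE MODE; }
step "d1a2"	{ LOCK TABLE a2 IN ACCESS EXCLUSIVE MODE; }
step "d1c"	{ COMMIT; }

# Session that should fail: it runs the deadlock check after about 5s,
# long after isolationtester has had a chance to see it waiting.
session "d2"
setup		{ BEGIN; SET deadlock_timeout = '5s'; }
step "d2a2"	{ LOCK TABLE a2 IN ACCESS EXCLUSIVE MODE; }
step "d2a1"	{ LOCK TABLE a1 IN ACCESS EXCLUSIVE MODE; }
step "d2c"	{ COMMIT; }

permutation "d1a1" "d2a2" "d1a2" "d2a1" "d1c" "d2c"
```

With settings like these, only one session can plausibly report the
deadlock, and the gap between the 10ms probe and the 5s timeout is wide
enough that the waiting state should be observed first, at the cost of a
test that takes a few seconds longer to run.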

regards, tom lane
