Re: Race condition in crash-recovery tests

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, mikael(dot)kjellstrom(at)gmail(dot)com
Subject: Re: Race condition in crash-recovery tests
Date: 2019-01-27 02:29:37
Message-ID: 20190127022937.nvocrvsok7nlp4vt@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2019-01-26 20:53:48 -0500, Tom Lane wrote:
> Recently, buildfarm member curculio has started to show a semi-repeatable
> failure in src/test/recovery/t/013_crash_restart.pl:
>
> # aborting wait: program died
> # stream contents: >>psql:<stdin>:8: no connection to the server
> # psql:<stdin>:8: connection to server was lost
> # <<
> # pattern searched for: (?^m:server closed the connection unexpectedly)
>
> # Failed test 'psql query died successfully after SIGKILL'
> # at t/013_crash_restart.pl line 198.
>
> The message this test is looking for is what libpq reports upon getting
> EOF or ECONNRESET from a socket read attempt. The message it's actually
> seeing is what libpq reports if it notices that the PQconn is *already*
> in CONNECTION_BAD state when it's trying to send a new query.
>
> I have no idea why we're seeing this in only one buildfarm member
> and only for the past week or so, as it doesn't appear that any
> related code has changed for months. (Perhaps something changed
> about curculio's host?)

I have no idea why it's just curculio, but I think I know why it only
started recently: curculio doesn't appear to have had TAP tests enabled
before
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2019-01-17%2021%3A30%3A02

> just change the test script to accept either message as a successful
> result. I think that 4247db625 made such races more likely, but I
> don't believe it was impossible before.

Sounds right to me - do you want to do the honors or shall I?
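
For illustration only, here is a minimal sketch of the "accept either message" idea. It is written in Python rather than the Perl of 013_crash_restart.pl, so it is not the actual test change; the function name psql_died_as_expected is hypothetical, and the two message strings are the ones quoted above.

```python
import re

# Either libpq message spelling counts as "psql died as expected":
#  - EOF/ECONNRESET on a socket read attempt
#  - connection already in CONNECTION_BAD state when sending a new query
EITHER_MSG = re.compile(
    r"server closed the connection unexpectedly"
    r"|no connection to the server",
    re.MULTILINE,
)


def psql_died_as_expected(stderr_text: str) -> bool:
    """Return True if the captured psql stderr contains either message."""
    return EITHER_MSG.search(stderr_text) is not None


if __name__ == "__main__":
    # Stream contents as reported in the curculio failure above.
    seen = (
        "psql:<stdin>:8: no connection to the server\n"
        "psql:<stdin>:8: connection to server was lost\n"
    )
    print(psql_died_as_expected(seen))
```

With a single alternation the test no longer depends on which side of the race libpq happens to land on.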

> Another idea is to change libpq so that both these cases emit identical
> messages, but I don't really feel that that'd be an improvement. Also,
> since 4247db625 was back-patched, we'd have to back-patch the message
> change as well, which I like even less. People might be relying on
> seeing either message spelling in some situations.

Yeah, I don't think that's the way to go.

Greetings,

Andres Freund
