Re: subscriptionCheck failures on nightjar

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: subscriptionCheck failures on nightjar
Date: 2019-09-20 21:49:27
Message-ID: 2636.1569016167@sss.pgh.pa.us
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> On 2019-09-20 16:25:21 -0400, Tom Lane wrote:
>> I recreated my freebsd-9-under-qemu setup and I can still reproduce
>> the problem, though not with high reliability (order of 1 time in 10).
>> Anything particular you want logged?

> A DEBUG2 log would help a fair bit, because it'd log some information
> about what changes the "horizons" determining when data may be removed.

Actually, what I did was as attached [1], and I am getting traces like
[2]. The problem seems to occur only when there are two or three
processes concurrently creating the same snapshot file. It's not
obvious from the debug trace, but the snapshot file *does* exist
after the music stops.

It is very hard to look at this trace and conclude anything other
than "rename(2) is broken, it's not atomic". Nothing in our code
has deleted the file: no checkpoint has started, nor do we see
the DEBUG1 output that CheckPointSnapBuild ought to produce.
But fsync_fname momentarily can't see it (and then later another
process does see it).
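
To make the pattern under suspicion concrete, here is a minimal Python
sketch (not PostgreSQL's actual code) of several workers each writing a
private temp file, renaming it over the same final name, and then
reopening the final name to fsync it, which is roughly the step where
fsync_fname trips. The names write_snapshot and stress, the file name
"0-1.snap", and the temp-file naming scheme are all made up for
illustration. If rename(2) is atomic, the reopen should never fail
with ENOENT.

```python
import os
import tempfile
import threading


def write_snapshot(directory, name, payload, worker_id):
    """Write payload to a per-worker temp file, rename it into place,
    then reopen the final name and fsync it.

    Returns False if the final name is not visible right after the
    rename, which should be impossible when rename(2) is atomic.
    """
    final_path = os.path.join(directory, name)
    tmp_path = f"{final_path}.{worker_id}.tmp"  # hypothetical naming scheme

    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)
    finally:
        os.close(fd)

    # rename(2): the destination is replaced in one step, so other
    # processes should see either the old file or the new one.
    os.rename(tmp_path, final_path)

    # Rough analogue of the fsync-by-name step: open the final path again.
    try:
        fd = os.open(final_path, os.O_RDWR)
    except FileNotFoundError:
        return False
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
    return True


def stress(workers=3, rounds=100):
    """Run several workers that all create the same snapshot file and
    count how often the file was momentarily missing."""
    misses = 0
    lock = threading.Lock()
    with tempfile.TemporaryDirectory() as directory:

        def run(worker_id):
            nonlocal misses
            for _ in range(rounds):
                if not write_snapshot(directory, "0-1.snap", b"snapshot", worker_id):
                    with lock:
                        misses += 1

        threads = [threading.Thread(target=run, args=(w,)) for w in range(workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return misses


if __name__ == "__main__":
    print("momentarily missing:", stress())
```

On a filesystem with a correct rename(2), the count should stay at
zero; a nonzero count would match the symptom described above. Threads
are used here only to keep the sketch short, whereas the real case
involves separate backend processes.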

It is now apparent why we're only seeing this on specific ancient
platforms. I looked around for info about rename(2) not being
atomic, and I found this info about FreeBSD:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=94849

The reported symptom there isn't quite the same, so probably there
is another issue, but there is plenty of reason to be suspicious
that UFS rename(2) is buggy in this release. As for dromedary's
ancient version of macOS, Apple is exceedingly untransparent about
their bugs, but I found

http://www.weirdnet.nl/apple/rename.html

In short, what we have here are OS bugs that were probably
fixed years ago.

The question is what to do next. Should we just retire these
specific buildfarm critters, or do we want to push ahead with
getting rid of the PANIC here?

regards, tom lane
