Re: Race conditions with checkpointer and shutdown

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Race conditions with checkpointer and shutdown
Date: 2019-04-18 21:57:39
Message-ID: 28461.1555624659@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote (in the other thread):
> Any idea whether it's something newly-introduced or of long standing?

It's the latter. I searched the buildfarm database for failure logs
including the string "server does not shut down" within the last three
years, and got all of the hits attached. Not all of these look like
the failure pattern Michael pointed to, but enough of them do to say
that the problem has existed since at least mid-2017. To be concrete,
we have quite a sample of cases where a standby server has received a
"fast shutdown" signal and acknowledged that in its log, but it never
gets to the expected "shutting down" message, meaning it never starts
the shutdown checkpoint let alone finishes it. The oldest case that
clearly looks like that is

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&dt=2017-06-02%2018%3A54%3A29
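
For illustration only, the pattern described above (shutdown request acknowledged, but no later "shutting down" line) could be checked in a server log roughly as follows. This is a minimal sketch, not the query that was run against the buildfarm database; the function name and the exact log-message patterns are assumptions and may need adjusting for a given log format.

```python
import re

# Assumed message patterns; adjust them to match the actual log format.
FAST_SHUTDOWN = re.compile(r"received fast shutdown request")
SHUTTING_DOWN = re.compile(r"LOG:\s+shutting down\s*$")


def never_started_shutdown_checkpoint(log_text):
    """Return True if the last fast shutdown request in the log was
    acknowledged but never followed by a "shutting down" message."""
    pending = False
    for line in log_text.splitlines():
        if FAST_SHUTDOWN.search(line):
            # The server acknowledged the shutdown signal.
            pending = True
        elif pending and SHUTTING_DOWN.search(line):
            # The shutdown checkpoint was at least started.
            pending = False
    return pending


if __name__ == "__main__":
    hung = (
        "LOG:  received fast shutdown request\n"
        "LOG:  aborting any active transactions\n"
    )
    clean = hung + "LOG:  shutting down\nLOG:  database system is shut down\n"
    print(never_started_shutdown_checkpoint(hung))
    print(never_started_shutdown_checkpoint(clean))
```

A check like this only flags candidate logs; each hit would still need to be read by hand, since not every "server does not shut down" failure fits this pattern.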

A significant majority of the recent cases look just like the piculet
failure Michael pointed to; that is, we fail to shut down the "london"
server while it's acting as standby in the recovery/t/009_twophase.pl
test. But there are very similar failures in other tests.

I also notice that the population of machines showing the problem seems
heavily skewed towards, um, weird cases. For instance, in the set
that have shown this type of failure since January, we have

dragonet: uses JIT
francolin: --disable-spinlocks
gull: armv7
mereswine: armv7
piculet: --disable-atomics
sidewinder: amd64, but running netbsd 7 (and this was 9.6, note)
spurfowl: fairly generic amd64

This leads me to suspect that the problem is (a) some very low-level issue
in spinlocks or latches or the like, or (b) a timing problem that just
doesn't show up on generic Intel-oid platforms. The timing theory is
maybe a bit stronger given that one test case shows this more often than
others. I've not got any clear ideas beyond that.

Anyway, this is *not* new in v12.

regards, tom lane

Attachment Content-Type Size
server-does-not-shut-down.txt text/plain 17.6 KB
