Re: Race conditions with checkpointer and shutdown

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Ashwin Agrawal <aagrawal(at)pivotal(dot)io>, Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Race conditions with checkpointer and shutdown
Date: 2019-06-12 17:42:01
Message-ID: 20190612174201.GA14038@alvherre.pgsql
Lists: pgsql-hackers

On 2019-Apr-29, Tom Lane wrote:

> Ashwin Agrawal <aagrawal(at)pivotal(dot)io> writes:
> > On Mon, Apr 29, 2019 at 10:36 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> Can you try applying a1a789eb5ac894b4ca4b7742f2dc2d9602116e46
> >> to see if it fixes the problem for you?
>
> > Yes, will give it a try on greenplum and report back the result.
>
> > Have we decided if this will be applied to back branches?

Hi Ashwin, did you have the chance to try this out?

> My feeling about it is "maybe eventually, but most definitely not
> the week before a set of minor releases". Some positive experience
> with Greenplum would help increase confidence in the patch, for sure.

I looked at the buildfarm failures for the recoveryCheck stage. It
looks like there has been only one failure on the master branch since
this commit, reported by chipmunk:

# poll_query_until timed out executing this query:
# SELECT application_name, sync_priority, sync_state FROM pg_stat_replication ORDER BY application_name;
# expecting this output:
# standby1|1|sync
# standby2|2|sync
# standby3|2|potential
# standby4|2|potential
# last actual query output:
# standby1|1|sync
# standby2|2|potential
# standby3|2|sync
# standby4|2|potential
# with stderr:
not ok 6 - asterisk comes before another standby name

# Failed test 'asterisk comes before another standby name'
# at t/007_sync_rep.pl line 26.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-05-12%2020%3A37%3A11
AFAICS this is wholly unrelated to the problem at hand.

No other animal failed the recoveryCheck test. Before the commit, the
failure was not terribly frequent, but rarely would 10 days go by
without it failing. So I suggest that the bug has indeed been fixed.

Maybe now's a good time to get it back-patched? In branch
REL_11_STABLE, it failed as recently as 11 days ago in gull,
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gull&dt=2019-06-01%2004%3A11%3A36

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
