Re: subscriptionCheck failures on nightjar

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: subscriptionCheck failures on nightjar
Date: 2019-08-26 13:29:04
Message-ID: 20190826132904.3ayuw36qzl2c4ktr@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Aug 13, 2019 at 05:04:35PM +0900, Michael Paquier wrote:
>On Wed, Feb 13, 2019 at 01:51:47PM -0800, Andres Freund wrote:
>> I'm not yet sure that that's actually something that's supposed to
>> happen, I got to spend some time analysing how this actually
>> happens. Normally the contents of the slot should actually prevent it
>> from being removed (as they're newer than
>> ReplicationSlotsComputeLogicalRestartLSN()). I kind of wonder if that's
>> a bug in the drop logic in newer releases.
>
>In the same context, could it be a consequence of 9915de6c which has
>introduced a conditional variable to control slot operations? This
>could have exposed more easily a pre-existing race condition.
>--

This is one of the remaining open items, and we don't seem to be moving
forward with it :-(

I'm willing to take a stab at it, but to do that I need a way to
reproduce it. Tom, you mentioned you've managed to reproduce it in a
qemu instance, but that it took some fiddling with qemu parmeters or
something. Can you share what exactly was necessary?

An observation about the issue - while we started to notice this after
Decemeber, that's mostly because the PANIC patch went it shortly before.
We've however seen the issue before, as Thomas Munro mentioned in [1].

Those reports are from August, so it's quite possible something in the
first CF upset the code. And there's only a single commit in 2018-07
that seems related to logical decoding / snapshots [2], i.e. f49a80c:

commit f49a80c481f74fa81407dce8e51dea6956cb64f8
Author: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Date: Tue Jun 26 16:38:34 2018 -0400

Fix "base" snapshot handling in logical decoding

...

The other reason to suspect this is related is that the fix also made it
to REL_11_STABLE at that time, and if you check the buildfarm data [3],
you'll see 11 fails on nightjar too, from time to time.

This means it's not a 12+ only issue, it's a live issue on 11. I don't
know if f49a80c is the culprit, or if it simply uncovered a pre-existing
bug (e.g. due to timing).

[1] https://www.postgresql.org/message-id/CAEepm%3D0wB7vgztC5sg2nmJ-H3bnrBT5GQfhUzP%2BFfq-WT3g8VA%40mail.gmail.com

[2] https://commitfest.postgresql.org/18/1650/

[3] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=nightjar&br=REL_11_STABLE

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Kukushkin 2019-08-26 13:42:46 Re: Statement timeout in pg_rewind
Previous Message Asif Rehman 2019-08-26 12:53:06 Re: pgbench - allow to create partitioned tables