Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: michael(at)paquier(dot)xyz
Cc: tomas(dot)vondra(at)2ndquadrant(dot)com, thunder1(at)126(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node
Date: 2019-09-24 03:46:19
Message-ID: 20190924.124619.248088532.horikyota.ntt@gmail.com
Lists: pgsql-hackers

Hello.

At Tue, 24 Sep 2019 10:40:19 +0900, Michael Paquier <michael(at)paquier(dot)xyz> wrote in <20190924014019(dot)GB2012(at)paquier(dot)xyz>
> On Mon, Sep 23, 2019 at 01:45:14PM +0200, Tomas Vondra wrote:
> > On Mon, Sep 23, 2019 at 03:48:50PM +0800, Thunder wrote:
> >> Is this an issue?
> >> Can we fix like this?
> >> Thanks!
> >>
> >
> > I do think it is a valid issue. No opinion on the fix yet, though.
> > The report was sent on saturday, so patience ;-)
>
> And for some others it was even a longer weekend. Anyway, the problem
> can be reproduced if you apply the attached which introduces a failure
> point, and then if you run the following commands:
> create table aa as select 1;
> delete from aa;
> \! touch /tmp/truncate_flag
> vacuum aa;
> \! rm /tmp/truncate_flag
> vacuum aa; -- panic on standby
>
> This also points out that there are other things to worry about than
> interruptions, as for example DropRelFileNodeLocalBuffers() could lead
> to an ERROR, and this happens before the physical truncation is done
> but after the WAL record is replayed on the standby, so any failures
> happening at the truncation phase before the work is done would be a

Indeed.

> problem. However we are talking about failures which should not
> happen and these are elog() calls. It would be tempting to add a
> critical section here, but we could still have problems if we have a
> failure after the WAL record has been flushed, which means that it
> would be replayed on the standby, and the surrounding comments are

Agreed.

> clear about that. In short, as a matter of safety I'd like to think
> that what you are suggesting is rather acceptable (aka hold interrupts
> before the WAL record is written and release after the physical
> truncate), so as truncation avoids failures possible to avoid.
>
> Do others have thoughts to share on the matter?

I agree with the concept, but does the patch work as described? It
seems that a query cancel doesn't fire during the held-off section,
since there is no CHECK_FOR_INTERRUPTS() there.
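
To make the concern concrete, here is a small standalone model, not
PostgreSQL source. It assumes only the general behaviour that a cancel
request sets a flag and is acted on at CHECK_FOR_INTERRUPTS(), and that
HOLD_INTERRUPTS()/RESUME_INTERRUPTS() maintain a holdoff counter. All
lower-case names (holdoff_count, cancel_pending, truncate_section, ...)
are hypothetical stand-ins made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for backend state; these are NOT the real globals. */
static int  holdoff_count = 0;      /* models InterruptHoldoffCount */
static bool cancel_pending = false; /* models a pending query cancel */
static int  cancels_processed = 0;  /* cancels actually acted upon */

/* Models HOLD_INTERRUPTS(): defer interrupt processing. */
static void hold_interrupts(void)
{
    holdoff_count++;
}

/* Models RESUME_INTERRUPTS(): must pair with an earlier hold. */
static void resume_interrupts(void)
{
    assert(holdoff_count > 0);
    holdoff_count--;
}

/* Models a signal handler: it only sets a flag. */
static void request_cancel(void)
{
    cancel_pending = true;
}

/* Models CHECK_FOR_INTERRUPTS(): acts only when not held off. */
static void check_for_interrupts(void)
{
    if (cancel_pending && holdoff_count == 0)
    {
        cancel_pending = false;
        cancels_processed++;
    }
}

/*
 * Hypothetical truncation section. A cancel arrives between the
 * "WAL record" step and the "physical truncate" step. Returns true
 * if the cancel was acted upon between those two steps.
 */
static bool truncate_section(bool use_holdoff, bool section_checks)
{
    int  before = cancels_processed;
    bool interrupted_midway;

    if (use_holdoff)
        hold_interrupts();

    /* step 1: the WAL record would be written here */
    request_cancel();
    if (section_checks)
        check_for_interrupts(); /* some callee reaches a check */
    interrupted_midway = (cancels_processed != before);
    /* step 2: the physical truncate would happen here */

    if (use_holdoff)
        resume_interrupts();
    check_for_interrupts();     /* first check after the section */
    return interrupted_midway;
}
```

In this model, truncate_section(false, false) is never interrupted
midway, so the holdoff changes nothing unless some code in the section
reaches a check; truncate_section(false, true) is interrupted, and
truncate_section(true, true) is not.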

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
