Re: Corruption during WAL replay

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: andres(at)anarazel(dot)de
Cc: tejeswarm(at)hotmail(dot)com, pgsql-hackers(at)postgresql(dot)org, hexexpert(at)comcast(dot)net
Subject: Re: Corruption during WAL replay
Date: 2020-03-31 07:36:39
Message-ID: 20200331.163639.1136080897926694557.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Mon, 30 Mar 2020 16:31:59 -0700, Andres Freund <andres(at)anarazel(dot)de> wrote in
> Hi,
>
> On 2020-03-24 18:18:12 +0900, Kyotaro Horiguchi wrote:
> > At Mon, 23 Mar 2020 20:56:59 +0000, Teja Mupparti <tejeswarm(at)hotmail(dot)com> wrote in
> > > The original bug reporting-email and the relevant discussion is here
> > ...
> > > The crux of the fix is, in the current code, engine drops the buffer and then truncates the file, but a crash before the truncate and after the buffer-drop is causing the corruption. Patch reverses the order i.e. truncate the file and drop the buffer later.
> >
> > BufferAlloc doesn't wait for the BM_IO_IN_PROGRESS for a valid buffer.
>
> I don't think that's true. For any of this to be relevant the buffer has
> to be dirty. In which case BufferAlloc() has to call
> FlushBuffer(). Which in turn does a WaitIO() if BM_IO_IN_PROGRESS is
> set.
>
> What path are you thinking of? Or alternatively, what am I missing?

# Though I may well be wrong here.

"doesn't" is overstated. Is there a case where the buffer is already
flushed by checkpoint? (If that is the case, dropping clean buffers at
marking truncate would work?)
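
To spell out the distinction I mean, here is a toy model (not the real
bufmgr.c code; the type and function names are made up for
illustration):

#include <stdbool.h>

typedef struct ToyBuffer
{
    bool    dirty;              /* BM_DIRTY in the real code */
    bool    io_in_progress;     /* BM_IO_IN_PROGRESS in the real code */
} ToyBuffer;

/*
 * Toy model of victim-buffer eviction: only a dirty buffer makes us
 * flush, and only the flush waits for in-progress I/O.  A buffer that a
 * checkpoint has already written back is clean and is reused with no
 * wait at all -- that is the case I am asking about.
 */
static void
toy_evict(ToyBuffer *buf)
{
    if (buf->dirty)
    {
        /* corresponds to FlushBuffer(), which does WaitIO() internally */
        while (buf->io_in_progress)
            ;                   /* wait for the in-flight write to finish */
        buf->dirty = false;     /* then write the page out ourselves */
    }
    /* a clean buffer falls through here without any waiting */
}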

> > I'm not sure it's acceptable to remember all to-be-deleted buffers
> > while truncation.
>
> I don't see a real problem with it. Nor really a good alternative. Note
> that for autovacuum truncations we'll only truncate a limited number of
> buffers at once, and for most relation truncations we don't enter this
> path (since we create a new relfilenode instead).

Thank you for the explanation. I agree with that.
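
For the archives, the shape I had in mind for "remembering" the
to-be-deleted buffers is roughly the following toy model (made-up names
and a toy buffer pool, not the patch's code): collect the affected
buffers first, truncate, and only then invalidate them.

#define TOY_NBUFFERS 16

static int  toy_block_of[TOY_NBUFFERS];     /* cached block number, -1 if unused */
static int  toy_file_nblocks = 100;         /* current "file" length in blocks */

static void
toy_truncate_rel(int new_nblocks)
{
    int     drop_list[TOY_NBUFFERS];
    int     ndrop = 0;

    /* 1. remember which buffers will become obsolete */
    for (int b = 0; b < TOY_NBUFFERS; b++)
        if (toy_block_of[b] >= new_nblocks)
            drop_list[ndrop++] = b;

    /* 2. shorten the "file" first */
    toy_file_nblocks = new_nblocks;

    /* 3. only now invalidate the remembered buffers, so a crash between
     *    steps 2 and 3 cannot leave dropped-but-untruncated pages behind */
    for (int i = 0; i < ndrop; i++)
        toy_block_of[drop_list[i]] = -1;
}

Since autovacuum only truncates a bounded range at a time, the list of
buffers to remember stays small, which is why I agree the memory cost
is acceptable.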

> > + /*START_CRIT_SECTION();*/
>
> > Is this a point of argument? It is not needed if we choose the
> > strategy (c) in [1], since the protocol is aiming to allow server to
> > continue running after truncation failure.
> >
> > [1]: https://www.postgresql.org/message-id/20191207001232.klidxnm756wqxvwx%40alap3.anarazel.de
>
> I think it's entirely broken to continue running after a truncation
> failure. We obviously have to first WAL log the truncation (since
> otherwise we can crash just after doing the truncation). But we cannot
> just continue running after WAL logging, but not performing the
> associated action: The most obvious reason is that otherwise a replica
> will execute the truncation, but the primary will not.

Hmm. If we allow a PANIC on truncation failure, why do we need to go
through such complicated steps? Wouldn't it be enough to enclose the
sequence (WAL insert - drop buffers - truncate) in a critical section?
I had believed that this project aimed to fix the database breakage on
truncation failure by allowing the truncation to be rolled back.
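
To be explicit about what I mean by "the sequence", the outline would be
something like this (only an outline, not the patch itself; the
arguments to the existing functions are elided):

START_CRIT_SECTION();

/* 1. WAL-log the truncation and flush it, so that replay redoes the
 *    truncation if we crash at any later point */
XLogInsert( ... );      /* XLOG_SMGR_TRUNCATE record */
XLogFlush( ... );

/* 2. drop the buffers beyond the new end of the relation */
DropRelFileNodeBuffers( ... );

/* 3. shorten the file on disk */
smgrtruncate( ... );

END_CRIT_SECTION();

Everything inside the section is failure-is-PANIC, which is why I
wonder whether anything more than this is needed.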

> The whole justification for that behaviour "It would turn a usually
> harmless failure to truncate, that might spell trouble at WAL replay,
> into a certain PANIC." was always dubious (since on-disk and in-memory
> state now can diverge), but it's clearly wrong once replication had
> entered the picture. There's just no alternative to a critical section
> here.

Yeah, I like that direction.

> If we are really concerned with truncation failing - I don't know why we
> would be, we accept that we have to be able to modify files etc to stay
> up - we can add a pre-check ensuring that permissions are set up
> appropriately to allow us to truncate.

I think the question above is the core part of the problem.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
