Re: archive status ".ready" files may be created too early

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: alvherre(at)2ndquadrant(dot)com
Cc: bossartn(at)amazon(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: archive status ".ready" files may be created too early
Date: 2019-12-17 10:27:24
Message-ID: 20191217.192724.1673213777530336030.horikyota.ntt@gmail.com
Lists: pgsql-hackers

Thank you Alvaro for the comment (on my comment).

At Fri, 13 Dec 2019 18:33:44 -0300, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote in
> On 2019-Dec-13, Kyotaro Horiguchi wrote:
>
> > At Thu, 12 Dec 2019 22:50:20 +0000, "Bossart, Nathan" <bossartn(at)amazon(dot)com> wrote in
>
> > > The crux of the issue seems to be that XLogWrite() does not wait for
> > > the entire record to be written to disk before creating the ".ready"
> > > file. Instead, it just waits for the last page of the segment to be
> > > written before notifying the archiver. If PostgreSQL crashes before
> > > it is able to write the rest of the record, it will end up reusing the
> > > ".ready" segment at the end of crash recovery. In the meantime, the
> > > archiver process may have already processed the old version of the
> > > segment.
> >
> > Yeah, that can happen if the server is restarted after the crash.
>
> ... which is the normal way to run things, no?

Yes. In older versions (< 10), the default value for wal_level was
minimal. In 10, only the default for wal_level was changed, to
replica. Still, I'm not sure whether restart_after_crash can be
recommended for streaming replication...

> Why is it bad? It's the default value.

I reconsidered it more deeply and concluded that it does not harm
replication the way I had thought.

A WAL-buffer overflow may write out a partial continuation record, and
that data can be flushed immediately. That led me to misunderstand
that a standby could receive only the first half of a continuation
record. Actually, that write doesn't advance LogwrtResult.Flush, so
the standby doesn't receive a record split at a page boundary. (The
cases where a crashed master is used as a new standby as-is might have
contaminated my thinking..)
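
To make the reasoning above concrete, here is a toy model in C. It is
not PostgreSQL source. Only the name LogwrtResult.Flush comes from the
discussion; WalPos, ToyLogwrtResult and every toy_* function are
hypothetical. The model assumes that the standby is only ever sent WAL
up to the flush position, and that flush requests in this toy always
target the end of a record, which is a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a WAL position (byte offset). */
typedef uint64_t WalPos;

/* Toy counterpart of the write/flush pair discussed above. */
typedef struct
{
    WalPos write;   /* handed to the OS, not necessarily durable */
    WalPos flush;   /* known durable; assumed limit of what is sent */
} ToyLogwrtResult;

/* A buffer-full write pushes pages out but leaves "flush" untouched. */
static void
toy_write_on_buffer_full(ToyLogwrtResult *r, WalPos upto)
{
    assert(r != NULL);
    if (upto > r->write)
        r->write = upto;
}

/* A flush request at the end of a record advances both positions. */
static void
toy_flush(ToyLogwrtResult *r, WalPos record_end)
{
    assert(r != NULL);
    if (record_end > r->write)
        r->write = record_end;
    if (record_end > r->flush)
        r->flush = record_end;
}

/* Assumption of this sketch: the standby is sent WAL up to "flush". */
static WalPos
toy_sendable_upto(const ToyLogwrtResult *r)
{
    return r->flush;
}

/* True if the sendable position falls strictly inside the record. */
static bool
toy_standby_sees_split_record(const ToyLogwrtResult *r,
                              WalPos rec_start, WalPos rec_end)
{
    WalPos s = toy_sendable_upto(r);

    return s > rec_start && s < rec_end;
}
```

In this model, a buffer-full write that stops in the middle of a
record moves only "write", so toy_standby_sees_split_record() stays
false until toy_flush() is called with the end of that record.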

Sorry for the bogus comment. My conclusion here is that
restart_after_crash doesn't seem to harm the standby immediately.

> > The standby can be inconsistent at the time of master crash, so it
> > should be fixed using pg_rewind or should be recreated from a base
> > backup.
>
> Surely the master will just come up and replay its WAL, and there should
> be no inconsistency.
>
> You seem to be thinking that a standby is promoted immediately on crash
> of the master, but this is not a given.

Basically no, but I might have mixed the two cases up a bit. Anyway,
returning to the proposal: I think XLogWrite can be called when the
WAL buffers are full, and that write can reach the last page of a
segment. The proposed patch doesn't work in that case, since that
XLogWrite call doesn't write the whole continuation record. But I'm
not sure that corner case is worth amending..
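
The corner case can be written down as a small predicate. Again this
is only a sketch in C, not PostgreSQL code and not the proposed patch:
WalPos and all toy_* names are hypothetical, and the 16MB segment size
is just the default. It asks whether a write has completed a segment
(the point at which, per the description upthread, the ".ready" file
is created) while a record that crosses the segment boundary is still
only partly written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a WAL position (byte offset). */
typedef uint64_t WalPos;

#define TOY_SEG_SIZE  ((WalPos) 16 * 1024 * 1024)  /* default segment size */
#define TOY_PAGE_SIZE ((WalPos) 8192)              /* default WAL page size */

/* End position of the segment that contains "pos". */
static WalPos
toy_segment_end(WalPos pos)
{
    return (pos / TOY_SEG_SIZE + 1) * TOY_SEG_SIZE;
}

/* The write has covered the last page of the segment. */
static bool
toy_write_finishes_segment(WalPos written_upto, WalPos seg_end)
{
    return written_upto >= seg_end;
}

/* The record starts before the segment end and finishes after it. */
static bool
toy_record_crosses(WalPos rec_start, WalPos rec_end, WalPos seg_end)
{
    return rec_start < seg_end && rec_end > seg_end;
}

/*
 * The risky state: the segment holding the start of the record is
 * complete on disk, but the tail of the record in the next segment
 * has not been written yet.
 */
static bool
toy_ready_too_early(WalPos written_upto, WalPos rec_start, WalPos rec_end)
{
    WalPos seg_end;

    assert(rec_start < rec_end);
    seg_end = toy_segment_end(rec_start);

    return toy_write_finishes_segment(written_upto, seg_end) &&
           toy_record_crosses(rec_start, rec_end, seg_end) &&
           written_upto < rec_end;
}
```

A buffer-full write that stops exactly at the segment end, with a
record running from 100 bytes before that end to 300 bytes after it,
makes toy_ready_too_early() return true. Once the write reaches the
end of the record, it returns false.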

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
