Re: archive status ".ready" files may be created too early

From: Andres Freund <andres(at)anarazel(dot)de>
To: "alvherre(at)alvh(dot)no-ip(dot)org" <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: "Bossart, Nathan" <bossartn(at)amazon(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "x4mmm(at)yandex-team(dot)ru" <x4mmm(at)yandex-team(dot)ru>, "a(dot)lubennikova(at)postgrespro(dot)ru" <a(dot)lubennikova(at)postgrespro(dot)ru>, "hlinnaka(at)iki(dot)fi" <hlinnaka(at)iki(dot)fi>, "matsumura(dot)ryo(at)fujitsu(dot)com" <matsumura(dot)ryo(at)fujitsu(dot)com>, "masao(dot)fujii(at)gmail(dot)com" <masao(dot)fujii(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: archive status ".ready" files may be created too early
Date: 2021-08-30 21:05:29
Message-ID: 20210830210529.bd7wmqodt747xa2y@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2021-08-30 15:51:54 -0400, alvherre(at)alvh(dot)no-ip(dot)org wrote:
> On 2021-Aug-28, Andres Freund wrote:
>
> > While rebasing the aio patchset ontop of HEAD I noticed that this commit added
> > another atomic operation to XLogWrite() with archiving enabled. The WAL write
> > path is really quite hot, and at least some of the
> > NotifySegmentsReadyForArchive() calls are done while WALWriteLock is held.
> >
> > I think we should at least try to make the fast-path where no segment
> > boundaries were crossed use no atomic operations.
>
> I think the best way to achieve this is is to rely completely on
> walwriter doing the segment notification, so that the WAL write done by
> backend would only do a latch set.

When were you thinking of doing the latch sets? Adding a latch set for every
XLogWrite() wouldn't be cheap either. Both because syscalls under a lock
aren't free and because waking up walsender even more often isn't free (we
already have a few threads about reducing the signalling frequency).

There's also the question of what to do with single user mode. We shouldn't
just skip creating .ready files there...

Although, the more I think about, the more I am confused about the trailing
if (XLogArchivingActive())
NotifySegmentsReadyForArchive(LogwrtResult.Flush);

in XLogWrite(). Shouldn't that at the very least be inside the "If asked to
flush, do so" branch? Outside that and the finishing_seg branch
LogwrtResult.Flush won't have moved, right? So the call to
NotifySegmentsReadyForArchive() can't do anything, no?

Nor does it seem like we'd ever need to call NotifySegmentsReadyForArchive()
if we started writing on the current page - flushRecPtr can't move across a
segment boundary in that case.

I hadn't yet realized that this commit doesn't just make XLogWrite() more
expensive, it also makes XLogInsertRecord() more expensive :(. Adding two
divisions to XLogInsertRecord() isn't nice, especially as it happens
even if !XLogArchivingActive().

I can't really convince myself this deals correctly with multiple segment
spanning records and with records spanning more than one segment? It'd be
easier to understand if the new XLogCtlData variables were documented...

If there's one record from segment s0 to s1 and one from s1 to s4, and
wal_buffers is big enough to contain them all, the first record will set
earliestSegBoundary = s1
the second
latestSegBoundary = s4.

When s1 is fully written out, NotifySegmentsReadyForArchive() will set
earliestSegBoundary = latestSegBoundary = s4 and create .reaady for .s1 -
ok. But when when s2 is flushed, we'll afaict happily create .ready files
for s2, s3 despite s4 not yet being written, because earliestSegBoundary is
now s4.

I think there's other issues as well.

The more I look at this commit, the less I believe it's right.

The whole approach here of delaying .ready creation for these types of
segments seems wrong to me. Doesn't the exact same problem also exist for
streaming rep - which one can also use to maintain a PITR archive? walsender
sends up to the flush location, and pg_receivewal's FindStreamingStart() will
afaict just continue receiving from after that point.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Zhihong Yu 2021-08-30 21:11:34 Re: dynamic result sets support in extended query protocol
Previous Message David Christensen 2021-08-30 20:55:10 Re: [PATCH] Error out if SKIP LOCKED and WITH TIES are both specified