Re: archive status ".ready" files may be created too early

From: Andres Freund <andres(at)anarazel(dot)de>
To: "Bossart, Nathan" <bossartn(at)amazon(dot)com>
Cc: "alvherre(at)alvh(dot)no-ip(dot)org" <alvherre(at)alvh(dot)no-ip(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "x4mmm(at)yandex-team(dot)ru" <x4mmm(at)yandex-team(dot)ru>, "a(dot)lubennikova(at)postgrespro(dot)ru" <a(dot)lubennikova(at)postgrespro(dot)ru>, "hlinnaka(at)iki(dot)fi" <hlinnaka(at)iki(dot)fi>, "matsumura(dot)ryo(at)fujitsu(dot)com" <matsumura(dot)ryo(at)fujitsu(dot)com>, "masao(dot)fujii(at)gmail(dot)com" <masao(dot)fujii(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: archive status ".ready" files may be created too early
Date: 2021-08-31 07:43:38
Message-ID: 20210831074338.mt7hpaveuyvb723d@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2021-08-31 06:45:06 +0000, Bossart, Nathan wrote:
> On 8/30/21, 7:39 PM, "Andres Freund" <andres(at)anarazel(dot)de> wrote:
> > On 2021-08-30 22:39:04 +0000, Bossart, Nathan wrote:
> >> If we called NotifySegmentsReadyForArchive() before we updated the
> >> flush location in shared memory, we might skip nudging the WAL writer
> >> even though we should.
> >
> > That's trivial to address - just have a local variable saying whether we need
> > to call NotifySegmentsReadyForArchive().
>
> I think we can remove the race condition entirely by moving boundary
> registration to before WALInsertLockRelease(). I attached a patch for
> discussion.

I think it's a bad idea to move more code to before
WALInsertLockRelease. There's a very limited number of xlog insert slots, and
WAL flushes (e.g. commits) need to wait for insertions to finish.

> >> > When s1 is fully written out, NotifySegmentsReadyForArchive() will set
> >> > earliestSegBoundary = latestSegBoundary = s4 and create .ready for s1 -
> >> > ok. But when s2 is flushed, we'll afaict happily create .ready files
> >> > for s2, s3 despite s4 not yet being written, because earliestSegBoundary is
> >> > now s4.
> >>
> >> In this case, the .ready files for s2 and s3 wouldn't be created until
> >> s4 is flushed to disk.
> >
> > I don't think that's true as the code stands today? The
> > NotifySegmentsReadyForArchive() for s2 will update earliestSegBoundary to s4,
> > because latestSegBoundary = 4 and earliestSegBoundary = 1, triggering the
> > keep_latest branch. Any subsequent NotifySegmentsReadyForArchive() with a
> > segment < 4 will then be able to flush s2 and s3?
>
> When flushRecPtr is less than both of the segment boundaries,
> NotifySegmentsReadyForArchive() will return without doing anything.
> At least, that was the intent. If there is some reason it's not
> actually working that way, I can work on fixing it.

But that's not OK either! Consider a scenario where there are small records,
each spanning just a bit into the next segment, and initially all the data is
in wal_buffers.

RegisterSegmentBoundary(s1, s1+10)
  earliestSegBoundary = s1
  earliestSegBoundaryEndPtr = s1+10

RegisterSegmentBoundary(s2, s2+10)
  earliestSegBoundary = s1
  earliestSegBoundaryEndPtr = s1+10
  latestSegBoundary = s2
  latestSegBoundaryEndPtr = s2+10

RegisterSegmentBoundary(s3, s3+10)
  earliestSegBoundary = s1
  earliestSegBoundaryEndPtr = s1+10
  latestSegBoundary = s2
  latestSegBoundaryEndPtr = s2+10

RegisterSegmentBoundary(s4, s4+10)
  earliestSegBoundary = s1
  earliestSegBoundaryEndPtr = s1+10
  latestSegBoundary = s4
  latestSegBoundaryEndPtr = s4+10

If there's now a flush request including all of s3, we'll have the following
sequence of notifies:

NotifySegmentsReadyForArchive(s1)
  nothing happens, smaller than s1+10

NotifySegmentsReadyForArchive(s2)
  earliestSegBoundary = s4
  earliestSegBoundaryEndPtr = s4+10
  latestSegBoundary = s4
  latestSegBoundaryEndPtr = s4+10
  latest_boundary_seg = s1

NotifySegmentsReadyForArchive(s3)
  nothing happens, flush is smaller than s4

If the record ending at s4+10 isn't an async commit (and thus
XLogCtl->asyncXactLSN is smaller), and there are no further records, we can
end up waiting effectively forever for s2 (and s3) to be archived. If all
other connections (and autovac etc.) are idle, there will be no XLogFlush()
calls, nor will XLogBackgroundFlush() do anything: it'll hit the "If already
known flushed" path, since the first page in s4 is only partially filled.
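
To spell out the two-slot bookkeeping I'm assuming in the trace above, here's
a compact model of the notify logic - approximate names and structure,
reconstructed from the behaviour described in this thread rather than the
actual patch code:

#include "postgres.h"
#include "access/xlogdefs.h"

#define UNSET_SEGNO ((XLogSegNo) 0)   /* "slot empty" sentinel, model only */

static XLogSegNo  earliestSegBoundary = UNSET_SEGNO;
static XLogRecPtr earliestSegBoundaryEndPtr = InvalidXLogRecPtr;
static XLogSegNo  latestSegBoundary = UNSET_SEGNO;
static XLogRecPtr latestSegBoundaryEndPtr = InvalidXLogRecPtr;

static void
NotifySegmentsReadyForArchive(XLogRecPtr flushRecPtr)
{
    XLogSegNo   latest_boundary_seg;

    /* nothing is ready until the earliest boundary is fully flushed */
    if (earliestSegBoundary == UNSET_SEGNO ||
        flushRecPtr < earliestSegBoundaryEndPtr)
        return;

    if (latestSegBoundary == UNSET_SEGNO ||
        flushRecPtr >= latestSegBoundaryEndPtr)
    {
        /* both boundaries are flushed: everything up to the latest is ready */
        latest_boundary_seg = (latestSegBoundary != UNSET_SEGNO) ?
            latestSegBoundary : earliestSegBoundary;
        earliestSegBoundary = latestSegBoundary = UNSET_SEGNO;
    }
    else
    {
        /*
         * keep_latest: only the earliest boundary is flushed.  Copying the
         * latest boundary into the earliest slot discards the only record of
         * anything registered in between - s2 and s3 in the trace above.
         */
        latest_boundary_seg = earliestSegBoundary;
        earliestSegBoundary = latestSegBoundary;
        earliestSegBoundaryEndPtr = latestSegBoundaryEndPtr;
    }

    /* ... create .ready files for segments up to latest_boundary_seg ... */
    (void) latest_boundary_seg;
}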

Am I missing something?

> > I don't think it's sensible to fix these separately. It'd be one thing to do
> > that for HEAD, but on the back branches? And that this patch hasn't gotten any
> > performance testing is scary.
>
> Are there any specific performance tests you'd like to see? I don't
> mind running a couple.

- Parallel copy with > 8 processes

- Parallel non-transactional insertion of small-medium records
  Simulates inserting rows within a transaction

- Parallel transactional insertion of small-medium sized records, with fsync=on
  Plain OLTP writes

- Parallel transactional insertion of small-medium sized records, with fsync=off
  fsync=off to simulate a fast server-class SSD (where fsync is
  instantaneous). Of course, if you have one of those, you can also use that.

For the OLTP ones I've had good experience simulating workloads with
pg_logical_emit_message(). That exercises just the WAL path, *drastically*
reducing the variance and shortening the required test duration.
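
For instance, something along these lines (an illustrative script, not a
recommendation of specific sizes or client counts):

  -- wal-only.sql: exercises only the WAL insert/flush path, no table churn
  \set len random(64, 512)
  SELECT pg_logical_emit_message(true, 'bench', repeat('x', :len));

run along the lines of "pgbench -n -f wal-only.sql -c 16 -j 16 -T 300".
Passing false for the first argument gives the non-transactional variant, and
toggling fsync covers the last two cases in the list above.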

Greetings,

Andres Freund
