Re: archive status ".ready" files may be created too early

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: bossartn(at)amazon(dot)com
Cc: a(dot)lubennikova(at)postgrespro(dot)ru, hlinnaka(at)iki(dot)fi, matsumura(dot)ryo(at)fujitsu(dot)com, masao(dot)fujii(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: archive status ".ready" files may be created too early
Date: 2020-12-15 10:32:57
Message-ID: 20201215.193257.1534297881230796788.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Mon, 14 Dec 2020 18:25:23 +0000, "Bossart, Nathan" <bossartn(at)amazon(dot)com> wrote in
> Apologies for the long delay.
>
> I've spent a good amount of time thinking about this bug and trying
> out a few different approaches for fixing it. I've attached a work-
> in-progress patch for my latest attempt.
>
> On 10/13/20, 5:07 PM, "Kyotaro Horiguchi" <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> > F0 F1
> > AAAAA F BBBBB
> > |---------|---------|---------|
> > seg X seg X+1 seg X+2
> >
> > Matsumura-san has a concern about the case where there are two (or
> > more) partially-flushed segment-spanning records at the same time.
> >
> > This patch remembers only the last cross-segment record. If we were
> > going to flush up to F0 after Record-B had been written, we would fail
> > to hold-off archiving seg-X. This patch is based on a assumption that
> > that case cannot happen because we don't leave a pending page at the
> > time of segment switch and no records don't span over three or more
> > segments.
>
> I wonder if these are safe assumptions to make. For your example, if
> we've written record B to the WAL buffers, but neither record A nor B
> have been written to disk or flushed, aren't we still in trouble?

You're right in that regard. There's a window where partial record is
written when write location passes F0 after insertion location passes
F1. However, remembering all spanning records seems overkilling to me.

I modifed the previous patch so that it remembers the start LSN of the
*oldest* corss-segment continuation record in the last consecutive
bonded segments, and the end LSN of the latest cross-segmetn
continuation record. This doesn't foreget past segment boundaries.
The region is cleard when WAL-write LSN goes beyond the remembered end
LSN. So the region may contain several wal-segments that are not
connected to the next one, but that doesn't matter so much.

> Also, is there actually any limit on WAL record length that means that
> it is impossible for a record to span over three or more segments?

Even though it is not a hard limit, AFAICS as mentioned before the
longest possible record is what log_newpages() emits. that is up to
about 500kBytes for now. I think we don't want to make the length
longer. If we set the wal_segment_size to 1MB and set the block size
to 16kB or more, we would have a recrod spanning over three or more
segments but I don't think that is a sane configuration and that kind
of issue could happen elsewhere.

> Perhaps these assumptions are true, but it doesn't seem obvious to me
> that they are, and they might be pretty fragile.

I added an assertion that a record must be shorter than a wal segment
to XLogRecordAssemble(). This guarantees the assumption to be true?
(The condition is tentative, would need to be adjusted.)

> The attached patch doesn't make use of these assumptions. Instead, we
> track the positions of the records that cross segment boundaries in a
> small hash map, and we use that to determine when it is safe to mark a
> segment as ready for archival. I think this approach resembles
> Matsumura-san's patch from June.
>
> As before, I'm not handling replication, archive_timeout, and
> persisting latest-marked-ready through crashes yet. For persisting
> the latest-marked-ready segment through crashes, I was thinking of
> using a new file that stores the segment number.

Also, the attached is a PoC.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v3-0001-PoC-Avoid-archiving-a-WAL-segment-that-continues-.patch text/x-patch 10.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexey Kondratov 2020-12-15 10:34:35 Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
Previous Message Peter Smith 2020-12-15 10:01:41 Re: Single transaction in the tablesync worker?