Re: archive status ".ready" files may be created too early

From: "Bossart, Nathan" <bossartn(at)amazon(dot)com>
To: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "hlinnaka(at)iki(dot)fi" <hlinnaka(at)iki(dot)fi>
Cc: "matsumura(dot)ryo(at)fujitsu(dot)com" <matsumura(dot)ryo(at)fujitsu(dot)com>, "masao(dot)fujii(at)gmail(dot)com" <masao(dot)fujii(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: archive status ".ready" files may be created too early
Date: 2020-12-14 18:25:23
Message-ID: EFF40306-8E8A-4259-B181-C84F3F06636C@amazon.com
Lists: pgsql-hackers

Apologies for the long delay.

I've spent a good amount of time thinking about this bug and trying
out a few different approaches for fixing it. I've attached a
work-in-progress patch for my latest attempt.

On 10/13/20, 5:07 PM, "Kyotaro Horiguchi" <horikyota(dot)ntt(at)gmail(dot)com> wrote:
>               F0        F1
>            AAAAA  F  BBBBB
>    |---------|---------|---------|
>       seg X    seg X+1   seg X+2
>
> Matsumura-san has a concern about the case where there are two (or
> more) partially-flushed segment-spanning records at the same time.
>
> This patch remembers only the last cross-segment record. If we were
> going to flush up to F0 after Record-B had been written, we would fail
> to hold off archiving seg-X. This patch is based on the assumption
> that that case cannot happen, because we don't leave a pending page at
> the time of a segment switch and no record spans three or more
> segments.

I wonder if these are safe assumptions to make. For your example, if
we've written record B to the WAL buffers, but neither record A nor B
has been written to disk or flushed, aren't we still in trouble?
Also, is there actually a limit on WAL record length that makes it
impossible for a record to span three or more segments? Perhaps these
assumptions hold, but it doesn't seem obvious to me that they do, and
they might be pretty fragile.
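As a back-of-the-envelope illustration (not from the patch; the segment
size, offsets, and the little helper below are mine, and page/record
header overhead is ignored), a record doesn't have to be outlandishly
large to cross two boundaries at once:

#include <stdint.h>
#include <stdio.h>

/*
 * Toy arithmetic only, not PostgreSQL code: how many segment boundaries
 * does a record cross, given its length and where it starts within its
 * first segment?
 */
static int
boundaries_crossed(uint64_t start_offset, uint64_t rec_len, uint64_t seg_size)
{
    /* position of the record's last byte, relative to its first segment */
    uint64_t last_byte = start_offset + rec_len - 1;

    return (int) (last_byte / seg_size);
}

int
main(void)
{
    uint64_t seg_size = 16 * 1024 * 1024;   /* default wal_segment_size */

    /* a 40 MB record that starts 1 MB before a segment boundary */
    printf("%d\n", boundaries_crossed(seg_size - (1 << 20),
                                      40 * 1024 * 1024, seg_size)); /* prints 3 */
    return 0;
}

So unless something already caps record length well below twice the
segment size, a single record spanning three or more segments seems at
least theoretically possible.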

The attached patch doesn't make use of these assumptions. Instead, we
track the positions of the records that cross segment boundaries in a
small hash map, and we use that to determine when it is safe to mark a
segment as ready for archival. I think this approach resembles
Matsumura-san's patch from June.
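To sketch the bookkeeping (this is a simplified, standalone model, not
the code in the attached patch; the names and the fixed-size array
standing in for the hash map are placeholders, and segment numbers are
assumed to start at 1):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef uint64_t XLogSegNo;

#define MAX_TRACKED 16

/* one entry per segment boundary that a WAL record crosses */
typedef struct
{
    XLogSegNo   seg;        /* the record crosses the seg / seg+1 boundary */
    XLogRecPtr  rec_end;    /* end of that record */
    bool        in_use;
} CrossSegEntry;

static CrossSegEntry tracked[MAX_TRACKED];
static XLogSegNo last_notified = 0;     /* newest segment already marked .ready */

/* called while inserting a record that crosses the seg / seg+1 boundary */
static void
register_boundary(XLogSegNo seg, XLogRecPtr rec_end)
{
    for (int i = 0; i < MAX_TRACKED; i++)
    {
        if (!tracked[i].in_use)
        {
            tracked[i].seg = seg;
            tracked[i].rec_end = rec_end;
            tracked[i].in_use = true;
            return;
        }
    }
    /* a real implementation would grow the hash map instead of giving up */
}

/*
 * Called after a flush up to "flushed".  "flushed_seg" is the newest segment
 * whose bytes are all on disk.  Returns the newest segment that may now be
 * marked .ready: a segment is held back while any record crossing out of it
 * is still only partially flushed.
 */
static XLogSegNo
segments_ready(XLogRecPtr flushed, XLogSegNo flushed_seg)
{
    XLogSegNo   safe = flushed_seg;

    for (int i = 0; i < MAX_TRACKED; i++)
    {
        if (!tracked[i].in_use)
            continue;
        if (tracked[i].rec_end <= flushed)
            tracked[i].in_use = false;      /* record fully flushed; forget it */
        else if (tracked[i].seg <= safe)
            safe = tracked[i].seg - 1;      /* hold back this segment and later ones */
    }

    if (safe > last_notified)
        last_notified = safe;
    return last_notified;
}

A record that spans three or more segments simply registers one entry
per boundary it crosses, which is exactly the case a single remembered
record can't describe. In the patch this is a proper hash map rather
than a fixed array, but the idea is the same: archiving a segment waits
until the flush pointer has passed the end of every record that crosses
out of it.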

As before, I'm not yet handling replication, archive_timeout, or
persisting the latest-marked-ready segment through crashes. For that
last piece, I was thinking of using a new file that stores the
segment number.
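A minimal sketch of the sort of thing I mean (the file name and the raw
POSIX calls are placeholders; a real patch would write a temporary file
and durably rename it, as we do elsewhere, so a crash mid-write can't
leave a torn file):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

typedef uint64_t XLogSegNo;

/* placeholder name/location for the new file */
#define LAST_READY_FILE "pg_wal/archive_status/last_ready"

/* durably record the newest segment we have marked .ready */
static int
persist_last_ready(XLogSegNo seg)
{
    char        buf[32];
    int         fd;
    ssize_t     len;

    len = snprintf(buf, sizeof(buf), "%llu\n", (unsigned long long) seg);

    fd = open(LAST_READY_FILE, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != len || fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}

/* read it back after a crash; 0 means "no file, assume nothing notified" */
static XLogSegNo
restore_last_ready(void)
{
    FILE       *f = fopen(LAST_READY_FILE, "r");
    unsigned long long seg = 0;

    if (f != NULL)
    {
        if (fscanf(f, "%llu", &seg) != 1)
            seg = 0;
        fclose(f);
    }
    return (XLogSegNo) seg;
}

The restored value would tell us, after a crash, which flushed segments
still need .ready files.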

Nathan

Attachment: 0001-WIP-Avoid-marking-segments-as-ready-for-archival-too.patch
  (application/octet-stream, 10.1 KB)
