Re: archive status ".ready" files may be created too early

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: "Bossart, Nathan" <bossartn(at)amazon(dot)com>
Cc: "alvherre(at)alvh(dot)no-ip(dot)org" <alvherre(at)alvh(dot)no-ip(dot)org>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "x4mmm(at)yandex-team(dot)ru" <x4mmm(at)yandex-team(dot)ru>, "a(dot)lubennikova(at)postgrespro(dot)ru" <a(dot)lubennikova(at)postgrespro(dot)ru>, "hlinnaka(at)iki(dot)fi" <hlinnaka(at)iki(dot)fi>, "matsumura(dot)ryo(at)fujitsu(dot)com" <matsumura(dot)ryo(at)fujitsu(dot)com>, "masao(dot)fujii(at)gmail(dot)com" <masao(dot)fujii(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: archive status ".ready" files may be created too early
Date: 2021-08-20 17:04:58
Message-ID: CA+TgmoYgB6_oS9ytKFjhL18fTA-voYjcT4h-CdtuPYPPbCgM=g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Aug 20, 2021 at 12:36 PM Bossart, Nathan <bossartn(at)amazon(dot)com> wrote:
> If a record spans multiple segments, we only register one segment
> boundary. For example, if I insert a record that starts at segment
> number 1 and stops at 10, I'll insert one segment boundary for segment
> 10. We'll only create .ready files for segments 1 through 9 once this
> record is completely flushed to disk.

Oh ... OK. So is there any experimental scenario in which the hash
table ends up with more than 1 entry? And if so, how does that happen?

> > It's actually not clear to me why we need to track multiple entries
> > anyway. The scenario postulated by Horiguchi-san in
> > https://www.postgresql.org/message-id/20201014.090628.839639906081252194.horikyota.ntt@gmail.com
> > seems to require that the write position be multiple segments ahead of
> > the flush position, but that seems impossible with the present code,
> > because XLogWrite() calls issue_xlog_fsync() at once if the segment is
> > filled. So I think, at least with the present code, any record that
> > isn't completely flushed to disk has to be at least partially in the
> > current segment. And there can be only one record that starts in some
> > earlier segment and ends in this one.
>
> We register the boundaries XLogInsertRecord(), which AFAICT just bumps
> the global write request pointer ahead, so I'm not sure we can make
> any assumptions about what is written/flushed at that time. (I see
> that we do end up calling XLogFlush() for XLOG_SWITCH records in
> XLogInsertRecord(), but I don't see any other cases where we actually
> write anything in this function.) Am I missing something?

Well, I'm not sure. But I *think* that the code as it exists today is
smart enough not to try to archive a segment that hasn't been
completely flushed, and the gap is only that even though the segment
might be completely flushed, some portion of the record that is part
of a later segment might not be flushed, and thus after a crash we
might overwrite the already-flushed contents. The patch can make an
implementation choice to do some work at XLogInsertRecord() time if it
likes, but there's no real hazard at that point. The hazard only
exists, or so I think, once a segment that contains part of the record
is fully on disk. But that means, if my previous logic is correct,
that the hazard can only exist for at most 1 record at any point in
time.

> If there isn't a way to ensure that the number of entries we need to
> store is bounded, I'm tempted to propose my original patch [0], which
> just moves .ready file creation to the very end of XLogWrite(). It's
> probably not a complete solution, but it might be better than what's
> there today.

Doesn't that allocate memory inside a critical section? I would have
thought it would cause an immediate assertion failure.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message John Naylor 2021-08-20 17:05:49 Re: badly calculated width of emoji in psql
Previous Message Pavel Stehule 2021-08-20 16:41:02 Re: Improving some plpgsql error messages