Re: archive status ".ready" files may be created too early

From: "Bossart, Nathan" <bossartn(at)amazon(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, "alvherre(at)alvh(dot)no-ip(dot)org" <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "x4mmm(at)yandex-team(dot)ru" <x4mmm(at)yandex-team(dot)ru>, "a(dot)lubennikova(at)postgrespro(dot)ru" <a(dot)lubennikova(at)postgrespro(dot)ru>, "hlinnaka(at)iki(dot)fi" <hlinnaka(at)iki(dot)fi>, "matsumura(dot)ryo(at)fujitsu(dot)com" <matsumura(dot)ryo(at)fujitsu(dot)com>, "masao(dot)fujii(at)gmail(dot)com" <masao(dot)fujii(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: archive status ".ready" files may be created too early
Date: 2021-08-20 16:35:59
Message-ID: 72FB62AA-27D2-404B-8A4E-4BA6633AE720@amazon.com
Lists: pgsql-hackers

On 8/20/21, 8:29 AM, "Robert Haas" <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Aug 20, 2021 at 10:50 AM alvherre(at)alvh(dot)no-ip(dot)org
> <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
>> 1. We use a hash table in shared memory. That's great. The part that's
>> not so great is that in both places where we read items from it, we
>> have to iterate in some way. This seems a bit silly. An array would
>> serve us better, if only we could expand it as needed. However, in
>> shared memory we can't do that. (I think the list of elements we
>> need to memoize is arbitrarily long, if enough processes can be writing
>> WAL at the same time.)
>
> We can't expand the hash table either. It has an initial and maximum
> size of 16 elements, which means it's basically an expensive array,
> and which also means that it imposes a new limit of 16 *
> wal_segment_size on the size of WAL records. If you exceed that limit,
> I think things just go boom... which I think is not acceptable. I
> think we can have records in the multi-GB range if wal_level=logical
> and someone chooses a stupid replica identity setting.

If a record spans multiple segments, we only register one segment
boundary. For example, if I insert a record that starts in segment 1
and ends in segment 10, I'll insert a single segment boundary for
segment 10. We'll only create .ready files for segments 1 through 9
once this record is completely flushed to disk.
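In pseudocode, the registration logic looks something like this (the
function and variable names here are illustrative, not necessarily the
exact ones from the patch):

    static void
    RegisterRecordBoundary(XLogRecPtr StartPos, XLogRecPtr EndPos)
    {
        XLogSegNo   start_seg;
        XLogSegNo   end_seg;

        XLByteToSeg(StartPos, start_seg, wal_segment_size);
        XLByteToSeg(EndPos, end_seg, wal_segment_size);

        /*
         * A record confined to a single segment needs no entry.  For a
         * cross-segment record, remember only the segment containing
         * its end, along with the end LSN; the earlier segments must
         * not be marked .ready until that LSN is flushed.
         */
        if (start_seg != end_seg)
            RegisterSegmentBoundary(end_seg, EndPos);
    }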

I was under the impression that shared hash tables could be expanded
as necessary, but from your note and the following comment, that does
not seem to be true:

* Note: for a shared-memory hashtable, nelem needs to be a pretty good
* estimate, since we can't expand the table on the fly. But an unshared
* hashtable can be expanded on-the-fly, so it's better for nelem to be
* on the small side and let the table grow if it's exceeded. An overly
* large nelem will penalize hash_seq_search speed without buying much.
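For illustration, the shared table gets created with something like the
following (SegmentBoundaryEntry is a made-up name for the entry
struct). Since dynahash can't grow a shared-memory table, passing 16
for both init_size and max_size makes 16 a hard cap:

    HASHCTL     info;
    HTAB       *SegmentBoundaryTable;

    info.keysize = sizeof(XLogSegNo);
    info.entrysize = sizeof(SegmentBoundaryEntry);

    SegmentBoundaryTable = ShmemInitHash("segment boundary table",
                                         16, 16,    /* init == max */
                                         &info,
                                         HASH_ELEM | HASH_BLOBS);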

> It's actually not clear to me why we need to track multiple entries
> anyway. The scenario postulated by Horiguchi-san in
> https://www.postgresql.org/message-id/20201014.090628.839639906081252194.horikyota.ntt@gmail.com
> seems to require that the write position be multiple segments ahead of
> the flush position, but that seems impossible with the present code,
> because XLogWrite() calls issue_xlog_fsync() at once if the segment is
> filled. So I think, at least with the present code, any record that
> isn't completely flushed to disk has to be at least partially in the
> current segment. And there can be only one record that starts in some
> earlier segment and ends in this one.

We register the boundaries in XLogInsertRecord(), which AFAICT just bumps
the global write request pointer ahead, so I'm not sure we can make
any assumptions about what is written/flushed at that time. (I see
that we do end up calling XLogFlush() for XLOG_SWITCH records in
XLogInsertRecord(), but I don't see any other cases where we actually
write anything in this function.) Am I missing something?
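For reference, the tail end of XLogInsertRecord() does roughly the
following (paraphrased from my reading of xlog.c, details elided):

    /* advance the shared write *request* pointer; nothing is written */
    SpinLockAcquire(&XLogCtl->info_lck);
    if (XLogCtl->LogwrtRqst.Write < EndPos)
        XLogCtl->LogwrtRqst.Write = EndPos;
    SpinLockRelease(&XLogCtl->info_lck);

    /* the one exception: a segment switch flushes the record at once */
    if (isLogSwitch)
        XLogFlush(EndPos);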

> I will be the first to admit that the forced end-of-segment syncs
> suck. They often stall every backend in the entire system at the same
> time. Everyone fills up the xlog segment really fast and then stalls
> HARD while waiting for that sync to happen. So it's arguably better
> not to do more things that depend on that being how it works, but I
> think needing a variable-size amount of shared memory is even worse.
> If we're going to track multiple entries here we need some rule that
> bounds how many of them we can need to track. If the number of entries
> is defined by the number of segment boundaries that a particular
> record crosses, it's effectively unbounded, because right now WAL
> records can be pretty much arbitrarily big.

If there isn't a way to ensure that the number of entries we need to
store is bounded, I'm tempted to propose my original patch [0], which
just moves .ready file creation to the very end of XLogWrite(). It's
probably not a complete solution, but it might be better than what's
there today.
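Roughly, the idea from [0] is the following (heavily simplified; the
real XLogWrite() loop does far more bookkeeping, and a real version
would notify every segment finished during the call, not just the
last):

    XLogSegNo   last_finished_seg = 0;

    while (LogwrtResult.Write < WriteRqst.Write)
    {
        /* ... write out completed pages ... */

        if (finishing_seg)
        {
            issue_xlog_fsync(openLogFile, openLogSegNo);

            /* remember the segment, but defer its .ready file */
            last_finished_seg = openLogSegNo;
        }
    }

    /*
     * Everything up to WriteRqst.Write is now written and synced, so
     * it is finally safe to tell the archiver about completed segments.
     */
    if (last_finished_seg != 0 && XLogArchivingActive())
        XLogArchiveNotifySeg(last_finished_seg);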

Nathan

[0] https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505%40amazon.com
