Re: archive status ".ready" files may be created too early

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: "alvherre(at)alvh(dot)no-ip(dot)org" <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: "Bossart, Nathan" <bossartn(at)amazon(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "x4mmm(at)yandex-team(dot)ru" <x4mmm(at)yandex-team(dot)ru>, "a(dot)lubennikova(at)postgrespro(dot)ru" <a(dot)lubennikova(at)postgrespro(dot)ru>, "hlinnaka(at)iki(dot)fi" <hlinnaka(at)iki(dot)fi>, "matsumura(dot)ryo(at)fujitsu(dot)com" <matsumura(dot)ryo(at)fujitsu(dot)com>, "masao(dot)fujii(at)gmail(dot)com" <masao(dot)fujii(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: archive status ".ready" files may be created too early
Date: 2021-08-20 15:27:58
Message-ID: CA+TgmoaaOA0pnJ3=j2Ao7PO7Obo6ShYyqXvtM8+daGmnq401zg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Aug 20, 2021 at 10:50 AM alvherre(at)alvh(dot)no-ip(dot)org
<alvherre(at)alvh(dot)no-ip(dot)org> wrote:
> 1. We use a hash table in shared memory. That's great. The part that's
> not so great is that in both places where we read items from it, we
> have to iterate in some way. This seems a bit silly. An array would
> serve us better, if only we could expand it as needed. However, in
> shared memory we can't do that. (I think the list of elements we
> need to memoize is arbitrarily long, if enough processes can be writing
> WAL at the same time.)

We can't expand the hash table either. It has an initial and maximum
size of 16 elements, which means it's basically an expensive array,
and which also means that it imposes a new limit of 16 *
wal_segment_size on the size of WAL records. If you exceed that limit,
I think things just go boom... which I think is not acceptable. I
think we can have records in the multi-GB range if wal_level=logical is in
use and someone chooses a stupid replica identity setting.
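
For reference, a fixed-size shared hash of this kind would be set up
roughly like the sketch below. This is illustrative only, not the patch's
actual code; the type, constant, and function names are made up. The point
is that for a shared-memory dynahash the size passed as max_size at startup
is a hard cap, so a 16-element table can never hold a 17th boundary.

    #include "postgres.h"
    #include "access/xlogdefs.h"
    #include "storage/shmem.h"
    #include "utils/hsearch.h"

    /* Hypothetical names; the real patch may differ. */
    #define MAX_PENDING_BOUNDARIES 16

    typedef struct PendingBoundaryEntry
    {
        XLogSegNo   seg;        /* key: segment whose boundary a record crosses */
        XLogRecPtr  endpos;     /* end LSN of that record */
    } PendingBoundaryEntry;

    static HTAB *PendingBoundaryHash;

    static void
    PendingBoundaryShmemInit(void)
    {
        HASHCTL     info;

        memset(&info, 0, sizeof(info));
        info.keysize = sizeof(XLogSegNo);
        info.entrysize = sizeof(PendingBoundaryEntry);

        /*
         * init_size == max_size: a shared-memory hash table cannot be
         * expanded after startup, so whatever cap is chosen here is a
         * hard limit on how many boundaries can be tracked at once.
         */
        PendingBoundaryHash = ShmemInitHash("Pending segment boundaries",
                                            MAX_PENDING_BOUNDARIES,
                                            MAX_PENDING_BOUNDARIES,
                                            &info,
                                            HASH_ELEM | HASH_BLOBS);
    }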

It's actually not clear to me why we need to track multiple entries
anyway. The scenario postulated by Horiguchi-san in
https://www.postgresql.org/message-id/20201014.090628.839639906081252194.horikyota.ntt@gmail.com
seems to require that the write position be multiple segments ahead of
the flush position, but that seems impossible with the present code,
because XLogWrite() calls issue_xlog_fsync() as soon as the segment is
filled. So I think, at least with the present code, any record that
isn't completely flushed to disk has to be at least partially in the
current segment. And there can be only one record that starts in some
earlier segment and ends in this one.
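
If that reasoning holds, a single slot would be enough. The following is a
rough sketch under that assumption, with hypothetical names; in real code
the state would live in XLogCtl-style shared memory under a lock rather
than in a plain static variable.

    #include "access/xlog.h"            /* wal_segment_size */
    #include "access/xlog_internal.h"   /* XLByteToSeg() */

    typedef struct PendingBoundary
    {
        XLogRecPtr  recStart;   /* start of the latest boundary-crossing record */
        XLogRecPtr  recEnd;     /* end of that record */
    } PendingBoundary;

    static PendingBoundary pendingBoundary;

    /* Remember the latest record that crosses a segment boundary. */
    static void
    RegisterPendingBoundary(XLogRecPtr recStart, XLogRecPtr recEnd)
    {
        pendingBoundary.recStart = recStart;
        pendingBoundary.recEnd = recEnd;
    }

    /* May we create the .ready file for 'seg', given the flush pointer? */
    static bool
    SegmentCanBeNotified(XLogSegNo seg, XLogRecPtr flushPtr)
    {
        XLogSegNo   startSeg,
                    endSeg;

        if (flushPtr >= pendingBoundary.recEnd)
            return true;        /* the crossing record is fully flushed */

        XLByteToSeg(pendingBoundary.recStart, startSeg, wal_segment_size);
        XLByteToSeg(pendingBoundary.recEnd, endSeg, wal_segment_size);

        /* segments the unflushed record spans must wait for the flush */
        return seg < startSeg || seg >= endSeg;
    }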

I will be the first to admit that the forced end-of-segment syncs
suck. They often stall every backend in the entire system at the same
time. Everyone fills up the xlog segment really fast and then stalls
HARD while waiting for that sync to happen. So it's arguably better
not to do more things that depend on that being how it works, but I
think needing a variable-size amount of shared memory is even worse.
If we're going to track multiple entries here, we need some rule that
bounds how many of them we might ever need to track. If the number of entries
is defined by the number of segment boundaries that a particular
record crosses, it's effectively unbounded, because right now WAL
records can be pretty much arbitrarily big.
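
To put illustrative numbers on that, assuming the default 16MB
wal_segment_size:

    1 GB record / 16 MB per segment = 64 segments spanned,
    i.e. roughly 64 boundary entries for a single record,
    four times what a 16-slot table can hold.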

--
Robert Haas
EDB: http://www.enterprisedb.com
