Re: archive status ".ready" files may be created too early

From: "Bossart, Nathan" <bossartn(at)amazon(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: "x4mmm(at)yandex-team(dot)ru" <x4mmm(at)yandex-team(dot)ru>, "a(dot)lubennikova(at)postgrespro(dot)ru" <a(dot)lubennikova(at)postgrespro(dot)ru>, "hlinnaka(at)iki(dot)fi" <hlinnaka(at)iki(dot)fi>, "matsumura(dot)ryo(at)fujitsu(dot)com" <matsumura(dot)ryo(at)fujitsu(dot)com>, "masao(dot)fujii(at)gmail(dot)com" <masao(dot)fujii(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: archive status ".ready" files may be created too early
Date: 2021-01-27 20:16:50
Message-ID: 5FA1ACA0-ACD0-46C1-BBF5-36F1D02534B0@amazon.com
Lists: pgsql-hackers

On 1/26/21, 6:36 PM, "Kyotaro Horiguchi" <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> At Tue, 26 Jan 2021 19:13:57 +0000, "Bossart, Nathan" <bossartn(at)amazon(dot)com> wrote in
>> On 12/17/20, 9:15 PM, "Kyotaro Horiguchi" <horikyota(dot)ntt(at)gmail(dot)com> wrote:
>> > At Thu, 17 Dec 2020 22:20:35 +0000, "Bossart, Nathan" <bossartn(at)amazon(dot)com> wrote in
>> >> On 12/15/20, 2:33 AM, "Kyotaro Horiguchi" <horikyota(dot)ntt(at)gmail(dot)com> wrote:
>> >> > You're right in that regard. There's a window where a partial record is
>> >> > written when the write location passes F0 after the insertion location
>> >> > passes F1. However, remembering all spanning records seems like overkill to me.
>> >>
>> >> I'm curious why you feel that recording all cross-segment records is
>> >> overkill. IMO it seems far simpler to just do that rather than try to
>> >
>> > Sorry, my words were not enough. Remembering all spanning records in
>> > *shared memory* seems to be overkill, even more so if it is stored in a
>> > shared hash table. Even though it is rarely the case, it can fail in a
>> > hard way when reaching the limit. If we could do well by remembering just
>> > two locations, we wouldn't need to worry about such a limitation.
>>
>> I don't think it will fail if we reach max_size for the hash table.
>> The comment above ShmemInitHash() has this note:
>>
>> * max_size is the estimated maximum number of hashtable entries. This is
>> * not a hard limit, but the access efficiency will degrade if it is
>> * exceeded substantially (since it's used to compute directory size and
>> * the hash table buckets will get overfull).
>
> That description means that a shared hash has a directory with a fixed
> size, thus there may be synonyms, which causes degradation. Even though
> buckets are preallocated with the specified number, since the minimum
> directory size is 256, at least 256 buckets end up being allocated in the
> long run. The minimum on-the-fly allocation size is 32. I haven't
> calculated it precisely, but I'm worried about the amount of spare shared
> memory the hash can allocate.

On my machine, hash_estimate_size() for the table returns 5,968 bytes.
That estimate is for a max_size of 16. In my testing, the most I was
able to get this table to need was 6 elements, and that required turning
off synchronous_commit, adding a long sleep at the end of XLogWrite(),
and increasing wal_buffers substantially. This leads me to think that a
max_size of 16 elements is typically sufficient. (I may have also
accidentally demonstrated that only storing two record boundaries
could be insufficient.)
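
For reference, the dynahash usage in question looks roughly like the
following. (The entry struct and names below are illustrative only, not
necessarily what the patch uses; the hash_estimate_size()/ShmemInitHash()
signatures are the ones from hsearch.h and shmem.h.)

    #include "postgres.h"
    #include "access/xlogdefs.h"
    #include "storage/shmem.h"
    #include "utils/hsearch.h"

    /* illustrative entry: one per segment-crossing record boundary */
    typedef struct RecordBoundaryEntry
    {
        XLogSegNo   seg;     /* key: segment the record crosses out of */
        XLogRecPtr  endptr;  /* end LSN of the crossing record */
    } RecordBoundaryEntry;

    #define MAX_BOUNDARY_ENTRIES 16

    static HTAB *RecordBoundaryHash;

    /* shared memory sizing; this is what returns ~6 kB for max_size = 16 */
    Size
    RecordBoundaryShmemSize(void)
    {
        return hash_estimate_size(MAX_BOUNDARY_ENTRIES,
                                  sizeof(RecordBoundaryEntry));
    }

    void
    RecordBoundaryShmemInit(void)
    {
        HASHCTL     info;

        memset(&info, 0, sizeof(info));
        info.keysize = sizeof(XLogSegNo);
        info.entrysize = sizeof(RecordBoundaryEntry);

        /* max_size is a sizing hint, not a hard limit */
        RecordBoundaryHash = ShmemInitHash("Archive record boundaries",
                                           MAX_BOUNDARY_ENTRIES,
                                           MAX_BOUNDARY_ENTRIES,
                                           &info,
                                           HASH_ELEM | HASH_BLOBS);
    }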

>> > Another concern about the concrete patch:
>> >
>> > NotifySegmentsReadyForArchive() searches the shared hash, acquiring an
>> > LWLock, every time XLogWrite is called while segment archiving is being
>> > held off. I don't think that is acceptable, and I think it could be a
>> > problem when many backends are competing for WAL.
>>
>> This is a fair point. I did some benchmarking with a few hundred
>> connections all doing writes, and I was not able to discern any
>> noticeable performance impact. My guess is that contention on this
>> new lock is unlikely because callers of XLogWrite() must already hold
>> WALWriteLock. Otherwise, I believe we acquire ArchNotifyLock no more
>> than once per segment to record new record boundaries.
>
> Thanks. I agree that the reader-reader contention is not a problem due
> to the existing serialization by WALWriteLock. Adding an entry happens
> only at a segment boundary, so the ArchNotifyLock doesn't seem to be a
> problem.
>
> However, the function prolongs the WALWriteLock section. Couldn't we
> somehow move the call to NotifySegmentsReadyForArchive in XLogWrite
> out of the WALWriteLock section?

I don't see a clean way to do that. XLogWrite() assumes that
WALWriteLock is held when it is called, and it doesn't release it at
any point. I think we'd have to move NotifySegmentsReadyForArchive()
to the callers of XLogWrite() if we wanted to avoid holding onto
WALWriteLock for longer. Unless we can point to a measurable
performance penalty, I'm not sure this is worth it.
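
To illustrate, a caller such as XLogFlush() would end up doing something
like this (a sketch only; the signature of NotifySegmentsReadyForArchive()
below is an assumption, not necessarily what the patch defines):

    /* write under WALWriteLock, as today */
    LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
    XLogWrite(WriteRqst, false);
    LWLockRelease(WALWriteLock);

    /*
     * Notify the archiver only after releasing WALWriteLock, assuming the
     * function needs nothing beyond ArchNotifyLock internally.
     */
    NotifySegmentsReadyForArchive(LogwrtResult.Flush);

Every caller of XLogWrite() would need the same treatment, which is part
of why I'm hesitant to do it without a demonstrated penalty.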

Nathan
