Re: [HACKERS] make async slave to wait for lsn to be replayed

From: Kartyshov Ivan <i(dot)kartyshov(at)postgrespro(dot)ru>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [HACKERS] make async slave to wait for lsn to be replayed
Date: 2024-03-22 19:42:52
Message-ID: 22f4011c9fc31b8050054cc446b0c10f@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thank you for your feedback.

On 2024-03-20 12:11, Alexander Korotkov wrote:
> On Wed, Mar 20, 2024 at 12:34 AM Kartyshov Ivan
> <i(dot)kartyshov(at)postgrespro(dot)ru> wrote:
>> > 4.2 With an unreasonably high future LSN, BEGIN command waits
>> > unboundedly, shouldn't we check if the specified LSN is more than
>> > pg_last_wal_receive_lsn() error out?
>
> I think limiting wait lsn by current received lsn would destroy the
> whole value of this feature. The value is to wait till given LSN is
> replayed, whether it's already received or not.

Ok sounds reasonable, I`ll rollback the changes.

> But I don't see a problem here. On the replica, it's out of our
> control to check which lsn is good and which is not. We can't check
> whether the lsn, which is in future for the replica, is already issued
> by primary.
>
> For the case of wrong lsn, which could cause potentially infinite
> wait, there is the timeout and the manual query cancel.

Fully agree with this take.

>> > 4.3 With an unreasonably high wait time, BEGIN command waits
>> > unboundedly, shouldn't we restrict the wait time to some max
> value,
>> > say a day or so?
>> > SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
>> > BEGIN AFTER :'future_receive_lsn' WITHIN 100000;
>>
>> Good idea, I put it 1 day. But this limit we should to discuss.
>
> Do you think that specifying timeout in milliseconds is suitable? I
> would prefer to switch to seconds (with ability to specify fraction of
> second). This was expressed before by Alexander Lakhin.

It sounds like an interesting idea. Please review the result.

>> > https://github.com/macdice/redo-bench or similar tools?
>
> Ivan, could you do this?

Yes, test redo-bench/crash-recovery.sh
This patch on master
91.327, 1.973
105.907, 3.338
98.412, 4.579
95.818, 4.19

REL_13-STABLE
116.645, 3.005
113.212, 2.568
117.644, 3.183
111.411, 2.782

master
124.712, 2.047
117.012, 1.736
116.328, 2.035
115.662, 1.797

Strange behavior, patched version is faster then REL_13-STABLE and
master.

> I don't see this change in the patch. Normally if a process gets a
> signal, that causes WaitLatch() to exit immediately. It also exists
> immediately on query cancel. IIRC, this 1 minute timeout is needed to
> handle some extreme cases when an interrupt is missing. Other places
> have it equal to 1 minute. I don't see why we should have it
> different.

Ok, I`ll rollback my changes.

>> 4) added and expanded sections in the documentation
>
> I don't see this in the patch. I see only a short description in
> func.sgml, which is definitely not sufficient. We need at least
> everything we have in the docs before to be adjusted with the current
> approach of procedure.

I didn't find another section where to add the description of
pg_wait_lsn().
So I extend description on the bottom of the table.

>> 5) add default variant of timeout
>> pg_wait_lsn(trg_lsn pg_lsn, delay int8 DEFAULT 0)
>> example: pg_wait_lsn('0/31B1B60') equal pg_wait_lsn('0/31B1B60', 0)
>
> Does zero here mean no timeout? I think this should be documented.
> Also, I would prefer to see the timeout by default. Probably one
> minute would be good for default.

Lets discuss this point. Loop in function WaitForLSN is made that way,
if we choose delay=0, only then we can wait infinitely to wait LSN
without timeout. So default must be 0.

Please take one more look on the patch.

PS sorry, the strange BUG throw my mails out of thread.
https://www.postgresql.org/message-id/flat/f2ff071aa9141405bb8efee67558a058%40postgrespro.ru

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

Attachment Content-Type Size
v14_0001-Procedure-wait-lsn.patch text/x-diff 18.4 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2024-03-22 19:44:28 Re: Regression tests fail with musl libc because libpq.so can't be loaded
Previous Message Nathan Bossart 2024-03-22 19:35:06 Re: Slow GRANT ROLE on PostgreSQL 16 with thousands of ROLEs