Re: [HACKERS] make async slave to wait for lsn to be replayed

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Anna Akenteva <a(dot)akenteva(at)postgrespro(dot)ru>
Cc: Kartyshov Ivan <i(dot)kartyshov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [HACKERS] make async slave to wait for lsn to be replayed
Date: 2020-04-07 09:58:39
Message-ID: CAA4eK1JC-CSNXjRt3Ct4MFSX3owaD7J34gjzE7GamzfwBFfuCg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Apr 7, 2020 at 7:56 AM Anna Akenteva <a(dot)akenteva(at)postgrespro(dot)ru> wrote:
>
> On 2020-04-07 00:58, Kartyshov Ivan wrote:
> > Ok, here is a new version of patch with single LSN and TIMEOUT.
>
> I had a look at the code and did some more code cleanup, with Ivan's
> permission.
> This is what I did:
> - Removed "WAIT FOR" command tag from cmdtaglist.h and renamed WaitStmt
> to WaitClause (since there's no standalone WAIT FOR command anymore)
> - Added _copyWaitClause() and _equalWaitClause()
> - Removed unused #include-s from utility.c
> - Adjusted tests and documentation
> - Fixed/added some code comments
>
> I have a couple of questions about WaitUtility() though:
> - When waiting forever (due to not specifying a timeout), isn't 60
> seconds too long of an interval to check for interrupts?
> - If we did specify a timeout, it might be a very long one. In this
> case, shouldn't we also make sure to wake up sometimes to check for
> interrupts?
>

Right, we should probably wait for 100ms before checking the
interrupts. See the similar logic in pg_promote where we wait for
specified number of seconds.

> - Is it OK that specifying timeout = 0 (BEGIN WAIT FOR LSN ... TIMEOUT
> 0) is the same as not specifying timeout at all?
>

Yes that sounds reasonable to me.

Review comments:
--------------------------
1.
+/*
+ * Delete wait event of the current backend from the shared memory array.
+ *
+ * TODO: Consider state cleanup on backend failure.
+ * Check:
+ * 1) nomal|smart|fast|immediate stop
+ * 2) SIGKILL and SIGTERM
+ */
+static void
+DeleteEvent(void)

I don't see how this is implemented or called to handle any errors.
For example in function WaitUtility if the WaitLatch errors out due to
any error, then the event won't be deleted. I think we can't assume
WaitLatch or any other code in this code path will never error out.
For ex. WaitLatch---->WaitEventSetWaitBlock() can error out. Also, in
future we can add more code which can error out.

2.
+ /*
+ * If received an interruption from CHECK_FOR_INTERRUPTS,
+ * then delete the current event from array.
+ */
+ if (InterruptPending)
+ {
+ DeleteEvent();
+ ProcessInterrupts();
+ }

We generally do this type of handling via CHECK_FOR_INTERRUPTS. One
reason is that it behaves slightly differently in Windows. I am not
sure why we want to do differently here? This looks quite adhoc to me
and may not be correct. If we handle this event in error path, then
we might not need to do some special handling.

3.
+/*
+ * On WAIT use a latch to wait till LSN is replayed,
+ * postmaster dies or timeout happens.
+ * Returns 1 if LSN was reached and 0 otherwise.
+ */
+int
+WaitUtility(XLogRecPtr target_lsn, const float8 secs)

Isn't it better to have a return value as bool? IOW, why this
function need int as its return value?

4.
+#define GetNowFloat() ((float8) GetCurrentTimestamp() / 1000000.0)

This same define is used elsewhere in the code as well, may be we can
define it in some central place and use it.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2020-04-07 10:00:29 Re: WAL usage calculation patch
Previous Message Peter Eisentraut 2020-04-07 09:51:32 Re: pgsql: Generate backup manifests for base backups, and validate them.