Re: Timeout failure in 019_replslot_limit.pl

From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: michael(at)paquier(dot)xyz, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Timeout failure in 019_replslot_limit.pl
Date: 2021-09-17 21:59:24
Message-ID: 202109172159.wd2jxfvabfbw@alvherre.pgsql
Lists: pgsql-hackers

On 2021-Sep-07, Kyotaro Horiguchi wrote:

> It seems like the "kill 'STOP'" in the script didn't suspend the
> processes before advancing WAL. The attached uses 'ps' command to
> check that since I didn't come up with the way to do the same in Perl.

Ah! So we tell the kernel to send the signal, but there's no guarantee
about when the other process actually reacts to it. Makes sense.

Your proposal is to examine the other process' state until we see that
it gets the T flag. I wonder how portable this is; I suspect not very.
`ps` is pretty annoying, meaning not consistently implemented -- GNU's
manpage says there are "UNIX options", "BSD options" and "GNU long
options", so it seems hard to believe that there is one set of options
that will work everywhere.
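
To make the "poll until we see the T flag" idea concrete without going
through `ps`, here is a minimal sketch that reads /proc directly. It is
Linux-only, so it does nothing for the portability concern above. The
helper names proc_state and wait_for_stopped are made up for
illustration and are not part of the test script or of the proposed
patch.

```perl
use strict;
use warnings;
use Time::HiRes qw(usleep);

# Hypothetical helper, Linux-only: return the one-letter process state
# from /proc/<pid>/stat, or undef if the file cannot be read.
sub proc_state
{
    my ($pid) = @_;
    open(my $fh, '<', "/proc/$pid/stat") or return undef;
    my $stat = do { local $/; <$fh> };
    close($fh);
    return undef unless defined $stat;

    # The command name is parenthesized and may itself contain spaces or
    # parentheses, so match greedily up to the last ") ".
    return $stat =~ /^\d+ \(.*\) (\S)/s ? $1 : undef;
}

# Hypothetical helper: poll until the process reports state 'T'
# (stopped), sleeping 100ms between attempts.  Returns 1 if the stopped
# state was seen, 0 if we ran out of attempts.
sub wait_for_stopped
{
    my ($pid, $max_attempts) = @_;
    for (1 .. $max_attempts)
    {
        my $state = proc_state($pid);
        return 1 if defined $state && $state eq 'T';
        usleep(100_000);
    }
    return 0;
}
```

In the test this would be called once per PID right after the
kill 'STOP', and before advance_wal().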

I found a Perl module (Proc::ProcessTable) that can be used to get the
list of processes and their metadata, but it isn't in core Perl and it
doesn't look very well maintained either, so that one's out.

Another option might be to wait on the kernel -- do something that would
involve the kernel taking action on the other process, acting like a
barrier of sorts. I don't know if this actually works, but we could
try. Something like sending SIGSTOP first, then "kill 0" -- or just
send SIGSTOP twice:

diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index e065c5c008..e8f323066a 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -346,6 +346,8 @@ $logstart = get_log_size($node_primary3);
 # freeze walsender and walreceiver. Slot will still be active, but walreceiver
 # won't get anything anymore.
 kill 'STOP', $senderpid, $receiverpid;
+kill 'STOP', $senderpid, $receiverpid;
+
 advance_wal($node_primary3, 2);

 my $max_attempts = 180;
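
For comparison, the "kill 0" variant could be sketched as below.
freeze_processes is a hypothetical helper name, not something that
exists in the test script. Signal 0 delivers nothing; it only makes the
kernel run its existence and permission checks for each PID. Whether
that second call really behaves as a barrier is exactly the part I have
not verified.

```perl
use strict;
use warnings;

# Hypothetical helper: send SIGSTOP to all PIDs, then probe the same
# PIDs with signal 0.  Returns two counts: how many PIDs were
# successfully sent STOP, and how many still answered the signal-0
# probe.  This does NOT prove that the processes have stopped.
sub freeze_processes
{
    my (@pids) = @_;
    my $stopped = kill 'STOP', @pids;
    my $alive   = kill 0, @pids;
    return ($stopped, $alive);
}
```

The caller would compare both counts against the number of PIDs passed
in, for instance freeze_processes($senderpid, $receiverpid) should
return (2, 2).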

> + # Haven't found the means to do the same on Windows
> + return if $TestLib::windows_os;

I suppose if it came down to something like your patch, we could do
something simple here like "if Windows, sleep 2s and hope for the best".
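
A rough sketch of that fallback, with every name hypothetical:
$is_windows stands in for $TestLib::windows_os from the quoted patch,
and $is_stopped is a caller-supplied code ref that reports whether the
process has been seen stopped (however that ends up being checked).

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep);

# Hypothetical helper.  On Windows we cannot inspect the process state,
# so just sleep 2s and hope for the best.  Elsewhere, poll the supplied
# check every 100ms until it succeeds or we run out of attempts.
sub wait_until_frozen
{
    my ($is_windows, $is_stopped, $max_attempts) = @_;

    if ($is_windows)
    {
        sleep(2);
        return 1;    # assumed, not verified
    }

    for (1 .. $max_attempts)
    {
        return 1 if $is_stopped->();
        sleep(0.1);
    }
    return 0;
}
```

Note that the Windows branch returns success unconditionally, so a
failure there would still show up as the original timeout rather than
as a clear error.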

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"They all went about naked as their mothers bore them, and the women too,
although I saw no more than one, a very young girl, and all those I saw were
young men, none of whom I saw being more than XXX years old" (Cristóbal Colón)
