Re: Timeout failure in 019_replslot_limit.pl

From: Noah Misch <noah(at)leadboat(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, michael(at)paquier(dot)xyz, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Timeout failure in 019_replslot_limit.pl
Date: 2021-09-18 03:41:00
Message-ID: 20210918034100.GA2913772@rfd.leadboat.com
Lists: pgsql-hackers

On Fri, Sep 17, 2021 at 06:59:24PM -0300, Alvaro Herrera wrote:
> On 2021-Sep-07, Kyotaro Horiguchi wrote:
> > It seems like the "kill 'STOP'" in the script didn't suspend the
> > processes before advancing WAL. The attached uses 'ps' command to
> > check that since I didn't come up with the way to do the same in Perl.
>
> Ah! so we tell the kernel to send the signal, but there's no guarantee
> about the timing for the reaction from the other process. Makes sense.

Agreed.

> Your proposal is to examine the other process' state until we see that
> it gets the T flag. I wonder how portable this is; I suspect not very.
> `ps` is pretty annoying, meaning not consistently implemented -- GNU's
> manpage says there are "UNIX options", "BSD options" and "GNU long
> options", so it seems hard to believe that there is one set of options
> that will work everywhere.

I like this, and it's the most robust way. I agree there's no portable way,
so I'd modify it to be fail-open. Run a "ps" command that works on the OP's
system. If the output shows the process in a state matching [DRS], we can
confidently sleep a bit for signal delivery to finish. If the command fails
or prints something else (including state T, which we need not check
explicitly), assume SIGSTOP delivery is complete. If some other platform shows
this race in the future, we can add an additional "ps" command.
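
A rough sketch of that fail-open check, in Perl since that is what the test
uses. The helper names, the "ps -o state= -p" invocation, the 100ms poll
interval, and the attempt limit are all assumptions for illustration, not taken
from the attached patch; "state" is accepted by procps and BSD ps but is not
guaranteed by POSIX:

```perl
use strict;
use warnings;
use Time::HiRes qw(usleep);

# Hypothetical helper: first character of the process state as printed by
# "ps", or undef if the command fails or prints nothing usable.
sub process_state
{
	my ($pid) = @_;

	# Assumption: this "ps" accepts "-o state=" (procps and BSD ps do).
	my $out = `ps -o state= -p $pid 2>/dev/null`;
	return if $? != 0 || !defined $out;
	return $out =~ /^\s*(\S)/ ? $1 : undef;
}

# True only when "ps" positively reports a not-yet-stopped state.
# Anything else (undef, T, unknown letters) means "don't wait".
sub needs_wait
{
	my ($state) = @_;
	return (defined $state && $state =~ /^[DRS]/) ? 1 : 0;
}

# Hypothetical helper: poll until the process no longer looks runnable.
# Fail-open: returns 1 as soon as "ps" fails or shows any other state,
# and 0 if the process still looks runnable after all attempts.
sub wait_for_stop
{
	my ($pid, $max_attempts) = @_;
	$max_attempts //= 100;

	for (1 .. $max_attempts)
	{
		return 1 unless needs_wait(process_state($pid));
		usleep(100_000);    # 100ms between polls
	}
	return 0;
}
```

In the test this would be called once per PID right after the kill 'STOP' line
and before advance_wal(). State T is never matched explicitly; it simply falls
into the "anything else" branch.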

If we ever get the "stop events" system
(https://postgr.es/m/flat/CAPpHfdtSEOHX8dSk9Qp+Z++i4BGQoffKip6JDWngEA+g7Z-XmQ(at)mail(dot)gmail(dot)com),
it would be useful for crafting this kind of test without the problem seen here.

> I found a Perl module (Proc::ProcessTable) that can be used to get the
> list of processes and their metadata, but it isn't in core Perl and it
> doesn't look very well maintained either, so that one's out.

Agreed, that one's out.

> Another option might be to wait on the kernel -- do something that would
> involve the kernel taking action on the other process, acting like a
> barrier of sorts. I don't know if this actually works, but we could
> try. Something like sending SIGSTOP first, then "kill 0" -- or just
> send SIGSTOP twice:
>
> diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
> index e065c5c008..e8f323066a 100644
> --- a/src/test/recovery/t/019_replslot_limit.pl
> +++ b/src/test/recovery/t/019_replslot_limit.pl
> @@ -346,6 +346,8 @@ $logstart = get_log_size($node_primary3);
> # freeze walsender and walreceiver. Slot will still be active, but walreceiver
> # won't get anything anymore.
> kill 'STOP', $senderpid, $receiverpid;
> +kill 'STOP', $senderpid, $receiverpid;
> +
> advance_wal($node_primary3, 2);
>
> my $max_attempts = 180;

If this fixes things for the OP, I'd like it slightly better than the "ps"
approach. It's less robust, but I like the brevity.

Another alternative might be to have walreceiver reach walsender via a proxy
Perl script. Then, make that proxy able to accept an instruction to pause
passing data until further notice. However, I like two of your options better
than this one.
