Re: Trap errors from streaming child in pg_basebackup to exit early

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Trap errors from streaming child in pg_basebackup to exit early
Date: 2022-02-16 07:27:13
Message-ID: YgynUafCyIu3jIhC@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Sep 29, 2021 at 01:18:40PM +0200, Daniel Gustafsson wrote:
> So there is one mention of a background WAL receiver already in there, but it's
> pretty inconsistent as to what we call it. For now I've changed the messaging
> in this patch to say "background process", leaving making this all consistent
> for a follow-up patch.
>
> The attached fixes the above, as well as the typo mentioned off-list and is
> rebased on top of todays HEAD.

I have been looking a bit at this patch, and did some tests on Windows
to find out that this is able to catch the failure of the thread
streaming the WAL segments in pg_basebackup, avoiding a completion of
the base backup, while HEAD waits until the backup finishes. Testing
this scenario is actually simple by issuing pg_terminate_backend() on
the WAL sender that streams the WAL with START_REPLICATION, while
throttling the base backup.

Could you add a test to automate this scenario? As far as I can see,
something like the following should be stable even for Windows:
1) Run a pg_basebackup in the background with IPC::Run, using
--max-rate with a minimal value to slow down the base backup, for slow
machines. 013_crash_restart.pl does that as one example with $killme.
2) Find out the WAL sender doing START_REPLICATION in the backend, and
issue pg_terminate_backend() on it.
3) Use a variant of pump_until() on the pg_basebackup process and
check after one or more failure patterns. We should refactor this
part, actually. If this new test uses the same logic, that would make
three tests doing that with 022_crash_temp_files.pl and
013_crash_restart.pl. The CI should be fine to provide any feedback
with the test in place, though I am fine to test things also in my
box.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message John Naylor 2022-02-16 07:28:53 some aspects of our qsort might not be ideal
Previous Message Julien Rouhaud 2022-02-16 07:18:49 Re: Observability in Postgres