Re: [PATCH] Fix for infinite signal loop in parallel scan

From: Oleksii Kliukin <alexk(at)hintbits(dot)com>
To: Chris Travers <chris(dot)travers(at)adjust(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Fix for infinite signal loop in parallel scan
Date: 2018-09-17 12:59:21
Message-ID: 58C9F6AF-253E-4ADA-988D-83C926B608D1@hintbits.com
Lists: pgsql-hackers

> On 7. Sep 2018, at 17:57, Chris Travers <chris(dot)travers(at)adjust(dot)com> wrote:
>
> Hi;
>
> Attached is the patch we are fully testing at Adjust. There are a few non-obvious aspects of the code around where the patch hits. I have run make check on Linux and MacOS, and make check-world on Linux (check-world fails on MacOS on all versions and all branches due to ecpg failures). Automatic tests are difficult because it is due to a rare race condition which is difficult (or possibly impossible) to automatically create. Our current approach to testing is to force the issue using a debugger. Any ideas on how to reproduce automatically are appreciated but even on our heavily loaded systems this seems to be a very small portion of queries that hit this case (meaning the issue happens about once a week for us).

I did some manual testing on it, setting breakpoints on
ResolveRecoveryConflictWithLock in the startup process on a streaming replica
(configured with a very low max_standby_streaming_delay, i.e. 100ms) and on the
posix_fallocate call in a normal backend on the same replica. At this point I
also instructed gdb not to stop upon receiving SIGUSR1 (handle SIGUSR1 nostop)
and resumed execution on both the backend and the startup process.

Then I simulated a conflict by creating a rather big (3GB) table on the master,
doing some updates on it and then running an aggregate on the replica backend
(i.e. 'select count(1) from test' with 'force_parallel_mode = true') where I
set the breakpoint. The aggregate and force_parallel_mode ensured that
the query was executed as a parallel one, leading to allocation of the DSM.

Once the backend reached the posix_fallocate call and was stopped at the
breakpoint, I called 'vacuum full test' on the master, which led to a conflict
on the replica running 'select count(1) from test' (in the vast majority of
cases), triggering the breakpoint in ResolveRecoveryConflictWithLock in the
startup process, since the startup process tried to cancel the conflicting
backend. At that point I continued execution of the startup process (which
would loop in CancelVirtualTransaction, sending SIGUSR1 to the backend while
the backend waited to be resumed from the breakpoint). Right after that, I
changed the size parameter on the backend to something that would make
posix_fallocate run for a bit longer, typically ten to a hundred MB. Once the
backend process was resumed, it started receiving SIGUSR1 from the startup
process, resulting in posix_fallocate exiting with EINTR.

With the patch applied, the posix_fallocate loop terminated right away (because
the QueryCancelPending flag was set to true) and the backend went through the
cleanup, reporting an ERROR about cancelling due to the conflict with recovery.
Without the patch, it looped indefinitely in dsm_impl_posix_resize, while
the startup process was looping forever, trying to send SIGUSR1.
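
To make the difference concrete, here is a minimal sketch of the two loop
behaviours described above. It is not the patch itself: cancel_pending is a
hypothetical stand-in for QueryCancelPending, resize_with_cancel_check is a
made-up name, and fake_allocate_interrupted only simulates a posix_fallocate
call that keeps getting interrupted by signals.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* Hypothetical stand-in for the backend's QueryCancelPending flag. */
static volatile sig_atomic_t cancel_pending = 0;

/* Same shape as posix_fallocate(): returns 0 or an errno value. */
typedef int (*allocate_fn) (int fd, off_t offset, off_t len);

/*
 * Retry the allocation while it is interrupted, but stop retrying as soon
 * as a cancel request is pending.  Without the flag check, a steady stream
 * of signals would keep this loop spinning forever.
 */
static int
resize_with_cancel_check(allocate_fn allocate, int fd, off_t size)
{
    int rc;

    assert(size >= 0);
    do
    {
        rc = allocate(fd, 0, size);
    } while (rc == EINTR && !cancel_pending);

    return rc;
}

/* Simulation only: every call is "interrupted"; the cancel request
 * arrives on call number fake_cancel_after. */
static int fake_calls = 0;
static int fake_cancel_after = 0;

static int
fake_allocate_interrupted(int fd, off_t offset, off_t len)
{
    (void) fd;
    (void) offset;
    (void) len;

    fake_calls++;
    if (fake_calls >= fake_cancel_after)
        cancel_pending = 1;
    return EINTR;
}
```

With the flag check in place the caller gets EINTR back once the cancel is
pending and can proceed to cleanup and error reporting. Dropping the
"!cancel_pending" condition gives the unpatched behaviour, where the loop
only ends if a call manages to complete without being interrupted.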

One thing I’m wondering is whether we could do the same by just blocking SIGUSR1
for the duration of posix_fallocate?
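
Roughly what I have in mind, as an untested sketch using plain sigprocmask
(the function name fallocate_with_sigusr1_blocked is made up here, and real
backend code would presumably go through the existing signal mask
facilities instead):

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical helper: block SIGUSR1, run posix_fallocate(), then restore
 * the previous signal mask.  Returns 0 or an errno value, like
 * posix_fallocate() itself.
 */
static int
fallocate_with_sigusr1_blocked(int fd, off_t size)
{
    sigset_t block;
    sigset_t saved;
    int rc;

    assert(size > 0);

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);

    /* Remember the old mask so it can be restored exactly. */
    if (sigprocmask(SIG_BLOCK, &block, &saved) != 0)
        return errno;

    /* SIGUSR1 sent now stays pending and cannot interrupt the call. */
    rc = posix_fallocate(fd, 0, size);

    /* Restoring the mask delivers any SIGUSR1 that arrived meanwhile. */
    if (sigprocmask(SIG_SETMASK, &saved, NULL) != 0 && rc == 0)
        rc = errno;

    return rc;
}
```

The trade-off is that a cancel request arriving during the allocation is
only seen once the mask is restored, so the backend would finish the
allocation first and react to the cancel afterwards.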

Cheers,
Oleksii Kliukin
