From: Chris Travers <chris(dot)travers(at)adjust(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Fix for infinite signal loop in parallel scan
Date: 2018-09-17 13:15:11
Message-ID: CAN-RpxBYbXho+e7JgAdi0CbUeGObUpgwN2+CGU-Lzwmc78AEcg@mail.gmail.com
Lists: pgsql-hackers
On Mon, Sep 17, 2018 at 2:59 PM Oleksii Kliukin <alexk(at)hintbits(dot)com> wrote:
>
> > On 7. Sep 2018, at 17:57, Chris Travers <chris(dot)travers(at)adjust(dot)com> wrote:
> >
> > Hi;
> >
> > Attached is the patch we are fully testing at Adjust. There are a few
> > non-obvious aspects of the code around where the patch hits. I have run
> > make check on Linux and MacOS, and make check-world on Linux (check-world
> > fails on MacOS on all versions and all branches due to ecpg failures).
> > Automated tests are difficult because this is a rare race condition that
> > is difficult (or possibly impossible) to trigger automatically. Our
> > current approach to testing is to force the issue using a debugger. Any
> > ideas on how to reproduce this automatically are appreciated, but even on
> > our heavily loaded systems only a very small portion of queries hit this
> > case (the issue happens about once a week for us).
>
> I did some manual testing on it, putting breakpoints in
> ResolveRecoveryConflictWithLock in the startup process on a streaming
> replica (configured with a very low max_standby_streaming_delay, i.e.
> 100ms) and at the posix_fallocate call in a normal backend on the same
> replica. At this point I also instructed gdb not to stop upon receiving
> SIGUSR1 (handle SIGUSR1 nonstop) and resumed execution on both the backend
> and the startup process.
>
> Then I simulated a conflict by creating a rather big (3GB) table on the
> master, doing some updates on it, and then running an aggregate on the
> replica backend (i.e. 'select count(1) from test' with 'force_parallel_mode
> = true') where I set the breakpoint. The aggregate and force_parallel_mode
> ensured that the query was executed as a parallel one, leading to
> allocation of the DSM.
>
> Once the backend reached the posix_fallocate call and was waiting on the
> breakpoint, I called 'vacuum full test' on the master, which led to a
> conflict on the replica running 'select from test' (in the vast majority of
> cases), triggering the breakpoint in ResolveRecoveryConflictWithLock in the
> startup process, since the startup process tried to cancel the conflicting
> backend. At that point I continued execution of the startup process (which
> would loop in CancelVirtualTransaction, sending SIGUSR1 to the backend
> while the backend waited to be resumed from the breakpoint). Right after
> that, I changed the size parameter on the backend to something that would
> make posix_fallocate run for a bit longer, typically ten to a hundred MB.
> Once the backend process was resumed, it started receiving SIGUSR1 from the
> startup process, resulting in posix_fallocate exiting with EINTR.
>
> With the patch applied, the posix_fallocate loop terminated right away
> (because the QueryCancelPending flag was set to true) and the backend went
> through the cleanup, showing an ERROR about cancellation due to the
> conflict with recovery. Without the patch, it looped indefinitely in
> dsm_impl_posix_resize, while the startup process looped forever, trying to
> send SIGUSR1.
>
> One thing I'm wondering is whether we could do the same by just blocking
> SIGUSR1 for the duration of posix_fallocate?
>
Many thanks!
If we were to do that, I would say we should mask all signals we can mask
during the call.
I don't have a problem going down that road instead if people think it is
better.
As a note, we have a patched version of PostgreSQL 10.5 ready to deploy to
a system affected by this and expect that to be done this week.
>
> Cheers,
> Oleksii Kliukin
--
Best Regards,
Chris Travers
Head of Database
Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin