Re: Funny hang on PostgreSQL 10 during parallel index scan on slave

From: Chris Travers <chris(dot)travers(at)adjust(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Funny hang on PostgreSQL 10 during parallel index scan on slave
Date: 2018-09-05 17:12:49
Message-ID: CAN-RpxB4iVAkGFowRSh=Sj8ShYHJE7nmbpT=Z4iKO7JKZgQi5A@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 5, 2018 at 6:55 PM Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>
> On 2018-09-05 18:48:44 +0200, Chris Travers wrote:
> > Will submit a patch here shortly. Thanks! Should we do for master and
> > 10? Or 9.6 too?
>
> Please don't top-post on this list. This needs to be done in all
> branches where the posix_fallocate call is present.
>
> > > Yep, maybe we should check for signals there.
> > >
> > > On Wed, Sep 5, 2018 at 5:27 PM Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> > >
> > >> On Wed, Sep 5, 2018 at 8:23 AM Chris Travers <chris(dot)travers(at)adjust(dot)com> wrote:
> > >> > 1. The query is in a parallel index scan or similar.
> > >> > 2. A process is executing a parallel plan and allocating a significant
> > >> > chunk of memory (2MB for example) in dynamic shared memory.
> > >> > 3. The startup process goes into a loop where it sends a SIGUSR1,
> > >> > sleeps 5ms, and sends another SIGUSR1, etc.
> > >> > 4. The SIGUSR1 aborts the system call, which is then retried.
> > >> > 5. Because the system call takes more than 5ms, we end up in an
> > >> > endless loop.
>
> What you're presumably encountering here is a recovery conflict.
>

Agreed, but the question is how to correct what is a fairly interesting race
condition.

>
>
> > On Wed, Sep 5, 2018 at 6:40 PM Chris Travers <chris(dot)travers(at)adjust(dot)com>
> > wrote:
> > >> Do you mean this loop in dsm_impl_posix_resize() is getting
> > >> interrupted constantly and never completing?
> > >>
> > >> /* We may get interrupted, if so just retry. */
> > >> do
> > >> {
> > >>     rc = posix_fallocate(fd, 0, size);
> > >> } while (rc == EINTR);
> > >>
>
> Probably worthwhile to check that the dsm code is properly robust if
> errors are thrown from within here.
>

Will check that too. Thanks!
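
For what it's worth, here is a rough sketch of the shape of fix I have in
mind. It is not the patch: interrupt_pending, retry_unless_interrupted and
the two fake operations are made-up names for illustration, and the flag
merely stands in for whatever pending-interrupt state the backend would
actually consult. The idea is to keep retrying on EINTR only while no
interrupt is waiting to be serviced:

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical stand-in for the backend's "an interrupt is pending" state. */
static volatile sig_atomic_t interrupt_pending = 0;

/* An operation that returns 0 or an errno value, like posix_fallocate(). */
typedef int (*resize_op) (void *arg);

/*
 * Retry op while it reports EINTR, but stop as soon as an interrupt is
 * pending so that the caller can service it instead of looping forever.
 * Returns 0 on success, EINTR if we gave up, or any other errno value.
 */
static int
retry_unless_interrupted(resize_op op, void *arg)
{
    int rc;

    assert(op != NULL);
    do
    {
        rc = op(arg);
    } while (rc == EINTR && !interrupt_pending);

    return rc;
}

/*
 * Illustrative fake: every call is "interrupted". On the third call it also
 * sets the flag, simulating a signal handler noting a pending interrupt.
 */
static int
always_eintr(void *arg)
{
    int *calls = arg;

    (*calls)++;
    if (*calls >= 3)
        interrupt_pending = 1;
    return EINTR;
}

/* Illustrative fake: interrupted twice, then succeeds on the third call. */
static int
eintr_twice_then_ok(void *arg)
{
    int *calls = arg;

    (*calls)++;
    return (*calls < 3) ? EINTR : 0;
}
```

With this shape the caller gets EINTR back and can process the interrupt
(or raise an error) rather than spinning, which is also why the surrounding
dsm code has to cope with an error at this point.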

>
>
> Greetings,
>
> Andres Freund
>

--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin
