Re: Funny hang on PostgreSQL 10 during parallel index scan on slave

From: Chris Travers <chris(dot)travers(at)adjust(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Funny hang on PostgreSQL 10 during parallel index scan on slave
Date: 2018-09-06 11:31:43
Message-ID: CAN-RpxCkx=Hd8ETKCtxhm03LeXrzJ=LQgvwCVQ4o_L9raW0bKA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Sep 6, 2018 at 11:08 AM Chris Travers <chris(dot)travers(at)adjust(dot)com>
wrote:

> Ok, so here's my current patch (work in progress). I have not run tests
> yet, and finding a way to add a test case is virtually impossible though I
> expect we will find ways of using gdb to confirm local fix later.
>
> After careful code review, I settled on the following approach which
> seemed to be the least intrusive.
>
> 1. Change the retry logic so that it does not retry of QueryCancelPending
> or ProcDiePending are set. In these cases EINTR is passed back to the
> caller
> 2. Change the error handling logic of the caller so that EINTR is handled
> by the next CHECK_FOR_INTERRUPTS() after cleanup.
>
> This means that the file descriptor is now already closed and that we
> handle this the same way we would with a ENOSPC. The reason for this is
> that there are many places in the code where it is not clear what degree of
> safety is present for throwing errors using ereport, and the patch needs to
> be as unobtrusive as possible to facilitate back porting.
>
> At this point the patch is for discussion. I have not even compiled it or
> tested it but maybe there is something wrong with my approach so figured I
> would send it out early.
>
> The major assumptions are:
> 1. close() will never take longer than the interrupt interval that caused
> the loop to break.
> 2. close() does not get interrupted in a way that will not cause cleanup
> problems later.
> 3. We will reach the next interrupt check at a reasonable interval and
> safe spot.
>
> Any initial arguments against?
>

The previous patch had two issues found on internal code review here. I am
sending a new patch.

This patch compiles. make check passes
make check-world fails but the errors are the same as found on master and
involve ecpg.

We will be doing further internal review and verification and then will run
through pg_indent before final submission.

Changes in this patch:
1. Previous patch had backwards check for EINTR in calling function. This
was fixed.
2. In cases where ERROR elevel was passed in, function would return
instead of error out on case of error.

So in this case we check if we expect to error on error and if so, check
for interrupts. If not, we go through the normal error handling/logging
path which might result in an warning that shared memory segment could not
be allocated followed by an actual meaningful error. I could put that in
an else if, if that is seen as a good idea, but figured I would raise it as
a discussion point.

>
> --
> Best Regards,
> Chris Travers
> Head of Database
>
> Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
> Saarbrücker Straße 37a, 10405 Berlin
>
>

--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin

Attachment Content-Type Size
racecondition.patch application/octet-stream 1.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2018-09-06 11:40:41 Re: [HACKERS] Bug in to_timestamp().
Previous Message Kyotaro HORIGUCHI 2018-09-06 11:17:02 Re: [HACKERS] PATCH: Keep one postmaster monitoring pipe per process