Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?
Date: 2018-03-04 21:46:08
Message-ID: CAEepm=2HmnTnCuYaWZvetRP3nsW1hMWtq+P8m+8pDh389in1wA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 5, 2018 at 4:05 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On 03/04/2018 10:27 AM, Thomas Munro wrote:
>> I can fix it with the following patch, which writes XXX out to the log
>> where it would otherwise miss a final message sent just before
>> detaching with sufficiently bad timing/memory ordering. This patch
>> isn't my proposed fix, it's just a demonstration of what's busted.
>> There could be a better way to structure things than this.
>
> I can confirm this resolves the issue for me. Before the patch, I've
> seen 112 failures in ~11500 runs. With the patch I saw 0 failures, but
> about 100 messages XXX in the log.
>
> So my conclusion is that your analysis is likely correct.

Thanks! Here are a couple of patches. I'm not sure which I prefer.
The "pessimistic" one looks simpler and is probably the way to go, but
the "optimistic" one avoids doing an extra read until it has actually
run out of data and seen mq_detached == true.

I realised that the pg_write_barrier() added to
shm_mq_detach_internal() from the earlier demonstration/hack patch was
not needed... I had a notion that SpinLockAcquire() might not include
a strong enough barrier (unlike SpinLockRelease()), but after reading
s_lock.h I think it's not needed (since you get either TAS() or a
syscall-based slow path, both expected to include a full fence). I
haven't personally tested this on a weak memory order system.

Thoughts?

--
Thomas Munro
http://www.enterprisedb.com

Attachment Content-Type Size
0001-Fix-detach-race-condition-in-shm_mq.c-optimistic.patch application/octet-stream 1.6 KB
0001-Fix-detach-race-condition-in-shm_mq.c-pessimistic.patch application/octet-stream 1.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2018-03-04 22:15:58 Re: [HACKERS] Support for Secure Transport SSL library on macOS as OpenSSL alternative
Previous Message Tomas Vondra 2018-03-04 21:38:11 Re: [HACKERS] user-defined numeric data types triggering ERROR: unsupported type