Re: BUG #14721: Assertion of synchronous replication

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, const_sunny(at)126(dot)com
Cc: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #14721: Assertion of synchronous replication
Date: 2017-07-12 12:45:48
Message-ID: 22a6cc48-0803-49d9-d1f3-dd5206f24c21@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On 06/29/2017 08:11 AM, Thomas Munro wrote:
> On Thu, Jun 29, 2017 at 4:27 PM, Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>> /*
>> * Acquiring the lock is not needed, the latch ensures proper
>> * barriers. If it looks like we're done, we must really be done,
>> * because once walsender changes the state to SYNC_REP_WAIT_COMPLETE,
>> * it will never update it again, so we can't be seeing a stale value
>> * in that case.
>> */
>
> Yeah, counting on the latch for free barriers doesn't work if you
> happen to see SYNC_REP_WAIT_COMPLETE first time through the loop, or
> if you see it after a spurious signal woke you and then it's
> immediately set to SYNC_REP_WAIT_COMPLETE. In those cases, the
> following Assert statement is making an assertion about cache
> coherency that doesn't work even on a friendly TSO system.

Ah yes, that's clearly broken.

> Can you reproduce the problem with this experimental patch applied?

I was able to reproduce this, after adding some sleeps (see attached).
It must be pretty hard to hit it normally, or we would've gotten reports
earlier. Although it only really shows with assertions enabled, so that
could also be why.

In your patch, I think the pg_read_barrier() is only necessary, if
assertions are enabled. Then again, better safe than sorry, and at least
on my x86-64 machine with gcc, it doesn't actually change the produced
machine code at all.

I committed your patch and backported to 9.5, with some additional
comments. Thanks for the report Const!

- Heikki

Attachment Content-Type Size
reproduce-assert-failure.patch text/x-diff 593 bytes

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Heikki Linnakangas 2017-07-12 13:15:10 Re: BUG #14722: Segfault in tuplesort_heap_siftup, 32 bit overflow
Previous Message Heikki Linnakangas 2017-07-12 11:32:11 Re: BUG #14654: With high statistics targets on ts_vector, unexpectedly high memory use & OOM are triggered