Re: Segmentation Fault in logical decoding get/peek API

From: Jeremy Finzel <finzelj(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Segmentation Fault in logical decoding get/peek API
Date: 2019-03-04 14:28:36
Message-ID: CAMa1XUivSJ2idopJUhMi-CWundycnH4OdwWOiZ_D+12BvMLObg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Wed, Feb 27, 2019 at 8:34 AM Jeremy Finzel <finzelj(at)gmail(dot)com> wrote:

> On Thu, Feb 14, 2019 at 5:05 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> Jeremy Finzel <finzelj(at)gmail(dot)com> writes:
>> > On Thu, Feb 14, 2019 at 4:26 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> >> So my bet is that your problem was fixed by some other commit between
>> >> 10.3 and 10.6. Maybe the predecessor one, b767b3f2e; but hard to say
>> >> without more investigation than seems warranted, if the bug's gone.
>>
>> > I am willing to put in more time debugging this because we want to know
>> > which clusters are actually susceptible to the bug. Any suggestions to
>> > proceed are welcome.
>>
>> Well, as Peter said, "git bisect" and trying to reproduce the problem
>> at each step would be the way to prove it definitively. Seems mighty
>> tedious though. Possibly you could shave some time off the process
>> by assuming it must have been one of the commits that touched
>> reorderbuffer.c ... a quick check says there have been ten of those
>> in the v10 branch since 10.3.
>>
>> regards, tom lane
>
>
I have confirmed which commit fixed the segfault:

cee1dd1eeda1e7b86b78f240d24bbfde21d75928 is the first fixed commit
commit cee1dd1eeda1e7b86b78f240d24bbfde21d75928
Author: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Date: Tue Mar 6 15:57:20 2018 -0300

Refrain from duplicating data in reorderbuffers

If a walsender exits leaving data in reorderbuffers, the next walsender
that tries to decode the same transaction would append its decoded data
in the same spill files without truncating it first, which effectively
duplicate the data. Avoid that by removing any leftover reorderbuffer
spill files when a walsender starts.

Backpatch to 9.4; this bug has been there from the very beginning of
logical decoding.

Author: Craig Ringer, revised by me
Reviewed by: Álvaro Herrera, Petr Jelínek, Masahiko Sawada

I'm very curious if anyone more familiar with the code base understands how
the old code was vulnerable to a segfault, but not the new code. I don't
think in my situation a walsender was exiting at all, so I do think the
specific reason for this code change accidentally fixed a potential
segfault unrelated to an exiting walsender leaving data in reorderbuffers.

I still can't explain precisely what reproduces it. As I have said before,
some unique things about my setup are lots of INSERT ... ON CONFLICT UPDATE
queries, as well as btree gist exclusion indexes, but I suspect this has to
do with the former somehow, although I haven't been able to create a
reliably reproducing crash.

I suggest at least adding something to the 10.4 (and corollary major
version) release notes around this.

Here is the release note now:

- In logical decoding, avoid possible double processing of WAL data when
a walsender restarts (Craig Ringer)

Perhaps some addition like this:

- In logical decoding, avoid possible double processing of WAL data when
a walsender restarts (Craig Ringer)
This change also had the effect of fixing a potential segfault
vulnerability in the logical decoding peek/get_changes API in
ReorderBufferCommit

Thanks,
Jeremy

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Amit Langote 2019-03-04 15:12:25 Re: BUG #15662: Row update in foreign partition does not move row based on partition key
Previous Message Tom Lane 2019-03-04 14:27:43 Re: BUG #15666: Seemingly incorrect error reporting/logging for violation of non-null constraint