Re: logical decoding bug: segfault in ReorderBufferToastReplace()

From: Jeremy Schneider <schnjere(at)amazon(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>, "Drouvot, Bertrand" <bdrouvot(at)amazon(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: logical decoding bug: segfault in ReorderBufferToastReplace()
Date: 2021-06-08 18:35:57
Message-ID: 444215b4-8fb5-6a82-a534-645abafbffb4@amazon.com
Lists: pgsql-bugs pgsql-committers pgsql-hackers

On 6/4/21 23:42, Amit Kapila wrote:
> On 2021-Jun-04, Jeremy Schneider wrote:
>>> ERROR: XX000: could not open relation with OID 0
>>> LOCATION: ReorderBufferToastReplace, reorderbuffer.c:305
> Even, if this fixes the issue, I guess it is better to find why this
> happens? I think the reason why the code is giving an error is that
> after toast insertions we always expect the insert on the main table
> of toast table, but if there happens to be a case where after toast
> insertion, instead of getting the insertion on the main table we get
> an insert in some other table then you will see this error. I think
> this can happen for speculative insertions where insertions lead to a
> toast table insert, then we get a speculative abort record, and then
> insertion on some other table. The main thing is currently decoding
> code ignores speculative aborts due to which such a problem can occur.
> Now, there could be other cases where such a problem can happen but if
> my theory is correct then the patch we are discussing in the thread
> [1] should solve this problem.
>
> Jeremy, is this problem reproducible? Can we get a testcase or
> pg_waldump output of previous WAL records?
>
> [1] - https://www.postgresql.org/message-id/CAExHW5sPKF-Oovx_qZe4p5oM6Dvof7_P%2BXgsNAViug15Fm99jA%40mail.gmail.com
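
To make the sequence Amit describes concrete, here is a toy model of it. This is only a sketch of the theory, not PostgreSQL code: the record kinds, the `Rec` struct, and `replay()` are all hypothetical names invented for illustration, and the real decoder tracks far more state. It assumes toast chunks are buffered per transaction until the next heap insert consumes them, and that a speculative abort is either honored (chunks dropped) or ignored.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy record kinds; these are NOT PostgreSQL's real WAL record types. */
typedef enum { REC_TOAST_INSERT, REC_SPEC_ABORT, REC_HEAP_INSERT } RecKind;

typedef struct
{
    RecKind      kind;
    unsigned int toast_oid; /* toast relation of the target table, 0 = none;
                             * only meaningful for REC_HEAP_INSERT */
} Rec;

/*
 * Replay the records of one transaction. Returns the index of the first
 * heap insert that finds leftover toast chunks while its table has no
 * toast relation (the "could not open relation with OID 0" situation),
 * or -1 if that never happens.
 */
static int
replay(const Rec *recs, size_t n, bool honor_spec_abort)
{
    size_t pending_chunks = 0;

    for (size_t i = 0; i < n; i++)
    {
        switch (recs[i].kind)
        {
            case REC_TOAST_INSERT:
                pending_chunks++;       /* buffered for the next heap insert */
                break;
            case REC_SPEC_ABORT:
                if (honor_spec_abort)
                    pending_chunks = 0; /* discard the aborted tuple's chunks */
                break;
            case REC_HEAP_INSERT:
                if (pending_chunks > 0 && recs[i].toast_oid == 0)
                    return (int) i;     /* stale chunks, no toast relation */
                pending_chunks = 0;     /* chunks consumed by this insert */
                break;
        }
    }
    return -1;
}
```

In this model, the sequence "toast insert, speculative abort, insert into a table without a toast relation" fails when the abort is ignored and passes when it is honored, which matches the theory above.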

It's unclear to me whether we'll be able to catch the repro on the
actual production system. We keep hitting this, but only at irregular
and infrequent intervals. If we are able to catch it and walk the WAL
records then I'll post back here. FYI, Bertrand was able to reproduce
the exact error message with pretty much the same repro that's in the
other email thread linked above [1].

Separately, would there be any harm in adding the relation OID to the
error message? Personally, I just think the error message is generally
more useful if it shows the main relation OID (since we know that the
toast OID can be 0). Not a big deal though.
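
Roughly what I have in mind, as a standalone sketch: the helper name and the message wording are hypothetical, and the real code would go through elog()/ereport() in reorderbuffer.c rather than snprintf().

```c
#include <stdio.h>

/*
 * Hypothetical helper: build an error message that names both the toast
 * relation OID (which may be 0) and the main relation OID.
 */
static int
format_toast_open_error(char *buf, size_t len,
                        unsigned int toast_oid, unsigned int main_oid)
{
    return snprintf(buf, len,
                    "could not open toast relation with OID %u (main relation OID %u)",
                    toast_oid, main_oid);
}
```

With the main relation OID in the message, a toast OID of 0 still leaves enough information to tell which table was being decoded.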

-Jeremy

--
Jeremy Schneider
Database Engineer
Amazon Web Services
