Re: Fix for segfault in logical replication on master

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, osumi(dot)takamichi(at)fujitsu(dot)com
Cc: "akapila(at)postgresql(dot)org" <akapila(at)postgresql(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Fix for segfault in logical replication on master
Date: 2021-06-17 15:56:49
Message-ID: A4C0513E-ED3F-4F5A-8C52-813112F217FA@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Jun 17, 2021, at 6:40 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> I think such a problem won't happen because we are using historic
> snapshots in this context. We rely on that in a similar way in
> reorderbuffer.c, see ReorderBufferProcessTXN.

I think you are right, but that's the part I have trouble fully convincing myself is safe. We certainly have an historic snapshot when we call RelationGetIndexList, but that has an early exit if the relation already has fields set, and we don't know if those fields were set before or after the historic snapshot was taken. Within the context of the pluggable infrastructure, I think we're safe. The only caller of RelationGetIdentityKeyBitmap() in core is logicalrep_write_attrs(), which is only called by logicalrep_write_rel(), which is only called by send_relation_and_attrs(), which is only called by maybe_send_schema(), which is called by pgoutput_change() and pgoutput_truncate(), both being callbacks in core's logical replication plugin.

ReorderBufferProcessTXN calls SetupHistoricSnapshot before opening the relation and then calling ReorderBufferApplyChange to invoke the plugin on that opened relation, so the relation's fields could not have been setup before the snapshot was taken. Any other plugin would similarly get invoked after that same logic, so they'd be fine, too. The problem would only be if somebody called RelationGetIdentityKeyBitmap() or one of its calling functions from outside that infrastructure. Is that worth worrying about? The function comments for those mention having an historic snapshot, and the Assert will catch if code doesn't have one, but I wonder how much of a trap for the unwary that is, considering that somebody might open the relation and lookup indexes for the relation before taking an historic snapshot and calling these functions.

I thought it was cheap enough to check that the relation we open is an index, because if it is not, we'll segfault when accessing fields of the relation->rd_index struct. I wouldn't necessarily advocate doing any really expensive checks here, but a quick sanity check seemed worth the effort. If you don't want to commit that part, I'm not going to put up a huge fuss.

Since neither of you knew why I was performing that check, it is clear that my code comment was insufficient. I have added a more detailed code comment to explain the purpose of the check. I also changed the first check to use RelationIsValid(), as suggested upthread.

Attachment Content-Type Size
v2-0001-Fixing-bug-in-logical-replication.patch application/octet-stream 4.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2021-06-17 16:42:30 Re: Replication protocol doc fix
Previous Message Andrew Dunstan 2021-06-17 15:52:30 Re: Patch for bug #17056 fast default on non-plain table