RE: Truncate in synchronous logical replication failed

From: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>
To: 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>
Cc: Ajin Cherian <itsajin(at)gmail(dot)com>, Japin Li <japinli(at)hotmail(dot)com>, Petr Jelinek <petr(dot)jelinek(at)enterprisedb(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Truncate in synchronous logical replication failed
Date: 2021-04-23 06:34:12
Message-ID: OSBPR01MB4888C84371882A4E001E3A44ED459@OSBPR01MB4888.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thursday, April 22, 2021 6:33 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Tue, Apr 20, 2021 at 9:00 AM osumi(dot)takamichi(at)fujitsu(dot)com
> <osumi(dot)takamichi(at)fujitsu(dot)com> wrote:
> >
> > On Tuesday, April 20, 2021 10:53 AM Ajin Cherian <itsajin(at)gmail(dot)com>
> wrote:
> > >
> > > I reviewed the patch, ran make check, no issues. One minor comment:
> > >
> > > Could you add the comment similar to RelationGetIndexAttrBitmap() on
> > > why the redo, it's not very obvious to someone reading the code, why
> > > we are refetching the index list here.
> > >
> > > + /* Check if we need to redo */
> > >
> > > + newindexoidlist = RelationGetIndexList(relation);
> > Yeah, makes sense. Fixed.
>
> I don't think here we need to restart to get a stable list of indexes as we do in
> RelationGetIndexAttrBitmap. The reason is here we build the cache entry
> using a historic snapshot and all the later changes are absorbed while
> decoding WAL.
I rechecked this and I agree with this.
I don't see any problem to remove the redo check.
Based on this direction, I'll share my several minor comments.

[1] a typo of RelationGetIdentityKeyBitmap's comment

+ * This is a special purpose function used during logical replication. Here,
+ * unlike RelationGetIndexAttrBitmap(), we don't a acquire lock on the required

We have "a" in an unnatural place. It should be "we don't acquire...".

[2] suggestion to fix RelationGetIdentityKeyBitmap's comment

+ * later changes are absorbed while decoding WAL. Due to this reason, we don't
+ * need to retry here in case of a change in the set of indexes.

I think it's better to use "Because of" instead of "Due to".
This patch works to solve the deadlock.

[3] wrong comment in RelationGetIdentityKeyBitmap

+ /* Save some values to compare after building attributes */

I wrote this comment for the redo check
that has been removed already. We can delete it.

[4] suggestion to remove local variable relreplindex in RelationGetIdentityKeyBitmap

I think we don't need relreplindex.
We can just pass relaton->rd_replidindex to RelationIdGetRelation().
There is no more reference of the variable.

[5] suggestion to fix the place to release indexoidlist

I thought we can do its list_free() ealier,
after checking if there is no indexes.

[6] suggestion to remove period in one comment.

+ /* Quick exit if we already computed the result. */

This comes from RelationGetIndexAttrBitmap (and my posted versions)
but I think we can remove the period to give better alignment shared with other comments in the function.

> I have updated that and modified few comments in the
> attached patch. Can you please test this in clobber_cache_always mode? I
> think just testing subscription/t/010_truncate.pl would be sufficient.
I did it. It didn't fail. No problem.

> Now, this bug exists in prior versions (>= v11) where we have introduced
> decoding of Truncate. Do we want to backpatch this? As no user has reported
> this till now apart from Tang who I think got it while doing some other patch
> testing, we might want to just push it in HEAD. I am fine either way. Petr,
> others, do you have any opinion on this matter?
I think we are fine with applying this patch to the HEAD only,
since nobody has complained about the hang issue.

Best Regards,
Takamichi Osumi

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message wangyukun@fujitsu.com 2021-04-23 06:42:15 fix a comment
Previous Message tsunakawa.takay@fujitsu.com 2021-04-23 06:27:25 RE: A test for replay of regression tests