Re: Single transaction in the tablesync worker?

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: Craig Ringer <craig(dot)ringer(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Petr Jelinek <petr(dot)jelinek(at)enterprisedb(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>
Subject: Re: Single transaction in the tablesync worker?
Date: 2020-12-21 12:38:02
Message-ID: CAA4eK1JUzeDFZgLLe8O2tTeTGNoUhCfeD3Yv+YLxMDT5L-C8zg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Dec 21, 2020 at 3:17 PM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> On Mon, Dec 21, 2020 at 4:23 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> > Few other comments:
> > ==================
>
> Thanks for your feedback.
>
> > 1.
> > * FIXME 3 - Crashed tablesync workers may also have remaining slots
> > because I don't think
> > + * such workers are even iterated by this loop, and nobody else is
> > removing them.
> > + */
> > + if (slotname)
> > + {
> >
> > The above FIXME is not clear to me. Actually, the crashed workers
> > should restart, finish their work, and drop the slots. So not sure
> > what exactly this FIXME refers to?
>
> Yes, normally if the tablesync can complete it should behave like that.
> But I think there are other scenarios where it may be unable to
> clean-up after itself. For example:
>
> i) Maybe the crashed tablesync worker cannot finish. e.g. A row insert
> handled by tablesync can give a PK violation which also will crash
> again and again for each re-launched/replacement tablesync worker.
> This can be reproduced in the debugger. If the DropSubscription
> doesn't clean-up the tablesync's slot then nobody will.
>
> ii) Also DROP SUBSCRIPTION code has locking (see code commit) "to
> ensure that the launcher doesn't restart new worker during dropping
> the subscription".
>

Yeah, I have also read that comment but do you know how it is
preventing relaunch? How does the subscription lock help?

> So executing DROP SUBSCRIPTION will prevent a newly
> crashed tablesync from re-launching, so it won’t be able to take care
> of its own slot. If the DropSubscription doesn't clean-up that
> tablesync's slot then nobody will.
>

> >
> > 2.
> > DropSubscription()
> > {
> > ..
> > ReplicationSlotDropAtPubNode(
> > + NULL,
> > + conninfo, /* use conninfo to make a new connection. */
> > + subname,
> > + syncslotname);
> > ..
> > }
> >
> > With the above call, it will form a connection with the publisher and
> > drop the required slots. I think we need to save the connection info
> > so that we don't need to connect/disconnect for each slot to be
> > dropped. Later in this function, we again connect and drop the apply
> > worker slot. I think we should connect just once drop the apply and
> > table sync slots if any.
>
> OK. IIUC this is a suggestion for more efficient connection usage,
> rather than actual bug right?
>

Yes, it is for effective connection usage.

> I have added this suggestion to my TODO
> list.
>
> >
> > 3.
> > ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn_given, char
> > *conninfo, char *subname, char *slotname)
> > {
> > ..
> > + PG_TRY();
> > ..
> > + PG_CATCH();
> > + {
> > + /* NOP. Just gobble any ERROR. */
> > + }
> > + PG_END_TRY();
> >
> > Why are we suppressing the error instead of handling it the error in
> > the same way as we do while dropping the apply worker slot in
> > DropSubscription?
>
> This function is common - it is also called from the tablesync
> finish_sync_worker. But in the finish_sync_worker case I wanted to
> avoid throwing an ERROR which would cause the tablesync to crash and
> relaunch (and crash/relaunch/repeat...) when all it was trying to do
> in the first place was just cleanup and exit the process. Perhaps the
> error suppression should be conditional depending where this function
> is called from?
>

Yeah, that could be one way and if you follow my previous suggestion
this function might change a bit more.

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2020-12-21 13:25:11 Re: [Patch] Optimize dropping of relation buffers using dlist
Previous Message Andrew Gierth 2020-12-21 12:02:16 Incorrect assertion in ExecBuildAggTrans ?