Re: Single transaction in the tablesync worker?

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Craig Ringer <craig(dot)ringer(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Petr Jelinek <petr(dot)jelinek(at)enterprisedb(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>
Subject: Re: Single transaction in the tablesync worker?
Date: 2020-12-30 06:15:17
Message-ID: CAHut+Pu4qmstyCYWjcCwBXamq4WP+wGd0BUXL8OCTZBD=YAXVQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Dec 23, 2020 at 9:07 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, Dec 22, 2020 at 4:58 PM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
> >
> > On Mon, Dec 21, 2020 at 11:36 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Mon, Dec 21, 2020 at 3:17 PM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
> > > >
> > > > On Mon, Dec 21, 2020 at 4:23 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > > Few other comments:
> > > > > ==================
> > > >
> > > > Thanks for your feedback.
> > > >
> > > > > 1.
> > > > > * FIXME 3 - Crashed tablesync workers may also have remaining slots
> > > > > because I don't think
> > > > > + * such workers are even iterated by this loop, and nobody else is
> > > > > removing them.
> > > > > + */
> > > > > + if (slotname)
> > > > > + {
> > > > >
> > > > > The above FIXME is not clear to me. Actually, the crashed workers
> > > > > should restart, finish their work, and drop the slots. So not sure
> > > > > what exactly this FIXME refers to?
> > > >
> > > > Yes, normally if the tablesync can complete it should behave like that.
> > > > But I think there are other scenarios where it may be unable to
> > > > clean-up after itself. For example:
> > > >
> > > > i) Maybe the crashed tablesync worker cannot finish. e.g. A row insert
> > > > handled by tablesync can give a PK violation which also will crash
> > > > again and again for each re-launched/replacement tablesync worker.
> > > > This can be reproduced in the debugger. If the DropSubscription
> > > > doesn't clean-up the tablesync's slot then nobody will.
> > > >
> > > > ii) Also DROP SUBSCRIPTION code has locking (see code commit) "to
> > > > ensure that the launcher doesn't restart new worker during dropping
> > > > the subscription".
> > > >
> > >
> > > Yeah, I have also read that comment but do you know how it is
> > > preventing relaunch? How does the subscription lock help?
> >
> > Hmmm. I did see there is a matching lock in get_subscription_list of
> > launcher.c, which may be what that code comment was referring to. But
> > I also am currently unsure how this lock prevents anybody (e.g.
> > process_syncing_tables_for_apply) from executing another
> > logicalrep_worker_launch.
> >
>
> process_syncing_tables_for_apply will be called by the apply worker
> and we are stopping the apply worker. So, after that launcher won't
> start a new apply worker because of get_subscription_list() and if the
> apply worker is not started then it won't be able to start tablesync
> worker. So, we need the handling of crashed tablesync workers here
> such that we need to drop any new sync slots.

Yes, in the v6 patch code this was a problem in need of handling. But
since the v7 patch the DropSubscription code is now using a separate
GetSubscriptionNotReadyRelations loop to handle the cleanup of
potentially leftover slots from crashed tablesync workers (i.e.
workers that never got to a READY state).

Kind Regards,
Peter Smith.
Fujitsu Australia

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2020-12-30 06:21:20 Re: Single transaction in the tablesync worker?
Previous Message Bharath Rupireddy 2020-12-30 06:10:29 Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit