Re: Single transaction in the tablesync worker?

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: Craig Ringer <craig(dot)ringer(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Petr Jelinek <petr(dot)jelinek(at)enterprisedb(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>
Subject: Re: Single transaction in the tablesync worker?
Date: 2021-01-04 09:08:52
Message-ID: CAA4eK1JdEGGhW_AzfH+HGPx0pbqfdFn=C_xWFGyeH8FyHo7r9Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Dec 30, 2020 at 11:51 AM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> On Wed, Dec 23, 2020 at 8:43 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > 1.
> > + * Rarely, the DropSubscription may be issued when a tablesync still
> > + * is in SYNCDONE but not yet in READY state. If this happens then
> > + * the drop slot could fail because it is already dropped.
> > + * In this case suppress and drop slot error.
> > + *
> > + * FIXME - Is there a better way than this?
> > + */
> > + if (rstate->state != SUBREL_STATE_SYNCDONE)
> > + PG_RE_THROW();
> >
> > So, does this situation happens when we try to drop subscription after
> > the state is changed to syncdone but not syncready. If so, then can't
> > we write a function GetSubscriptionNotDoneRelations similar to
> > GetSubscriptionNotReadyRelations where we get a list of relations that
> > are not in done stage. I think this should be safe because once we are
> > here we shouldn't be allowed to start a new worker and old workers are
> > already stopped by this function.
>
> Yes, but I don't see how adding such a function is an improvement over
> the existing code:
>

The advantage is that we don't need to use try..catch to deal with
such conditions which I don't think is a good way to deal with such
cases. Also, even after using try...catch, still, we can leak the
slots because the patch drops the slot after changing the state to
syncdone and if there is any error while dropping the slot, it simply
skips it. So, it is possible that the rel state is syncdone but the
slot still exists and we get an error due to some different reason,
and then we will silently skip it again and allow the subscription to
be dropped.

I think instead what we should do is to drop the slot before we change
the rel state to syncdone. Also, if the apply workers fail to drop the
slot, it should try to again drop it after restart. In
DropSubscription, we can then check if the rel state is not SYNC or
READY, we can drop the corresponding slots.

> e.g.1. GetSubscriptionNotDoneRelations will include the READY state
> (which we don't want) just like GetSubscriptionNotReadyRelations
> includes the SYNCDONE state.
> e.g.2. Or, something like GetSubscriptionNotDoneAndNotReadyRelations
> would be an unnecessary overkill replacement for the current simple
> "if".
>

or we can probably modify the function as
GetSubscriptionRelationsNotInStates and pass it an array of states
which we don't want.

> AFAIK the code is OK as-is.
>

As described above, there are still race conditions where we can leak
slots and also this doesn't look clean.

Few other comments:
=================
1.
+ elog(LOG, "!!>> DropSubscription: dropping the tablesync slot
\"%s\".", syncslotname);
+ ReplicationSlotDropAtPubNode(wrconn, syncslotname);
+ elog(LOG, "!!>> DropSubscription: dropped the tablesync slot
\"%s\".", syncslotname);

...
...

+ elog(LOG, "!!>> finish_sync_worker: dropping the tablesync slot
\"%s\".", syncslotname);
+ ReplicationSlotDropAtPubNode(wrconn, syncslotname);
+ elog(LOG, "!!>> finish_sync_worker: dropped the tablesync slot
\"%s\".", syncslotname);

Remove these and other elogs added to aid debugging or testing. If you
need these for development purposes then move these to separate patch.

2. Remove WIP from the commit message and patch name.

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2021-01-04 09:28:30 Re: Single transaction in the tablesync worker?
Previous Message Amit Kapila 2021-01-04 08:18:05 Re: [HACKERS] logical decoding of two-phase transactions