Re: Single transaction in the tablesync worker?

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Single transaction in the tablesync worker?
Date: 2021-02-01 04:44:29
Message-ID: CAA4eK1L4Mhgi-UAYmgGi-P58hdVyWs0t0e05qhvTwGy58OJpMQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Feb 1, 2021 at 9:38 AM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> On Mon, Feb 1, 2021 at 1:54 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Mon, Feb 1, 2021 at 6:48 AM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
> > >
> > > On Sun, Jan 31, 2021 at 12:19 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > I have made the below changes in the patch. Let me know what you think
> > > > about these?
> > > > 1. It was a bit difficult to understand the code in DropSubscription
> > > > so I have rearranged the code to match the way we are doing in HEAD
> > > > where we drop the slots at the end after finishing all the other
> > > > cleanup.
> > >
> > > There was a reason why the v22 logic was different from HEAD.
> > >
> > > The broken connection leaves dangling slots which is unavoidable.
> > >
> >
> > I think this is true only when the user specifically requested it by
> > the use of "ALTER SUBSCRIPTION ... SET (slot_name = NONE)", right?
> > Otherwise, we give an error on a broken connection. Also, if that is
> > true then is there a reason to pass missing_ok as true while dropping
> > tablesync slots?
> >
>
> AFAIK there is always a potential race with DropSubscription dropping
> slots. The DropSubscription might be running at exactly the same time
> the apply worker has just dropped the very same tablesync slot.
>

We stopped the workers before getting a list of NotReady relations and
then we try to drop the corresponding slots. So, how such a race
condition can happen? Note, because we have a lock on pg_subscrition,
there is no chance that the workers can restart till the transaction
end.

> By
> saying missing_ok = true it means DropSubscription would not give
> ERROR in such a case, so at least the DROP SUBSCRIPTION would not fail
> with an unexpected error.
>
> >
> > > But,
> > > whereas the user knows the name of the Subscription slot (they named
> > > it), there is no easy way for them to know the names of the remaining
> > > tablesync slots unless we log them.
> > >
> > > That is why the v22 code was written to process the tablesync slots
> > > even for wrconn == NULL so their name could be logged:
> > > elog(WARNING, "no connection; cannot drop tablesync slot \"%s\".",
> > > syncslotname);
> > >
> > > The v23 patch removed this dangling slot name information, so it makes
> > > it difficult for the user to know what tablesync slots to cleanup.
> > >
> >
> > Okay, then can we think of combining with the existing error of the
> > replication slot? I think that might produce a very long message, so
> > another idea could be to LOG a separate WARNING for each such slot
> > just before giving the error.
>
> There may be many subscribed tables so I agree combining to one
> message might be too long. Yes, we can add another loop to output the
> necessary information. But, isn’t logging each tablesync slot WARNING
> before the subscription slot ERROR exactly the behaviour which already
> existed in v22. IIUC the DropSubscription refactoring in V23 was not
> done for any functional change, but was intended only to make the code
> simpler, but how is that goal achieved if v23 ends up needing 3 loops
> where v22 only needed 1 loop to do the same thing?
>

No, there is a functionality change as well. The way we have code in
v22 can easily lead to a problem when we have dropped the slots but
get an error while removing origins or an entry from subscription rel.
In such cases, we won't be able to rollback the drop of slots but the
other database operations will be rolled back. This is the reason we
have to drop the slots at the end. We need to ensure the same thing
for AlterSubscription_refresh. Does this make sense now?

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Hou, Zhijie 2021-02-01 05:02:18 RE: Determine parallel-safety of partition relations for Inserts
Previous Message Peter Smith 2021-02-01 04:08:13 Re: Single transaction in the tablesync worker?