Re: Single transaction in the tablesync worker?

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Single transaction in the tablesync worker?
Date: 2021-02-02 05:03:26
Message-ID: CAA4eK1JVNs02JsvdKsd+iXGxnXcPgnpkZ-Tmq554Po=fSxBWyA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Feb 2, 2021 at 8:29 AM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> On Mon, Feb 1, 2021 at 11:26 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> > I have updated the patch to display WARNING for each of the tablesync
> > slots during DropSubscription. As discussed, I have moved the drop
> > slot related code towards the end in AlterSubscription_refresh. Apart
> > from this, I have fixed one more issue in tablesync code where in
> > after catching the exception we were not clearing the transaction
> > state on the publisher, see changes in LogicalRepSyncTableStart. I
> > have also fixed other comments raised by you.
>
> Here are some additional feedback comments about the v24 patch:
>
> ~~
>
> ReportSlotConnectionError:
>
> 1,2,3,4.
> + foreach(lc, rstates)
> + {
> + SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
> + Oid relid = rstate->relid;
> +
> + /* Only cleanup resources of tablesync workers */
> + if (!OidIsValid(relid))
> + continue;
> +
> + /*
> + * Caller needs to ensure that we have appropriate locks so that
> + * relstate doesn't change underneath us.
> + */
> + if (rstate->state != SUBREL_STATE_SYNCDONE)
> + {
> + char syncslotname[NAMEDATALEN] = { 0 };
> +
> + ReplicationSlotNameForTablesync(subid, relid, syncslotname);
> + elog(WARNING, "could not drop tablesync replication slot \"%s\"",
> + syncslotname);
> +
> + }
> + }
>
> 1. I wonder if "rstates" would be better named something like
> "not_ready_rstates", otherwise it is not apparent what states are in
> this list
>

I don't know if that would be better and it is used in the same way in
the existing code. I find the current naming succinct.

> 2. The comment "/* Only cleanup resources of tablesync workers */" is
> not quite correct because there is no cleanup happening here. Maybe
> change to:
> if (!OidIsValid(relid))
> continue; /* not a tablesync worker */
>

Aren't we trying to cleanup the tablesync slots here? So, I don't see
the comment as irrelevant.

> 3. Maybe the "appropriate locks" comment can say what locks are the
> "appropriate" ones?
>
> 4. Spurious blank line after the elog?
>

Will fix both the above.

> ~~
>
> AlterSubscription_refresh:
>
> 5.
> + /*
> + * Drop the tablesync slot. This has to be at the end because
> otherwise if there
> + * is an error while doing the database operations we won't be able to rollback
> + * dropped slot.
> + */
>
> Maybe "Drop the tablesync slot." should say "Drop the tablesync slots
> associated with removed tables."
>

makes sense, will fix.

> ~~
>
> DropSubscription:
>
> 6.
> + /*
> + * Cleanup of tablesync replication origins.
> + *
> + * Any READY-state relations would already have dealt with clean-ups.
> + *
> + * Note that the state can't change because we have already stopped both
> + * the apply and tablesync workers and they can't restart because of
> + * exclusive lock on the subscription.
> + */
> + rstates = GetSubscriptionNotReadyRelations(subid);
> + foreach(lc, rstates)
>
> I wonder if "rstates" would be better named as "not_ready_rstates",
> because it is used in several places where not READY is assumed.
>

Same response as above for similar comment.

> 7.
> + {
> + if (!slotname)
> + {
> + /* be tidy */
> + list_free(rstates);
> + return;
> + }
> + else
> + {
> + ReportSlotConnectionError(rstates, subid, slotname, err);
> + }
> +
> + }
>
> Spurious blank line above?
>

Will fix.

> 8.
> The new logic of calling the ReportSlotConnectionError seems to be
> expecting that the user has encountered some connection error, and
> *after* that they have assigned slot_name = NONE as a workaround. In
> this scenario the code looks ok since names of any dangling tablesync
> slots were being logged at the time of the error.
>
> But I am wondering what about where the user might have set slot_name
> = NONE *before* the connection is broken. In this scenario, there is
> no ERROR, so if there are still (is it possible?) dangling tablesync
> slots, their names are never getting logged at all. So how can the
> user know what to delete?
>

It has been mentioned in docs that the user is responsible for
cleaning that up manually in such a case. The patch has also described
how the names are generated so that can help user to remove those.
+ These table synchronization slots have generated names:
+ <quote><literal>pg_%u_sync_%u</literal></quote> (parameters: Subscription
+ <parameter>oid</parameter>, Table <parameter>relid</parameter>)

I think if the user changes slot_name associated with the
subscription, it would be his responsibility to clean up the
previously associated slot. This is currently the case with the main
subscription slot as well. I think it won't be advisable for the user
to change slot_name unless under some rare cases where the system
might be stuck like the one for which we are giving WARNING and
providing a hint for setting the slot_name to NONE.

> ~~
>
> > Additionally, I have
> > removed the test because it was creating the same name slot as the
> > tablesync worker and tablesync worker removed the same due to new
> > logic in LogicalRepSyncStart. Earlier, it was not failing because of
> > the bug in that code which I have fixed in the attached.
>
> Wasn't causing a tablesync slot clash and seeing if it could recover
> the point of that test? Why not just keep, and modify the test to make
> it work again?
>

We can do that but my other worry was that we might want to reserve
the names for slots that start with pg_.

> Isn't it still valuable because at least it would
> execute the code through the PG_CATCH which otherwise may not get
> executed by any other test?
>

It is valuable but IIRC there was a test (in subscription/004_sync.pl)
where PK violation happens during copy which will lead to the coverage
of code in CATCH.

> >
> > I wonder whether we should restrict creating slots with prefix pg_
> > because we are internally creating slots with those names? I think
> > this was a problem previously also. We already prohibit it for few
> > other objects like origins, schema, etc., see the usage of
> > IsReservedName.
> >
>
> Yes, we could restrict the create slot API if you really wanted to.
> But, IMO it is implausible that a user could "accidentally" clash with
> the internal tablesync slot name, so in practice maybe this change
> would not help much but it might make it more difficult to test some
> scenarios.
>

Isn't the same true for origins?

--
With Regards,
Amit Kapila.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Ajin Cherian 2021-02-02 05:03:51 Re: Single transaction in the tablesync worker?
Previous Message Michael Paquier 2021-02-02 05:02:53 Re: row filtering for logical replication