Re: Single transaction in the tablesync worker?

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Single transaction in the tablesync worker?
Date: 2021-02-02 02:59:04
Message-ID: CAHut+PutAaOUz_LwQvfnK8KaLawZHw7taVSPeV-dWu=5fkTbTA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Feb 1, 2021 at 11:26 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> I have updated the patch to display WARNING for each of the tablesync
> slots during DropSubscription. As discussed, I have moved the drop
> slot related code towards the end in AlterSubscription_refresh. Apart
> from this, I have fixed one more issue in tablesync code where in
> after catching the exception we were not clearing the transaction
> state on the publisher, see changes in LogicalRepSyncTableStart. I
> have also fixed other comments raised by you.

Here are some additional feedback comments about the v24 patch:

~~

ReportSlotConnectionError:

1,2,3,4.
+ foreach(lc, rstates)
+ {
+ SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+ Oid relid = rstate->relid;
+
+ /* Only cleanup resources of tablesync workers */
+ if (!OidIsValid(relid))
+ continue;
+
+ /*
+ * Caller needs to ensure that we have appropriate locks so that
+ * relstate doesn't change underneath us.
+ */
+ if (rstate->state != SUBREL_STATE_SYNCDONE)
+ {
+ char syncslotname[NAMEDATALEN] = { 0 };
+
+ ReplicationSlotNameForTablesync(subid, relid, syncslotname);
+ elog(WARNING, "could not drop tablesync replication slot \"%s\"",
+ syncslotname);
+
+ }
+ }

1. I wonder if "rstates" would be better named something like
"not_ready_rstates", otherwise it is not apparent what states are in
this list

2. The comment "/* Only cleanup resources of tablesync workers */" is
not quite correct because there is no cleanup happening here. Maybe
change to:
if (!OidIsValid(relid))
continue; /* not a tablesync worker */

3. Maybe the "appropriate locks" comment can say what locks are the
"appropriate" ones?

4. Spurious blank line after the elog?

~~

AlterSubscription_refresh:

5.
+ /*
+ * Drop the tablesync slot. This has to be at the end because
otherwise if there
+ * is an error while doing the database operations we won't be able to rollback
+ * dropped slot.
+ */

Maybe "Drop the tablesync slot." should say "Drop the tablesync slots
associated with removed tables."

~~

DropSubscription:

6.
+ /*
+ * Cleanup of tablesync replication origins.
+ *
+ * Any READY-state relations would already have dealt with clean-ups.
+ *
+ * Note that the state can't change because we have already stopped both
+ * the apply and tablesync workers and they can't restart because of
+ * exclusive lock on the subscription.
+ */
+ rstates = GetSubscriptionNotReadyRelations(subid);
+ foreach(lc, rstates)

I wonder if "rstates" would be better named as "not_ready_rstates",
because it is used in several places where not READY is assumed.

7.
+ {
+ if (!slotname)
+ {
+ /* be tidy */
+ list_free(rstates);
+ return;
+ }
+ else
+ {
+ ReportSlotConnectionError(rstates, subid, slotname, err);
+ }
+
+ }

Spurious blank line above?

8.
The new logic of calling the ReportSlotConnectionError seems to be
expecting that the user has encountered some connection error, and
*after* that they have assigned slot_name = NONE as a workaround. In
this scenario the code looks ok since names of any dangling tablesync
slots were being logged at the time of the error.

But I am wondering what about where the user might have set slot_name
= NONE *before* the connection is broken. In this scenario, there is
no ERROR, so if there are still (is it possible?) dangling tablesync
slots, their names are never getting logged at all. So how can the
user know what to delete?

~~

> Additionally, I have
> removed the test because it was creating the same name slot as the
> tablesync worker and tablesync worker removed the same due to new
> logic in LogicalRepSyncStart. Earlier, it was not failing because of
> the bug in that code which I have fixed in the attached.

Wasn't causing a tablesync slot clash and seeing if it could recover
the point of that test? Why not just keep, and modify the test to make
it work again? Isn't it still valuable because at least it would
execute the code through the PG_CATCH which otherwise may not get
executed by any other test?

>
> I wonder whether we should restrict creating slots with prefix pg_
> because we are internally creating slots with those names? I think
> this was a problem previously also. We already prohibit it for few
> other objects like origins, schema, etc., see the usage of
> IsReservedName.
>

Yes, we could restrict the create slot API if you really wanted to.
But, IMO it is implausible that a user could "accidentally" clash with
the internal tablesync slot name, so in practice maybe this change
would not help much but it might make it more difficult to test some
scenarios.

----
Kind Regards,
Peter Smith.
Fujitsu Australia

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2021-02-02 03:17:04 Re: New IndexAM API controlling index vacuum strategies
Previous Message Fujii Masao 2021-02-02 02:32:30 Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit