Re: Single transaction in the tablesync worker?

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Ajin Cherian <itsajin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Single transaction in the tablesync worker?
Date: 2021-02-05 01:39:22
Message-ID: CAHut+PuxESrD8Yd=U+_3=fLkpHGk-EvAHpMb_-H_ZAqk45jv7A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 4, 2021 at 8:33 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
...

> Thanks. I have fixed one of the issues reported by me earlier [1]
> wherein the tablesync worker can repeatedly fail if after dropping the
> slot there is an error while updating the SYNCDONE state in the
> database. I have moved the drop of the slot just before commit of the
> transaction where we are marking the state as SYNCDONE. Additionally,
> I have removed unnecessary includes in tablesync.c, updated the docs
> for Alter Subscription, and updated the comments at various places in
> the patch. I have also updated the commit message this time.
>

Below are my feedback comments for V17 (nothing functional)

~~

1.
V27 Commit message:
For the initial table data synchronization in logical replication, we use
a single transaction to copy the entire table and then synchronizes the
position in the stream with the main apply worker.

Typo:
"synchronizes" -> "synchronize"

~~

2.
@@ -48,6 +48,23 @@ ALTER SUBSCRIPTION <replaceable
class="parameter">name</replaceable> RENAME TO <
(Currently, all subscription owners must be superusers, so the owner checks
will be bypassed in practice. But this might change in the future.)
</para>
+
+ <para>
+ When refreshing a publication we remove the relations that are no longer
+ part of the publication and we also remove the tablesync slots if there are
+ any. It is necessary to remove tablesync slots so that the resources
+ allocated for the subscription on the remote host are released. If due to
+ network breakdown or some other error, we are not able to remove the slots,
+ we give WARNING and the user needs to manually remove such slots later as
+ otherwise, they will continue to reserve WAL and might eventually cause
+ the disk to fill up. See also <xref
linkend="logical-replication-subscription-slot"/>.
+ </para>

I think the content is good, but the 1st-person wording seemed strange.
e.g.
"we are not able to remove the slots, we give WARNING and the user needs..."
Maybe it should be like:
"... PostgreSQL is unable to remove the slots, so a WARNING is
reported. The user needs... "

~~

3.
@@ -566,107 +569,197 @@ AlterSubscription_refresh(Subscription *sub,
bool copy_data)
...
+ * XXX If there is a network break down while dropping the

"network break down" -> "network breakdown"

~~

4.
@@ -872,7 +970,48 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
(errmsg("could not connect to the publisher: %s", err)));
...
+ * XXX We could also instead try to drop the slot, last time we failed
+ * but for that, we might need to clean up the copy state as it might
+ * be in the middle of fetching the rows. Also, if there is a network
+ * break down then it wouldn't have succeeded so trying it next time
+ * seems like a better bet.

"network break down" -> "network breakdown"

~~

5.
@@ -269,26 +313,47 @@ invalidate_syncing_table_states(Datum arg, int
cacheid, uint32 hashvalue)
...
+
+ /*
+ * Cleanup the tablesync slot.
+ *
+ * This has to be done after updating the state because otherwise if
+ * there is an error while doing the database operations we won't be
+ * able to rollback dropped slot.
+ */
+ ReplicationSlotNameForTablesync(MyLogicalRepWorker->subid,
+ MyLogicalRepWorker->relid,
+ syncslotname);
+
+ ReplicationSlotDropAtPubNode(wrconn, syncslotname, false /* missing_ok */);
+

Should this comment also describe why the missing_ok is false for this case?

----
Kind Regards,
Peter Smith.
Fujitsu Australia

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ian Lawrence Barwick 2021-02-05 01:40:12 psql tab completion for CREATE DATABASE ... LOCALE
Previous Message Bharath Rupireddy 2021-02-05 00:52:30 Re: Is Recovery actually paused?