Re: Single transaction in the tablesync worker?

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Single transaction in the tablesync worker?
Date: 2021-01-23 12:26:44
Message-ID: CAA4eK1JhpuwujrV6ABMmZ3jXfW37ssZnJ3fikrY7rRdvoEmu_g@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jan 23, 2021 at 4:55 AM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> PSA the v18 patch for the Tablesync Solution1.
>

Few comments:
=============
1.
- * So the state progression is always: INIT -> DATASYNC -> SYNCWAIT ->
- * CATCHUP -> SYNCDONE -> READY.
+ * So the state progression is always: INIT -> DATASYNC ->
+ * (sync worker FINISHEDCOPY) -> SYNCWAIT -> CATCHUP -> SYNCDONE -> READY.

I don't think we need to be specific here that the sync worker sets the
FINISHEDCOPY state.

2.
@@ -98,11 +102,16 @@
#include "miscadmin.h"
#include "parser/parse_relation.h"
#include "pgstat.h"
+#include "postmaster/interrupt.h"
#include "replication/logicallauncher.h"
#include "replication/logicalrelation.h"
+#include "replication/logicalworker.h"
#include "replication/walreceiver.h"
#include "replication/worker_internal.h"
+#include "replication/slot.h"

I don't think the above includes are required. They seem to be remnants
of the previous approach.

3.
process_syncing_tables_for_sync(XLogRecPtr current_lsn)
{
- Assert(IsTransactionState());
+ bool sync_done = false;

SpinLockAcquire(&MyLogicalRepWorker->relmutex);
+ sync_done = MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
+ current_lsn >= MyLogicalRepWorker->relstate_lsn;
+ SpinLockRelease(&MyLogicalRepWorker->relmutex);

- if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
- current_lsn >= MyLogicalRepWorker->relstate_lsn)
+ if (sync_done)
{
TimeLineID tli;

+ /*
+ * Change state to SYNCDONE.
+ */
+ SpinLockAcquire(&MyLogicalRepWorker->relmutex);

Why do we need these changes? If you have done them for code
readability, then we can consider them as a separate patch because I
don't see why they are required w.r.t. this patch.

4.
- /*
- * To build a slot name for the sync work, we are limited to NAMEDATALEN -
- * 1 characters. We cut the original slot name to NAMEDATALEN - 28 chars
- * and append _%u_sync_%u (1 + 10 + 6 + 10 + '\0'). (It's actually the
- * NAMEDATALEN on the remote that matters, but this scheme will also work
- * reasonably if that is different.)
- */
- StaticAssertStmt(NAMEDATALEN >= 32, "NAMEDATALEN too small"); /* for sanity */
- slotname = psprintf("%.*s_%u_sync_%u",
- NAMEDATALEN - 28,
- MySubscription->slotname,
- MySubscription->oid,
- MyLogicalRepWorker->relid);
+ /* Calculate the name of the tablesync slot. */
+ slotname = ReplicationSlotNameForTablesync(
+ MySubscription->oid,
+ MyLogicalRepWorker->relid);

What is the reason for changing the slot name calculation? If there is
any particular reason, then we can add a comment to indicate why we
can't include the subscription's slotname in this calculation.

5.
This is WAL
+ * logged for for the purpose of recovery. Locks are to prevent the
+ * replication origin from vanishing while advancing.

/for for/for

6.
+ /* Remove the tablesync's origin tracking if exists. */
+ snprintf(originname, sizeof(originname), "pg_%u_%u", subid, relid);
+ originid = replorigin_by_name(originname, true);
+ if (originid != InvalidRepOriginId)
+ {
+ elog(DEBUG1, "DropSubscription: dropping origin tracking for \"%s\"", originname);

I don't think we need this and the DEBUG1 message in
AlterSubscription_refresh. It is fine to print this information for
background workers like the apply worker, but I am not sure we need it
here. DropSubscription drops the apply worker's origin without such a
DEBUG message, so I guess we don't need it for tablesync origins
either.

7. Have you tested with the new patch the scenario where we crash
after FINISHEDCOPY and before SYNCDONE? Is it able to pick up the
replication using the new temporary slot? Here, we need to test the
case where during the catchup phase we have received a few commits and
then the tablesync worker crashes/errors out. Basically, check whether
the replication is continued from the same point. I understand that
this can only be tested by adding some logs and we might not be able
to write a test for it.

--
With Regards,
Amit Kapila.
