Re: Single transaction in the tablesync worker?

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Single transaction in the tablesync worker?
Date: 2021-01-12 12:47:09
Message-ID: CAHut+PsvDaSktxfXhkMNuh5j5K=Ze81kdBrmmcBXrKcRhPi0hA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jan 11, 2021 at 3:32 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, Jan 8, 2021 at 8:20 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Fri, Jan 8, 2021 at 7:14 AM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
> > >
> > > FYI, I was able to reproduce this case in debugger. PSA logs showing details.
> > >
> >
> > Thanks for reproducing as I was worried about exactly this case. I
> > have one question related to logs:
> >
> > ##
> > ## ALTER SUBSCRIPTION to REFRESH the publication
> >
> > ## This blocks on some latch until the tablesync worker dies, then it continues
> > ##
> >
> > Did you check which exact latch or lock blocks this?
> >
>
> I have checked this myself and the command is waiting on the drop of
> origin till the tablesync worker is finished because replorigin_drop()
> requires state->acquired_by to be 0 which will only be true once the
> tablesync worker exits. I think this is the reason you might have
> noticed that the command can't be finished until the tablesync worker
> died. So this can't be an interlock between ALTER SUBSCRIPTION ..
> REFRESH command and tablesync worker and to that end it seems you have
> below Fixme's in the patch:

I have also seen this same blocking reason before in the replorigin_drop().
However, back when I first tested/reproduced the refresh issue
[test-refresh] that
AlterSubscription_refresh was still *original* unchanged code, so at
that time it did not
have any replorigin_drop() in at all. In any case in the latest code
[v14] the AlterSubscription is
immediately stopping the workers so this question may be moot.

>
> + * FIXME - Usually this cleanup would be OK, but will not
> + * always be OK because the logicalrep_worker_stop_at_commit
> + * only "flags" the worker to be stopped in the near future
> + * but meanwhile it may still be running. In this case there
> + * could be a race between the tablesync worker and this code
> + * to see who will succeed with the tablesync drop (and the
> + * loser will ERROR).
> + *
> + * FIXME - Also, checking the state is also not guaranteed
> + * correct because state might be TCOPYDONE when we checked
> + * but has since progressed to SYNDONE
> + */
> +
> + if (state == SUBREL_STATE_TCOPYDONE)
> + {
>
> I feel this was okay for an earlier code but now we need to stop the
> tablesync workers before trying to drop the slot as we do in
> DropSubscription. Now, if we do that then that would fix the race
> conditions mentioned in Fixme but still, there are few more things I
> am worried about: (a) What if the launcher again starts the tablesync
> worker? One idea could be to acquire AccessExclusiveLock on
> SubscriptionRelationId as we do in DropSubscription which is not a
> very good idea but I can't think of any other good way. (b) the patch
> is just checking SUBREL_STATE_TCOPYDONE before dropping the
> replication slot but the slot could be created even before that (in
> SUBREL_STATE_DATASYNC state). One idea could be we can try to drop the
> slot and if we are not able to drop then we can simply continue
> assuming it didn't exist.

The code was modified in the latest patch [v14] something like as suggested.

The workers for removed tables are now immediately stopped (like
DropSubscription does). Although I did include the AccessExclusiveLock
as (a) suggested, AFAIK this was actually ineffective at preventing
the workers relaunching. Instead, I am using
logicalrep_worker_stop_at_commit to do this - testing shows it as
working ok. Please see the code and latest test logs [v14] for
details.

Also, now the Drop/AlterSubscription will only give WARNING if unable
to drop slots, a per suggestion (b). This is also tested [v14].

>
> One minor comment:
> 1.
> + SpinLockAcquire(&MyLogicalRepWorker->relmutex);
> MyLogicalRepWorker->relstate = SUBREL_STATE_SYNCDONE;
> MyLogicalRepWorker->relstate_lsn = current_lsn;
> -
>
> Spurious line removal.

Fixed in latest patch [v14]

----
[v14] = https://www.postgresql.org/message-id/CAHut%2BPsPO2vOp%2BP7U2szROMy15PKKGanhUrCYQ0ffpy9zG1V1A%40mail.gmail.com
[test-refresh] https://www.postgresql.org/message-id/CAHut%2BPv7YW7AyO_Q_nf9kzogcJcDFQNe8FBP6yXdzowMz3dY_Q%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2021-01-12 13:13:46 Re: Track replica origin progress for Rollback Prepared
Previous Message Masahiko Sawada 2021-01-12 12:40:53 Re: Key management with tests