Re: Single transaction in the tablesync worker?

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Single transaction in the tablesync worker?
Date: 2021-01-28 07:01:58
Message-ID: CAHut+PsNMHutjM2WnHe1VooHVqxtPsL6igemPWy2BOUK7OeAVw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jan 27, 2021 at 2:53 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Sat, Jan 23, 2021 at 5:56 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Sat, Jan 23, 2021 at 4:55 AM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
> > >
> > > PSA the v18 patch for the Tablesync Solution1.
> >
> > 7. Have you tested with the new patch the scenario where we crash
> > after FINISHEDCOPY and before SYNCDONE, is it able to pick up the
> > replication using the new temporary slot? Here, we need to test the
> > case where during the catchup phase we have received few commits and
> > then the tablesync worker is crashed/errored out? Basically, check if
> > the replication is continued from the same point?
> >
>
> I have tested this and it didn't work, see the below example.
>
> Publisher-side
> ================
> CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
>
> BEGIN;
> INSERT INTO mytbl1(somedata, text) VALUES (1, 1);
> INSERT INTO mytbl1(somedata, text) VALUES (1, 2);
> COMMIT;
>
> CREATE PUBLICATION mypublication FOR TABLE mytbl1;
>
> Subscriber-side
> ================
> - Have a while(1) loop in LogicalRepSyncTableStart so that tablesync
> worker stops.
>
> CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
>
>
> CREATE SUBSCRIPTION mysub
> CONNECTION 'host=localhost port=5432 dbname=postgres'
> PUBLICATION mypublication;
>
> During debug, stop after we mark FINISHEDCOPY state.
>
> Publisher-side
> ================
> INSERT INTO mytbl1(somedata, text) VALUES (1, 3);
> INSERT INTO mytbl1(somedata, text) VALUES (1, 4);
>
>
> Subscriber-side
> ================
> - Have a breakpoint in apply_dispatch
> - continue in debugger;
> - After we replay first commit (which will be for values(1,3), note
> down the origin position in apply_handle_commit_internal and somehow
> error out. I have forced the debugger to set to the last line in
> apply_dispatch where the error is raised.
> - After the error, again the tablesync worker is restarted and it
> starts from the position noted in the previous step
> - It exits without replaying the WAL for (1,4)
>
> So, on the subscriber-side, you will see 3 records. Fourth is missing.
> Now, if you insert more records on the publisher, it will anyway
> replay those but the fourth one got missing.
>
> The temporary slots didn't seem to work because we created again the
> new temporary slot after the crash and ask it to start decoding from
> the point we noted in origin_lsn. The publisher didn’t hold the
> required WAL as our slot was temporary so it started sending from some
> later point. We retain WAL based on the slots restart_lsn position and
> wal_keep_size. For our case, the positions of the slots will matter
> and as we have created temporary slots, there is no way for a
> publisher to save that WAL.
>
> In this particular case, even if the WAL would have been there we only
> pass the start_decoding_at position but didn’t pass restart_lsn, so it
> picked a random location (current insert position in WAL) which is
> ahead of start_decoding_at point so it never sent the required fourth
> record. Now, I don’t think it will work even if somehow sent the
> correct restart_lsn because of what I wrote earlier that there is no
> guarantee that the earlier WAL would have been saved.
>
> At this point, I can't think of any way to fix this problem except for
> going back to the previous approach of permanent slots but let me know
> if you have any ideas to salvage this approach?
>

OK. The latest patch [v21] now restores the permanent slot (and slot
cleanup) approach as it was implemented in an earlier version [v17].
Please note that this change also re-introduces some potential slot
cleanup problems for some race scenarios. These will be addressed by
future patches.

----
[v17] https://www.postgresql.org/message-id/CAHut%2BPt9%2Bg8qQR0kMC85nY-O4uDQxXboamZAYhHbvkebzC9fAQ%40mail.gmail.com
[v21] https://www.postgresql.org/message-id/CAHut%2BPvzHRRA_5O0R8KZCb1tVe1mBVPxFtmttXJnmuOmAegoWA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2021-01-28 07:10:53 Re: Wrong usage of RelationNeedsWAL
Previous Message Noah Misch 2021-01-28 06:55:08 Re: pg_upgrade fails with non-standard ACL