Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, "Wei Wang (Fujitsu)" <wangw(dot)fnst(at)fujitsu(dot)com>, "Yu Shi (Fujitsu)" <shiy(dot)fnst(at)fujitsu(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Date: 2023-08-01 06:32:29
Message-ID: CALDaNm1-7BmJDNFnC8T8ALwcCL+8LkX_54XmcZ2Ma2ji_MkDdg@mail.gmail.com
Lists: pgsql-hackers

On Tue, 1 Aug 2023 at 09:44, Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> On Fri, Jul 28, 2023 at 5:22 PM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
> >
> > Hi Melih,
> >
> > BACKGROUND
> > ----------
> >
> > We wanted to compare the performance of the 2 different reuse-worker
> > designs when the apply worker is already busy handling other
> > replication and the tablesyncs of the test tables are occurring
> > simultaneously.
> >
> > To test this scenario, some test scripts were written (described
> > below). For comparison, the scripts were then run against builds of
> > HEAD, design #1 (v21), and design #2 (0718).
> >
> > HOW THE TEST WORKS
> > ------------------
> >
> > Overview:
> > 1. The apply worker is made to subscribe to a 'busy_tbl'.
> > 2. After the SUBSCRIPTION is created, the publisher side then loops
> > (forever) doing INSERTs into that busy_tbl.
> > 3. While the apply worker is now busy, the subscriber does an ALTER
> > SUBSCRIPTION REFRESH PUBLICATION to subscribe to all the other test
> > tables (see the SQL sketch after this list).
> > 4. We time how long it takes for all tablesyncs to complete.
> > 5. Repeat the above for different numbers of empty tables (10, 100,
> > 1000, 2000) and different numbers of sync workers (2, 4, 8, 16).
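> >
> > (A rough SQL sketch of the subscriber-side sequence for steps 1, 3
> > and 4; the subscription/publication names and connection string here
> > are placeholders, not the ones the attached scripts actually use.)
> >
> >   -- step 1: subscribe while the publication contains only busy_tbl
> >   CREATE SUBSCRIPTION test_sub
> >       CONNECTION 'host=localhost dbname=test_pub_db'
> >       PUBLICATION test_pub;
> >
> >   -- step 3: pick up the other test tables after the publisher has
> >   -- altered the publication, while the apply worker is kept busy
> >   ALTER SUBSCRIPTION test_sub REFRESH PUBLICATION;
> >
> >   -- step 4 (the timed part): poll until every table reaches the
> >   -- 'ready' state
> >   SELECT count(*) = 0 AS all_ready
> >   FROM pg_subscription_rel
> >   WHERE srsubstate <> 'r';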
> >
> > Scripts
> > -------
> >
> > (PSA 4 scripts to implement this logic)
> >
> > testrun script
> > - this does common setup (do_one_test_setup) and then the pub/sub
> > scripts (do_one_test_PUB and do_one_test_SUB -- see below) are run in
> > parallel
> > - repeat 10 times
> >
> > do_one_test_setup script
> > - init and start instances
> > - ipc setup tables and procedures
> >
> > do_one_test_PUB script
> > - ipc setup pub/sub
> > - table setup
> > - publishes the "busy_tbl", but then waits for the subscriber to
> > subscribe to only this one
> > - alters the publication to include all other tables (so the
> > subscriber will see these only after the ALTER SUBSCRIPTION ...
> > REFRESH PUBLICATION)
> > - enters a busy INSERT loop until it is informed by the subscriber
> > that the test is finished
> >
> > do_one_test_SUB script
> > - ipc setup pub/sub
> > - table setup
> > - subscribes only to "busy_tbl", then informs the publisher when that
> > is done (this will cause the publisher to commence the stay_busy loop)
> > - after it knows the publishing busy loop has started it does
> > - ALTER SUBSCRIPTION REFRESH PUBLICATION
> > - wait until all the tablesyncs are ready <=== This is the part that
> > is timed for the test RESULT
> >
> > PROBLEM
> > -------
> >
> > The output files (e.g. *.dat_PUB and *.dat_SUB) seem to confirm that
> > the tests are working as we wanted.
> >
> > Unfortunately, there is some slot problem with the patched builds
> > (both designs #1 and #2); e.g. search for "ERROR" in the *.log files
> > and you will see many slot-related errors.
> >
> > Please note - running these same scripts with the HEAD build gave no
> > such errors, so it appears to be a patch problem.
> >
>
> Hi
>
> FYI, here is some more information about ERRORs seen.
>
> The patches were re-tested -- applied in stages (and also against the
> different scripts) to identify where the problem was introduced. Below
> are the observations:
>
> ~~~
>
> Using original test scripts
>
> 1. Using only patch v21-0001
> - no errors
>
> 2. Using only patch v21-0001+0002
> - no errors
>
> 3. Using patch v21-0001+0002+0003
> - no errors
>
> ~~~
>
> Using the "busy loop" test scripts for long transactions
>
> 1. Using only patch v21-0001
> - no errors
>
> 2. Using only patch v21-0001+0002
> - gives errors for the "no COPY in progress" issue
> e.g. ERROR: could not send data to WAL stream: no COPY in progress
>
> 3. Using patch v21-0001+0002+0003
> - gives the same "no COPY in progress" errors as above
> e.g. ERROR: could not send data to WAL stream: no COPY in progress
> - and also gives slot consistency point errors
> e.g. ERROR: could not create replication slot
> "pg_16700_sync_16514_7261998170966054867": ERROR: could not find
> logical decoding starting point
> e.g. LOG: could not drop replication slot
> "pg_16700_sync_16454_7261998170966054867" on publisher: ERROR:
> replication slot "pg_16700_sync_16454_7261998170966054867" does not
> exist

I agree that the "no COPY in progress" issue has nothing to do with the
0001 patch; the issue appears once the 0002 patch is applied.
In the case where the tablesync worker has to apply transactions after
the table is synced, the tablesync worker sends feedback of the
writepos, applypos and flushpos, which results in the "no COPY in
progress" error because the stream has already ended. I fixed it by
exiting the streaming loop once the tablesync worker is done with the
synchronization (see the sketch below); the attached 0004 patch has the
changes for this.
The rest of the v22 patches are the same as those posted by Melih in
the earlier mail.
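
The idea of the fix is roughly as below (just a sketch of the check for
illustration; it reuses the existing endofstream flag of
LogicalRepApplyLoop() in src/backend/replication/logical/worker.c, and
the exact condition and placement in the 0004 patch may differ):

    /*
     * Sketch: inside the receive loop of LogicalRepApplyLoop().  If this
     * tablesync worker has already finished synchronizing its table, the
     * walsender has ended the stream, so request an exit from the
     * streaming loop rather than sending writepos/applypos/flushpos
     * feedback on a stream that has already ended ("no COPY in
     * progress").
     */
    if (am_tablesync_worker() &&
        MyLogicalRepWorker->relstate == SUBREL_STATE_SYNCDONE)
        endofstream = true;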

Regards,
Vignesh

Attachment Content-Type Size
v22-0002-Reuse-Tablesync-Workers.patch text/x-patch 10.1 KB
v22-0003-Reuse-connection-when-tablesync-workers-change-t.patch text/x-patch 6.9 KB
v22-0001-Refactor-to-split-Apply-and-Tablesync-Workers.patch text/x-patch 25.4 KB
0004-Fix-for-Table-sync-worker-sending-the-feedback-even-.patch text/x-patch 1.4 KB
