From: | Peter Smith <smithpb2250(at)gmail(dot)com> |
---|---|
To: | Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, "Wei Wang (Fujitsu)" <wangw(dot)fnst(at)fujitsu(dot)com>, "Yu Shi (Fujitsu)" <shiy(dot)fnst(at)fujitsu(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com> |
Subject: | Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication |
Date: | 2023-08-01 04:14:02 |
Message-ID: | CAHut+Pv3z19dS+WQ0hSSSNdDJMwxQjJEnaH=BiKq65t9OF280g@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Fri, Jul 28, 2023 at 5:22 PM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> Hi Melih,
>
> BACKGROUND
> ----------
>
> We wanted to compare performance for the 2 different reuse-worker
> designs, when the apply worker is already busy handling other
> replications, and then simultaneously the test table tablesyncs are
> occurring.
>
> To test this scenario, some test scripts were written (described
> below). For comparisons, the scripts were then run using a build of
> HEAD; design #1 (v21); design #2 (0718).
>
> HOW THE TEST WORKS
> ------------------
>
> Overview:
> 1. The apply worker is made to subscribe to a 'busy_tbl'.
> 2. After the SUBSCRIPTION is created, the publisher-side then loops
> (forever) doing INSERTS into that busy_tbl.
> 3. While the apply worker is now busy, the subscriber does an ALTER
> SUBSCRIPTION REFRESH PUBLICATION to subscribe to all the other test
> tables.
> 4. We time how long it takes for all tablsyncs to complete
> 5. Repeat above for different numbers of empty tables (10, 100, 1000,
> 2000) and different numbers of sync workers (2, 4, 8, 16)
>
> Scripts
> -------
>
> (PSA 4 scripts to implement this logic)
>
> testrun script
> - this does common setup (do_one_test_setup) and then the pub/sub
> scripts (do_one_test_PUB and do_one_test_SUB -- see below) are run in
> parallel
> - repeat 10 times
>
> do_one_test_setup script
> - init and start instances
> - ipc setup tables and procedures
>
> do_one_test_PUB script
> - ipc setup pub/sub
> - table setup
> - publishes the "busy_tbl", but then waits for the subscriber to
> subscribe to only this one
> - alters the publication to include all other tables (so subscriber
> will see these only after the ALTER SUBSCRIPTION PUBLICATION REFRESH)
> - enter a busy INSERT loop until it informed by the subscriber that
> the test is finished
>
> do_one_test_SUB script
> - ipc setup pub/sub
> - table setup
> - subscribes only to "busy_tbl", then informs the publisher when that
> is done (this will cause the publisher to commence the stay_busy loop)
> - after it knows the publishing busy loop has started it does
> - ALTER SUBSCRIPTION REFRESH PUBLICATION
> - wait until all the tablesyncs are ready <=== This is the part that
> is timed for the test RESULT
>
> PROBLEM
> -------
>
> Looking at the output files (e.g. *.dat_PUB and *.dat_SUB) they seem
> to confirm the tests are working how we wanted.
>
> Unfortunately, there is some slot problem for the patched builds (both
> designs #1 and #2). e.g. Search "ERROR" in the *.log files and see
> many slot-related errors.
>
> Please note - running these same scripts with HEAD build gave no such
> errors. So it appears to be a patch problem.
>
Hi
FYI, here is some more information about ERRORs seen.
The patches were re-tested -- applied in stages (and also against the
different scripts) to identify where the problem was introduced. Below
are the observations:
~~~
Using original test scripts
1. Using only patch v21-0001
- no errors
2. Using only patch v21-0001+0002
- no errors
3. Using patch v21-0001+0002+0003
- no errors
~~~
Using the "busy loop" test scripts for long transactions
1. Using only patch v21-0001
- no errors
2. Using only patch v21-0001+0002
- gives errors for "no copy in progress issue"
e.g. ERROR: could not send data to WAL stream: no COPY in progress
3. Using patch v21-0001+0002+0003
- gives the same "no copy in progress issue" errors as above
e.g. ERROR: could not send data to WAL stream: no COPY in progress
- and also gives slot consistency point errors
e.g. ERROR: could not create replication slot
"pg_16700_sync_16514_7261998170966054867": ERROR: could not find
logical decoding starting point
e.g. LOG: could not drop replication slot
"pg_16700_sync_16454_7261998170966054867" on publisher: ERROR:
replication slot "pg_16700_sync_16454_7261998170966054867" does not
exist
------
Kind Regards,
Peter Smith.
Fujitsu Australia
From | Date | Subject | |
---|---|---|---|
Next Message | Palak Chaturvedi | 2023-08-01 04:38:52 | Re: Extension Enhancement: Buffer Invalidation in pg_buffercache |
Previous Message | Michael Paquier | 2023-08-01 03:43:21 | Incorrect handling of OOM in WAL replay leading to data loss |