Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Henry Hinze <henry(dot)hinze(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Subject: Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop
Date: 2020-11-17 02:14:23
Message-ID: CAHut+Psprtsa4o89wtNnKLxxwXeDKAX9nNsdghT1Pv63siz+AA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Tue, Nov 17, 2020 at 1:07 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> Yeah, this seems to be possible and this is the reason I mentioned
> above to dig more into this case. Did you try it via some test case? I
> think we can generate a test via debugger where after the tablesync
> worker reaches CATCHUP state stop it via debugger, then we can perform
> some large transaction on the same table which apply worker will skip
> and tablesync worker will try to apply changes and should fail.

Hello Amit.

FYI - This email is just to confirm that your above idea for debugging
the tablesync worker does work.

I have made a small temporary patch to aid testing this: PSA.

The patch just adds some more LOGs - it helps see what is going on.
The patch also gives a 30 second opportunity to attach to the
tablesync worker process.

----

So the necessary debugging steps are like this:

0. Start debugger
- Later we will attach to the tablesync worker process so it is
convenient to set all the breakpoint in advance
(gdb) b LogicalRepSyncTableStart
(gdb) b LogicalRepApplyLoop
(gdb) b apply_dispatch
(gdb) b process_syncing_tables_for_sync

1. Pub/Sub nodes: Create/Start Publisher and Subscriber nodes with
same test table.

2. Pub node: Add some initial data to the table (to give tablesync
init something to copy)

3. Pub node: CREATE PUBLISHER

4. Sub node: CREATE SUBSCRIBER
- at this point the slot gets created and the tablesync worker and
apply worker are launched
- that patch causes the tablesync worker to sleep 30 seconds to give
opportunity to attach to it in debugger

5. Debugger/tablesync: Attach to the tablesync worker
- continue
- breaks at LogicalRepSyncTableStart
- step over all the initial COPY code to the end of the function
(where apply worker has told it to CATCHUP)

6. Pub: While the tablesync worker is paused in debugger do some other
operations on the original table
- e.g. psql -d test_pub -c "INSERT INTO test_tab VALUES(1, 'foo');"

7. Debugger/tablesync: Allow the tablesync worker to continue
- tablesync worker will return from LogicalRepSyncTableStart and then
into LogicalRepApplyLoop
- if tablesync worker detects it is "behind" then the
LogicalRepApplyLoop will call apply_dispatch to deal with the message
resulting from the step 6 operation.

8. Debugger/tablesync: Some apply handlers (e.g. apply_handle_commit)
if not the LogicalRepApplyLoop itself, will end up calling the
process_syncing_tables.
- which calls process_syncing_tables_for_sync
- which will cause the tablesync worker to set SYNCDONE status just
before killing itself.

9. Sub node: The apply worker wait loop will now proceed again
tick/tick/tick every 1 second to process any incoming replication
messages.

---

Using the above technique, you can cause a tablesync worker to enter
some of the normal "apply" handling which might otherwise be very hard
to reproduce outside of the debugger.

---

Here is an example of some of the logging for the above debugging scenario:

2020-11-17 10:46:11.325 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:12.328 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:12.328 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:13.337 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:13.337 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:14.340 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:14.340 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:15.155 AEDT [9841] LOG: !!>> tablesync worker: About
to call LogicalRepSyncTableStart to do initial syncing
2020-11-17 10:46:15.343 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:15.343 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:15.343 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:15.343 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:16.349 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:16.349 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:17.351 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:17.351 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:18.353 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:18.353 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:18.650 AEDT [10734] LOG: logical decoding found
consistent point at 0/1B6D5E0
2020-11-17 10:46:18.650 AEDT [10734] DETAIL: There are no running transactions.
2020-11-17 10:46:18.650 AEDT [10734] STATEMENT:
CREATE_REPLICATION_SLOT "tap_sub_24599_sync_16385" TEMPORARY LOGICAL
pgoutput USE_SNAPSHOT
2020-11-17 10:46:18.658 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:18.658 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:18.744 AEDT [9841] LOG: !!>> tablesync worker: wait
for CATCHUP state notification
2020-11-17 10:46:18.744 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:18.744 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables

psql -d test_pub -c "INSERT INTO test_tab VALUES(1, 'foo');"
INSERT 0 1

[postgres(at)CentOS7-x64 ~]$ 2020-11-17 10:47:39.269 AEDT [9841] LOG:
!!>> tablesync worker: received CATCHUP state notification
2020-11-17 10:47:44.117 AEDT [9841] LOG: !!>> tablesync worker:
Returned from LogicalRepSyncTableStart
2020-11-17 10:48:05.678 AEDT [10734] LOG: starting logical decoding
for slot "tap_sub_24599_sync_16385"
2020-11-17 10:48:05.678 AEDT [10734] DETAIL: Streaming transactions
committing after 0/1B6D618, reading WAL from 0/1B6D5E0.
2020-11-17 10:48:05.678 AEDT [10734] STATEMENT: START_REPLICATION
SLOT "tap_sub_24599_sync_16385" LOGICAL 0/1B6D618 (proto_version '2',
publication_names '"tap_pub"')
2020-11-17 10:48:05.678 AEDT [10734] LOG: logical decoding found
consistent point at 0/1B6D5E0
2020-11-17 10:48:05.678 AEDT [10734] DETAIL: There are no running transactions.
2020-11-17 10:48:05.678 AEDT [10734] STATEMENT: START_REPLICATION
SLOT "tap_sub_24599_sync_16385" LOGICAL 0/1B6D618 (proto_version '2',
publication_names '"tap_pub"')
2020-11-17 10:48:09.258 AEDT [9841] LOG: !!>> tablesync worker:
LogicalRepApplyLoop
2020-11-17 10:49:36.537 AEDT [9841] LOG: !!>> tablesync worker:
called process_syncing_tables
2020-11-17 10:49:36.538 AEDT [9841] LOG: logical replication table
synchronization worker for subscription "tap_sub", table "test_tab"
has finished
2020-11-17 10:49:36.588 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:49:36.589 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:49:36.590 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:49:37.591 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:49:37.591 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:49:38.592 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:49:38.592 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:49:39.594 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:49:39.594 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:49:40.595 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:49:40.595 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables

---

I have so far only been trying above with the non-streaming
subscription, and TBH stepping through this startup state "dance" of
the tablesync/apply workers is already causing more questions than
answers for me. Anyway, I will raise any questions as separate emails
to this one.

BTW, do you think these tablesync discussions should be moved to
hackers list? I think they are no longer related to the reported BUG
of this current thread.

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment Content-Type Size
v1-0001-Degugging-tablesync-worker.patch application/octet-stream 3.1 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Amit Kapila 2020-11-17 08:58:53 Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop
Previous Message PG Bug reporting form 2020-11-16 18:15:20 BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in the table.