Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Henry Hinze <henry(dot)hinze(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Subject: Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop
Date: 2020-11-18 02:47:54
Message-ID: CAHut+Pt4PyKQCwqzQ=EFF=bpKKJD7XKt_S23F6L20ayQNxg77A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Wed, Nov 18, 2020 at 1:29 PM Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
>
> On 2020-Nov-04, Amit Kapila wrote:
>
> > On Thu, Oct 15, 2020 at 8:20 PM Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
>
> > > * STREAM COMMIT bug?
> > > In apply_handle_stream_commit, we do CommitTransactionCommand, but
> > > apparently in a tablesync worker we shouldn't do it.
> >
> > In the tablesync stage, we don't allow streaming. See pgoutput_startup
> > where we disable streaming for the init phase. As far as I understand,
> > for tablesync we create the initial slot during which streaming will
> > be disabled then we will copy the table (here logical decoding won't
> > be used) and then allow the apply worker to get any other data which
> > is inserted in the meantime. Now, I might be missing something here
> > but if you can explain it a bit more or share some test to show how we
> > can reach here via tablesync worker then we can discuss the possible
> > solution.
>
> Hmm, okay, that sounds like there would be no bug then. Maybe what we
> need is just an assert in apply_handle_stream_commit that
> !am_tablesync_worker(), as in the attached patch. Passes tests.

Hi.

Using the same debugging technique described in a previous mail [1], I
have tested again but this time using a SUBSCRIPTION capable of
streaming.

While paused in the debugger (to force an unusual timing situation) I
can publish INSERTs en masse and cause streaming replication to occur.

To cut a long story short, a tablesync worker CAN in fact end up
processing (e.g. apply_dispatch) streaming messages.
So the tablesync worker CAN get into the apply_handle_stream_commit.
And this scenario, albeit rare, will crash.

For example,
---
Program received signal SIGABRT, Aborted.
0x00007f37570f22c7 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x00007f37570f22c7 in raise () from /lib64/libc.so.6
#1 0x00007f37570f39b8 in abort () from /lib64/libc.so.6
#2 0x0000000000f2bf6a in ExceptionalCondition
(conditionName=0x11330a4 "IsTransactionState()", errorType=0x1133094
"FailedAssertion",
fileName=0x1133088 "tablesync.c", lineNumber=273) at assert.c:69
#3 0x0000000000bb046e in process_syncing_tables_for_sync
(current_lsn=23984432) at tablesync.c:273
#4 0x0000000000bb1039 in process_syncing_tables
(current_lsn=23984432) at tablesync.c:535
#5 0x0000000000bb5c7a in apply_handle_stream_commit
(s=0x7ffd2d892fd0) at worker.c:1080
#6 0x0000000000bb8edb in apply_dispatch (s=0x7ffd2d892fd0) at worker.c:1966
#7 0x0000000000bb97c9 in LogicalRepApplyLoop (last_received=23984432)
at worker.c:2166
#8 0x0000000000bbca84 in ApplyWorkerMain (main_arg=1) at worker.c:3122
#9 0x0000000000b36f11 in StartBackgroundWorker () at bgworker.c:820
#10 0x0000000000b5cf3a in do_start_bgworker (rw=0x1f3da50) at postmaster.c:5837
#11 0x0000000000b5d661 in maybe_start_bgworkers () at postmaster.c:6062
#12 0x0000000000b5b560 in sigusr1_handler (postgres_signal_arg=10) at
postmaster.c:5223
#13 <signal handler called>
#14 0x00007f37571b10d3 in __select_nocancel () from /lib64/libc.so.6
#15 0x0000000000b52a40 in ServerLoop () at postmaster.c:1691
#16 0x0000000000b51fa8 in PostmasterMain (argc=3, argv=0x1f164f0) at
postmaster.c:1400
#17 0x00000000009ab55d in main (argc=3, argv=0x1f164f0) at main.c:209
---

Also, PSA v2 of my "helper" patch for debugging the tablesync worker.
(This patch has more logging than v1)

---

[1] - https://www.postgresql.org/message-id/CAHut%2BPsprtsa4o89wtNnKLxxwXeDKAX9nNsdghT1Pv63siz%2BAA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment Content-Type Size
v2-0001-Degugging-tablesync-worker.patch application/octet-stream 5.1 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Amit Kapila 2020-11-18 04:18:18 Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop
Previous Message Fujii Masao 2020-11-18 01:18:20 Re: BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in the table.