Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Henry Hinze <henry(dot)hinze(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Subject: Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop
Date: 2020-11-07 05:38:22
Message-ID: CAFiTN-sn5odfWKAB2UM14NbtWx_bn6RXSJpeMXaezc+ANf0Png@mail.gmail.com
Lists: pgsql-bugs

On Sat, Nov 7, 2020 at 9:23 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Sat, Nov 7, 2020 at 5:31 AM Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
> >
> > On 2020-Nov-05, Amit Kapila wrote:
> >
> > > On Wed, Nov 4, 2020 at 7:19 PM Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
> > > >
> > > > On 2020-Nov-04, Amit Kapila wrote:
> > > >
> > > > > On Thu, Oct 15, 2020 at 8:20 PM Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
> > > >
> > > > > > * STREAM COMMIT bug?
> > > > > > In apply_handle_stream_commit, we do CommitTransactionCommand, but
> > > > > > apparently in a tablesync worker we shouldn't do it.
> > > > >
> > > > > In the tablesync stage, we don't allow streaming. See pgoutput_startup
> > > > > where we disable streaming for the init phase. As far as I understand,
> > > > > for tablesync we create the initial slot during which streaming will
> > > > > be disabled then we will copy the table (here logical decoding won't
> > > > > be used) and then allow the apply worker to get any other data which
> > > > > is inserted in the meantime. Now, I might be missing something here
> > > > > but if you can explain it a bit more or share some test to show how we
> > > > > can reach here via tablesync worker then we can discuss the possible
> > > > > solution.
> > > >
> > > > Hmm, okay, that sounds like there would be no bug then. Maybe what we
> > > > need is just an assert in apply_handle_stream_commit that
> > > > !am_tablesync_worker(), as in the attached patch. Passes tests.
> > > >
> > >
> > > +1. But do we want to have this Assert only in stream_commit API or
> > > all stream APIs as well?
> >
> > Well, the only reason I care about this is that apply_handle_commit
> > contains a comment that we must not do CommitTransactionCommand in the
> > syncworker case; so if you look at apply_handle_stream_commit and note
> > that it doesn't concern itself with that, you become concerned that it
> > might be broken. I don't think the other routines handling the "stream"
> > thing have that issue.
> >
>
> Fair enough. As mentioned in my previous email, I think we need to
> confirm how decoding happens on the upstream, after the copy, for
> transactions during the phase where the tablesync worker is moving
> from SUBREL_STATE_CATCHUP to SUBREL_STATE_SYNCDONE. I'll try to come
> up with a test case (in the next few days) to debug and test this
> particular scenario and share my findings.

IIUC, the table sync worker does the initial copy using a consistent
snapshot. After that, if the main apply worker is behind it, the table
sync worker waits for the apply worker to reach the table sync
worker's start point, and from there the apply worker can continue
applying the changes. OTOH, if the apply worker has already moved
ahead in processing the WAL after launching the table sync worker,
then the apply worker has skipped that many transactions for this
table, because the table was not yet in the SYNCDONE state. In that
case the table sync worker needs to cover the gap by applying the WAL
through the normal apply path, and the table can be moved to the
SYNCDONE state once it catches up with the apply worker. After putting
the table sync worker in the catchup state, the apply worker waits for
the table sync worker to catch up.
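
To make the two cases concrete, here is a toy model of which changes
each worker ends up applying for one table. It is only a sketch of my
understanding, not PostgreSQL code: LSNs are plain integers, and the
names Change, tablesync_changes, apply_worker_changes, copy_lsn and
apply_lsn are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    lsn: int      # position in the WAL, simplified to an integer
    table: str    # table the change belongs to


def tablesync_changes(stream, table, copy_lsn, apply_lsn):
    """Changes the table sync worker must replay itself after the copy."""
    if apply_lsn <= copy_lsn:
        # Apply worker is behind (or level): nothing was skipped, so the
        # sync worker only waits for it to reach copy_lsn.
        return []
    # Apply worker is ahead: it skipped this table's changes in
    # (copy_lsn, apply_lsn], so the sync worker replays them.
    return [c for c in stream
            if c.table == table and copy_lsn < c.lsn <= apply_lsn]


def apply_worker_changes(stream, table, copy_lsn, apply_lsn):
    """Changes for the table that remain for the main apply worker."""
    handoff = max(copy_lsn, apply_lsn)
    return [c for c in stream if c.table == table and c.lsn > handoff]


stream = [Change(10, "t1"), Change(20, "t1"), Change(25, "t2"),
          Change(30, "t1"), Change(40, "t1")]

# Apply worker ahead of the copy point: sync worker covers the gap.
ahead = tablesync_changes(stream, "t1", copy_lsn=15, apply_lsn=35)
# Apply worker behind the copy point: no gap to cover.
behind = tablesync_changes(stream, "t1", copy_lsn=35, apply_lsn=15)
```

In both cases every change to the table after the copy point is
applied exactly once, either by the table sync worker or by the apply
worker, which is the property the handoff is meant to guarantee.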

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
