Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Henry Hinze <henry(dot)hinze(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Subject: Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop
Date: 2020-10-14 01:12:07
Message-ID: 20201014011207.GA18985@alvherre.pgsql
Lists: pgsql-bugs

On 2020-Oct-12, Petr Jelinek wrote:

> It's not only about size of the tables, it's mainly that there is no write
> activity so the main apply is not moving past the LSN at which table sync
> has started at. With bigger table you get at least something written
> (running xacts, autovacuum, or whatever) that moves lsn forward eventually.

I see -- yeah, okay.

> > However, and this is one reason why I'd welcome Petr/Peter thoughts on
> > this, I don't really understand what happens in LogicalRepApplyLoop
> > afterwards with a tablesync worker; are we actually doing anything
> > useful there, considering that the actual data copy seems to have
> > occurred in the CopyFrom() call in copy_table? In other words, by the
> > time we return control to ApplyWorkerMain with a slot name, isn't the
> > work all done, and the only thing we need is to synchronize protocol and
> > close the connection?
>
> There are 2 possible states at that point, either tablesync is ahead (when
> main apply lags or nothing is happening on publication side) or it's behind
> the main apply. When tablesync is ahead we are indeed done and just need to
> update the state of the table (which is what the code you removed did, but
> LogicalRepApplyLoop should do it as well, just a bit later). When it's
> behind we need to do catchup for that table only which still happens in the
> tablesync worker. See the explanation at the beginning of tablesync.c, it
> probably needs some small adjustments after the changes in your first patch.

... Ooh, things start to make some sense now.  So how about the
attached?  It also includes a few cleanups that are not really related.
(Changes to protocol.sgml are still pending.)

If I understand correctly, the early exit in tablesync.c is not saving *a
lot* of time (we don't actually skip replaying any WAL), even if it's
saving execution of a bunch of code. So I stand by my position that
removing the code is better because it's clearer about what is actually
happening.
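
To make the two cases Petr describes concrete, here is a toy model of
the decision a tablesync worker faces once the initial copy is done.
This is only a sketch, not the code in tablesync.c: WAL positions are
modeled as plain integers, and the names Outcome, after_copy and
catch_up are hypothetical, invented for illustration.  Treating equal
positions as "ahead" is also an assumption of the sketch.

```python
from enum import Enum


class Outcome(Enum):
    # Tablesync is at or ahead of the main apply: nothing left to
    # replay, only the table's state needs updating.
    SYNC_DONE = "sync_done"
    # Tablesync is behind the main apply: it must replay changes for
    # its own table until it reaches the apply position.
    CATCH_UP = "catch_up"


def after_copy(tablesync_lsn: int, apply_lsn: int) -> Outcome:
    """Decide what remains to be done after the initial data copy.

    Both arguments are hypothetical stand-ins for WAL positions.
    """
    if tablesync_lsn >= apply_lsn:
        return Outcome.SYNC_DONE
    return Outcome.CATCH_UP


def catch_up(changes, tablesync_lsn, apply_lsn, table):
    """Return the changes a lagging tablesync worker would replay.

    `changes` is a list of (lsn, table, payload) tuples.  Only changes
    for the worker's own table, after its starting position and up to
    the apply position, are selected.
    """
    return [
        c for c in changes
        if c[1] == table and tablesync_lsn < c[0] <= apply_lsn
    ]
```

In the first case (for example after_copy(120, 100)) the worker has
nothing to replay, which is why skipping the loop early saves code
execution rather than WAL replay.  In the second case only the changes
for that one table in the gap between the two positions are applied.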

Attachment                                                Content-Type  Size
0001-Restore-logical-replication-dupe-command-tags.patch  text/x-diff   3.1 KB
0002-Review-logical-replication-tablesync-code.patch      text/x-diff   15.8 KB
