Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Henry Hinze <henry(dot)hinze(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop
Date: 2020-09-30 21:52:38
Message-ID: 912614.1601502758@sss.pgh.pa.us
Lists: pgsql-bugs

Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:
> On 2020-Sep-30, Tom Lane wrote:
>> The question that this raises is how the heck did that get past
>> our test suites? It seems like the error should have been obvious
>> to even the most minimal testing.

> ... yeah, that's indeed an important question. I'm going to guess that
> the TAP suites are too forgiving :-(

One thing I noticed while trying to trace this down is that while the
initial table sync is happening, we have *both* a regular
walsender/walreceiver pair and a "sync" pair, e.g.:

postgres 905650 0.0 0.0 186052 11888 ? Ss 17:12 0:00 postgres: logical replication worker for subscription 16398
postgres 905651 50.1 0.0 173704 13496 ? Ss 17:12 0:09 postgres: walsender postgres [local] idle
postgres 905652 104 0.4 186832 148608 ? Rs 17:12 0:19 postgres: logical replication worker for subscription 16398 sync 16393
postgres 905653 12.2 0.0 174380 15524 ? Ss 17:12 0:02 postgres: walsender postgres [local] COPY

Is it supposed to be like that? Notice also that the regular walsender
has consumed significant CPU time; it's not pinning a CPU like the sync
walreceiver is, but it's eating maybe 20% of a CPU according to "top".
I wonder whether in cases with only small tables (which is likely all
that our tests test), the regular walreceiver manages to complete the
table sync despite repeated(?) failures of the sync worker.
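For anyone wanting to spot the two pairs mechanically, here is a minimal sketch that classifies lines of "ps" output by process title. It assumes the title formats are exactly those in the listing above; the function name classify and the role labels are made up for this illustration and are not anything the server itself reports.

```python
import re

# Title patterns taken from the ps listing above (assumed stable for this sketch).
SYNC_RE = re.compile(r"logical replication worker for subscription (\d+) sync (\d+)\s*$")
APPLY_RE = re.compile(r"logical replication worker for subscription (\d+)\s*$")
WALSENDER_RE = re.compile(r"postgres: walsender (\S+) (\S+) (.+)$")


def classify(line):
    """Return (role, details) for one ps line, or None if it is unrelated.

    The role labels are illustrative names chosen for this sketch.
    """
    line = line.rstrip()
    m = SYNC_RE.search(line)
    if m:
        # Worker doing the initial copy of a single table.
        return ("sync worker", {"subscription": int(m.group(1)),
                                "relation": int(m.group(2))})
    m = APPLY_RE.search(line)
    if m:
        # Regular worker for the subscription as a whole.
        return ("regular worker", {"subscription": int(m.group(1))})
    m = WALSENDER_RE.search(line)
    if m:
        return ("walsender", {"user": m.group(1), "host": m.group(2),
                              "activity": m.group(3)})
    return None


PS_OUTPUT = """\
postgres 905650 0.0 0.0 186052 11888 ? Ss 17:12 0:00 postgres: logical replication worker for subscription 16398
postgres 905651 50.1 0.0 173704 13496 ? Ss 17:12 0:09 postgres: walsender postgres [local] idle
postgres 905652 104 0.4 186832 148608 ? Rs 17:12 0:19 postgres: logical replication worker for subscription 16398 sync 16393
postgres 905653 12.2 0.0 174380 15524 ? Ss 17:12 0:02 postgres: walsender postgres [local] COPY
"""

rows = [classify(line) for line in PS_OUTPUT.splitlines()]
for row in rows:
    print(row)
```

On the subscriber side, the pg_stat_subscription view gives similar information without parsing process titles: its relid column is null for the regular worker and set for a sync worker.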

regards, tom lane
