Re: Column Filtering in Logical Replication

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, "shiy(dot)fnst(at)fujitsu(dot)com" <shiy(dot)fnst(at)fujitsu(dot)com>
Subject: Re: Column Filtering in Logical Replication
Date: 2022-03-18 17:12:20
Message-ID: 369ae611-8822-f499-87cd-58ad0d60c60c@enterprisedb.com
Lists: pgsql-hackers

On 3/18/22 15:43, Tomas Vondra wrote:
>
>
> On 3/18/22 06:52, Amit Kapila wrote:
>> On Fri, Mar 18, 2022 at 12:47 AM Tomas Vondra
>> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>>
>>> I pushed the second fix. Interestingly enough, wrasse failed in the
>>> 013_partition test. I don't see how that could be caused by this
>>> particular commit, though - see the pgsql-committers thread [1].
>>>
>>
>> I have a theory about what's going on here. I think this is due to a
>> test added in your previous commit c91f71b9dc. The newly added test
>> hangs in tablesync because there was no apply worker to set the
>> state to SUBREL_STATE_CATCHUP, which blocked the tablesync workers from
>> proceeding.
>>
>> See below logs from pogona [1].
>> 2022-03-18 01:33:15.190 CET [2551176][client
>> backend][3/74:0][013_partition.pl] LOG: statement: ALTER SUBSCRIPTION
>> sub2 SET PUBLICATION pub_lower_level, pub_all
>> 2022-03-18 01:33:15.354 CET [2551193][logical replication
>> worker][4/57:0][] LOG: logical replication apply worker for
>> subscription "sub2" has started
>> 2022-03-18 01:33:15.605 CET [2551176][client
>> backend][:0][013_partition.pl] LOG: disconnection: session time:
>> 0:00:00.415 user=bf database=postgres host=[local]
>> 2022-03-18 01:33:15.607 CET [2551209][logical replication
>> worker][3/76:0][] LOG: logical replication table synchronization
>> worker for subscription "sub2", table "tab4_1" has started
>> 2022-03-18 01:33:15.609 CET [2551211][logical replication
>> worker][5/11:0][] LOG: logical replication table synchronization
>> worker for subscription "sub2", table "tab3" has started
>> 2022-03-18 01:33:15.617 CET [2551193][logical replication
>> worker][4/62:0][] LOG: logical replication apply worker for
>> subscription "sub2" will restart because of a parameter change
>>
>> You will notice that the apply worker is never restarted after the
>> parameter change. The reason is that this particular subscription
>> reaches the limit of max_sync_workers_per_subscription, after which we
>> don't allow the apply worker to restart. I think you might want to
>> increase the values of
>> max_sync_workers_per_subscription/max_logical_replication_workers to
>> make it work.
>>
>
> Hmmm. So the theory is that in most runs we manage to sync the tables
> faster than starting the workers, so we don't hit the limit. But on some
> machines the sync worker takes a bit longer, we hit the limit. Seems
> possible, yes. Unfortunately we don't seem to log anything when we hit
> the limit, so hard to say for sure :-( I suggest we add a WARNING
> message to logicalrep_worker_launch or something. Not just because of
> this test, it seems useful in general.
>
> However, how come we don't retry the sync? Surely we don't just give up
> forever, that'd be a pretty annoying behavior. Presumably we just end up
> sleeping for a long time before restarting the sync worker, somewhere.
>

I tried lowering max_sync_workers_per_subscription to 1 and making
the workers run for a couple of seconds (doing some CPU-intensive
stuff), but everything still works just fine.
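
For anyone who wants to repeat this, here is a minimal sketch of lowering
the limit on the subscriber. It assumes superuser access and only covers
the setting itself; how the sync workers are slowed down is not shown.

```sql
-- Run on the subscriber. max_sync_workers_per_subscription takes effect
-- on a configuration reload, so no server restart is needed.
ALTER SYSTEM SET max_sync_workers_per_subscription = 1;
SELECT pg_reload_conf();

-- Confirm the value that is now in effect.
SHOW max_sync_workers_per_subscription;
```

Running ALTER SYSTEM RESET max_sync_workers_per_subscription followed by
another reload puts the setting back afterwards.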

Looking a bit closer at the logs (from pogona and others), I doubt this
is about hitting the max_sync_workers_per_subscription limit. Notice that
we start two sync workers, but neither of them ever completes. So we never
update the sync status or start syncing the remaining tables.

So the question is why those two sync workers never complete - I guess
there's some sort of lock wait (deadlock?) or infinite loop.
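
One way to tell those two cases apart is sketched below. It is an
assumption about how one might inspect a subscriber stuck in this state,
not something taken from the buildfarm logs.

```sql
-- Per-table sync state: 'i' = initialize, 'd' = data is being copied,
-- 'f' = finished table copy, 's' = synchronized, 'r' = ready.
SELECT srrelid::regclass AS tbl, srsubstate, srsublsn
  FROM pg_subscription_rel
 ORDER BY 1;

-- What the replication workers are currently waiting on, if anything.
SELECT pid, backend_type, wait_event_type, wait_event, state
  FROM pg_stat_activity
 WHERE backend_type LIKE 'logical replication%';

-- Lock requests that have not been granted.
SELECT pid, locktype, relation::regclass AS rel, mode
  FROM pg_locks
 WHERE NOT granted;
```

A sync worker showing wait_event_type = 'Lock', or a matching row in the
last query, would point at a lock wait. A worker with no wait event that
keeps burning CPU would point at a loop instead.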

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
