Re: Column Filtering in Logical Replication

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, "shiy(dot)fnst(at)fujitsu(dot)com" <shiy(dot)fnst(at)fujitsu(dot)com>
Subject: Re: Column Filtering in Logical Replication
Date: 2022-03-29 11:03:34
Message-ID: bd11879e-b5a7-1dce-78d8-2649779d7554@enterprisedb.com
Lists: pgsql-hackers

On 3/29/22 12:00, Amit Kapila wrote:
> On Sun, Mar 20, 2022 at 4:53 PM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>
>> On 3/20/22 07:23, Amit Kapila wrote:
>>> On Sun, Mar 20, 2022 at 8:41 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>>>
>>>> On Fri, Mar 18, 2022 at 10:42 PM Tomas Vondra
>>>> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>>>
>>>>> So the question is why those two sync workers never complete - I guess
>>>>> there's some sort of lock wait (deadlock?) or infinite loop.
>>>>>
>>>>
>>>> It would be a bit tricky to reproduce this even if the above theory is
>>>> correct but I'll try it today or tomorrow.
>>>>
>>>
>>> I am able to reproduce it with the help of a debugger. First, I
>>> added a LOG message and some while (true) loops to debug the sync and
>>> apply workers. Test setup:
>>>
>>> Node-1:
>>> create table t1(c1 int);
>>> create table t2(c1 int);
>>> insert into t1 values(1);
>>> create publication pub1 for table t1;
>>> create publication pub2;
>>>
>>> Node-2:
>>> change max_sync_workers_per_subscription to 1 in postgresql.conf
>>> create table t1(c1 int);
>>> create table t2(c1 int);
>>> create subscription sub1 connection 'dbname = postgres' publication pub1;
>>>
>>> Up to this point, just let the debuggers in both workers continue.
>>>
>>> Node-1:
>>> alter publication pub1 add table t2;
>>> insert into t1 values(2);
>>>
>>> Here, we have to debug the apply worker so that, when it tries to
>>> apply the insert, the debugger stops in apply_handle_insert()
>>> after begin_replication_step() has run.
>>>
>>> Node-2:
>>> alter subscription sub1 set publication pub1, pub2;
>>>
>>> Now, continue the apply worker in the debugger; it should first start
>>> the sync worker and then exit because of the parameter change. All of
>>> these debugging steps are just to ensure that it first starts the
>>> sync worker and then exits. After this point, the table sync worker
>>> never finishes and the log is filled with messages: "reached
>>> max_sync_workers_per_subscription limit" (a message I newly added
>>> in the attached debug patch).
>>>
>>> Now, it is not completely clear to me how exactly '013_partition.pl'
>>> leads to this situation, but the logs it shows suggest that it is
>>> possible.
>>>
>>
>> Thanks, I'll take a look later.
>>
>
> This is still failing [1][2].
>
> [1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2022-03-28%2005%3A16%3A53
> [2] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2022-03-24%2013%3A13%3A08
>

AFAICS we've concluded this is a pre-existing issue, not something
introduced by a recently committed patch, and I don't think there's any
proposal for how to fix it. So I've put it on the back burner until
after the current CF.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
