RE: Perform streaming logical transactions by background workers and parallel apply

From: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, "shiy(dot)fnst(at)fujitsu(dot)com" <shiy(dot)fnst(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Perform streaming logical transactions by background workers and parallel apply
Date: 2022-12-11 11:44:55
Message-ID: OS0PR01MB571663B4EC5D8BE749AACD98941E9@OS0PR01MB5716.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

On Friday, December 9, 2022 3:14 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Thu, Dec 8, 2022 at 12:37 PM houzj(dot)fnst(at)fujitsu(dot)com
> <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> >
>
> Review comments

Thanks for the comments!

> ==============
> 1. Currently, we don't release the stream lock in LA (leader apply
> worker) for "rollback to savepoint", and the reason is mentioned in the comments of
> apply_handle_stream_abort() in the patch. But, today, while testing, I found that
> this can lead to a deadlock which otherwise won't happen on the publisher. The key
> point is that rollback to savepoint releases the locks acquired by the particular
> subtransaction, so the parallel apply worker should also do the same. Consider the
> following example where the transaction in session-1 is being performed by the
> parallel apply worker and the transaction in session-2 is being performed by the
> leader apply worker. I have simulated it by using the GUC force_stream_mode.
> Publisher
> ==========
> Session-1
> postgres=# begin;
> BEGIN
> postgres=*# savepoint s1;
> SAVEPOINT
> postgres=*# truncate t1;
> TRUNCATE TABLE
>
> Session-2
> postgres=# begin;
> BEGIN
> postgres=*# insert into t1 values(4);
>
> Session-1
> postgres=*# rollback to savepoint s1;
> ROLLBACK
>
> Session-2
> Commit;
>
> With or without the commit of Session-2, this scenario will lead to a deadlock on the
> subscriber because PA (parallel apply worker) is waiting for LA to send the next
> command, and LA is blocked by the Exclusive lock held by PA. There is no deadlock on the
> publisher because rollback to savepoint releases the lock acquired by
> truncate.
>
> To solve this, how about if we do three things before sending the abort of the
> sub-transaction: (a) unlock the stream lock, (b) increment pending_stream_count,
> (c) take the stream lock again?
>
> Now, if PA is not already waiting at stream_stop, it will not wait there
> but will wait after applying the abort of the sub-transaction, and if it is already waiting
> at stream_stop, the wait will be released. If this works then probably we should try
> to do (b) before (a) to match the steps with stream_start.

The solution works for me, I have changed the code as suggested.
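To make the (b)-before-(a) ordering concrete, here is a minimal single-process sketch of the sequence. All names (StreamState, pending_stream_count, the stream lock as a plain mutex) are stand-ins for the shared-memory state in the patch, not the actual implementation:

```python
import threading

# Sketch of what the leader apply worker (LA) does before sending a
# subtransaction abort: (b) increment pending_stream_count, then
# (a) release the stream lock, then (c) re-acquire it. A parallel
# apply worker (PA) blocked on the stream lock can then wake up, see
# pending work, and apply the abort instead of waiting indefinitely.

class StreamState:
    def __init__(self):
        self.stream_lock = threading.Lock()
        self.mutex = threading.Lock()      # protects the counter
        self.pending_stream_count = 0

    def la_release_for_abort(self):
        """LA side: run before sending the abort-subtransaction message."""
        with self.mutex:
            self.pending_stream_count += 1  # (b) first, matching stream_start
        self.stream_lock.release()          # (a) lets a waiting PA proceed
        self.stream_lock.acquire()          # (c) hold it for the next stream

    def pa_has_pending_work(self):
        """PA side: decide whether to keep applying or wait on the lock."""
        with self.mutex:
            return self.pending_stream_count > 0
```

The key property is that the counter is bumped before the lock is released, so a PA that wakes up never observes "lock free but nothing pending".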

> 2. There seems to be another general problem in the way the patch waits for
> stream_stop in PA (parallel apply worker). Currently, PA checks whether there are any
> more pending streams; if not, it tries to wait for the next stream by waiting on a
> stream lock. However, it is possible that after PA checks there is no pending stream,
> and before it actually starts waiting on the lock, LA sends another stream for
> which even stream_stop is sent. In this case, PA will start waiting for the next
> stream whereas there is actually a pending stream available. This won't
> lead to any problem apart from a delay in applying the changes, but
> for the case mentioned in the previous point (Point 1), it can lead to a deadlock
> even after we implement the solution proposed to solve it.

Thanks for reporting. I have introduced another flag in shared memory and used it to
prevent the leader from incrementing pending_stream_count while the parallel
apply worker is trying to acquire the stream lock.
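For reference, the check-then-wait race above is the classic lost-wakeup pattern, and the generic fix is to evaluate the predicate and sleep under one mutex. This is a condition-variable sketch of that general idea, not the shared-memory flag actually used in the patch:

```python
import threading

# Generic model of the race: if the worker checks "no pending streams"
# and then sleeps in two separate steps, a stream published in between
# is missed. Checking and sleeping under one mutex closes the window.

class StreamQueue:
    def __init__(self):
        self.cond = threading.Condition()
        self.pending = 0

    def publish_stream(self):
        """Leader (LA) side: make a new stream available."""
        with self.cond:
            self.pending += 1
            self.cond.notify()          # wake a waiting worker, if any

    def wait_for_stream(self):
        """Parallel worker (PA) side: block until a stream is pending."""
        with self.cond:
            # wait_for re-evaluates the predicate while holding the lock,
            # so a publish between our check and our sleep cannot be lost.
            self.cond.wait_for(lambda: self.pending > 0)
            self.pending -= 1
```

In the patch the coordination is done through shared memory and heavyweight locks rather than a condition variable, but the invariant is the same: the leader must not be able to publish between the worker's check and its wait.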

> 3. The other point to consider is that for stream_commit/prepare/abort, in LA, we
> release the stream lock after sending the message, whereas for stream_start we
> release it before sending the message. I think for the former cases
> (stream_commit/prepare/abort), the patch has done it like this because
> pa_send_data() may need to acquire the lock again when it times out and starts
> serializing, so there would be no sense in first releasing it, then re-acquiring it,
> and then releasing it again. Can't we also release the lock for stream_start after
> pa_send_data(), and only if it has not switched to serialize mode?

Changed.
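The reworked ordering could be sketched roughly as follows. This is a hypothetical illustration only: the boolean return of pa_send_data (True meaning the data went through shared memory, False meaning it timed out and switched to serialize mode) is an assumption made for the sketch, not the function's actual signature:

```python
# Hypothetical sketch of the stream_start path in the leader after the
# change: send first while still holding the stream lock, then release
# the lock only when pa_send_data() did not fall back to serializing.

def send_stream_start(msg, pa_send_data, release_stream_lock):
    sent_via_shm = pa_send_data(msg)
    if sent_via_shm:
        # PA received the message via shared memory; release the lock
        # so PA can proceed past stream_stop when this stream ends.
        release_stream_lock()
    # In serialize mode the lock is kept, since the changes will be
    # read back from a file and the lock handling differs.
    return sent_via_shm
```

This matches the stream_commit/prepare/abort paths, which also release the lock only after the send, avoiding a pointless release/re-acquire cycle when pa_send_data() times out.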

Attached is the new version patch set which addresses the above comments.
Besides, the new version patch will try to stop extra parallel workers if the user
sets max_parallel_apply_workers_per_subscription to a lower number.

Best regards,
Hou zj

Attachment Content-Type Size
v59-0003-Test-streaming-parallel-option-in-tap-test.patch application/octet-stream 80.1 KB
v59-0006-Retry-to-apply-streaming-xact-only-in-apply-work.patch application/octet-stream 22.6 KB
v59-0002-Serialize-partial-changes-to-a-file-when-the-att.patch application/octet-stream 42.3 KB
v59-0004-Allow-streaming-every-change-without-waiting-til.patch application/octet-stream 5.3 KB
v59-0005-Add-GUC-stream_serialize_threshold-and-test-seri.patch application/octet-stream 13.9 KB
v59-0007-Add-a-main_worker_pid-to-pg_stat_subscription.patch application/octet-stream 8.7 KB
v59-0001-Perform-streaming-logical-transactions-by-parall.patch application/octet-stream 193.6 KB
