RE: Perform streaming logical transactions by background workers and parallel apply

From: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, "shiy(dot)fnst(at)fujitsu(dot)com" <shiy(dot)fnst(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Perform streaming logical transactions by background workers and parallel apply
Date: 2023-01-06 04:07:49
Message-ID: OS0PR01MB5716BA08EABE25F07B61EB6D94FB9@OS0PR01MB5716.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

On Thursday, January 5, 2023 7:54 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Thu, Jan 5, 2023 at 5:03 PM houzj(dot)fnst(at)fujitsu(dot)com
> <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> >
> > On Thursday, January 5, 2023 4:22 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
>
> > Thanks for reporting the problem.
> >
> > After analyzing the behavior, I think it's a bug on the publisher side
> > that is not directly related to parallel apply.
> >
> > I think the root cause is that we don't send a stream end (stream
> > abort) message to the subscriber for a crashed transaction that was
> > streamed before the crash.
> > What happens is that, after restarting, the publisher starts to
> > decode the transaction that was aborted due to the crash. When it tries
> > to stream the first change of that transaction, it sends a stream
> > start message, but then it realizes that the transaction was aborted,
> > so it enters the PG_CATCH block of ReorderBufferProcessTXN() and calls
> > ReorderBufferResetTXN(), which sends the stream stop message. In this
> > case, a parallel apply worker is started on the subscriber and waits
> > for a stream end message that will never come.
>
> I suspected it but didn't analyze this.
>
> > I think the same thing happens in non-parallel mode, where it leaves
> > a stream file on the subscriber that is not cleaned up until the
> > apply worker is restarted.
> > To fix it, I think we need to send a stream abort message when we
> > clean up a crashed transaction on the publisher (e.g., in
> > ReorderBufferAbortOld()). Here is a tiny patch that does that. I have
> > confirmed that the bug is fixed and all regression tests pass.
> >
> > What do you think?
> > I will start a new thread and try to write a test case, if possible,
> > after we reach a consensus.
>
> I think your analysis looks correct and we can raise this in a new thread.

Thanks, I have started another thread [1].

I am attaching the parallel apply patch set here again. The patch set is
unchanged; I am attaching it only so that the CFbot keeps testing it.

[1] https://www.postgresql.org/message-id/OS0PR01MB5716A773F46768A1B75BE24394FB9%40OS0PR01MB5716.jpnprd01.prod.outlook.com
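
For anyone following along, the message sequence described in the quoted
analysis can be sketched as a toy model. This is not PostgreSQL code: the
message labels, the Subscriber class, the function names, and the xid value
1000 are all hypothetical, and the ordering of stop and abort is a
simplification. It only illustrates why a stream start followed by a stream
stop, with no stream abort, leaves per-transaction state behind on the
subscriber, and how sending an abort during cleanup would release it.

```python
from dataclasses import dataclass, field

# Illustrative labels only; these are not the wire-format identifiers.
STREAM_START = "stream_start"
STREAM_STOP = "stream_stop"
STREAM_ABORT = "stream_abort"


def publisher_after_restart(xid, send_abort_on_cleanup):
    """Messages a toy publisher emits when re-decoding a crashed transaction."""
    # The first change is about to be streamed, so a stream start goes out.
    msgs = [(STREAM_START, xid)]
    # The transaction turns out to be aborted; the error path closes the
    # current chunk with a stream stop (ReorderBufferResetTXN() in the analysis).
    msgs.append((STREAM_STOP, xid))
    if send_abort_on_cleanup:
        # Proposed fix (assumption about where it fires): cleanup of crashed
        # transactions, e.g. in ReorderBufferAbortOld(), also sends an abort.
        msgs.append((STREAM_ABORT, xid))
    return msgs


@dataclass
class Subscriber:
    """Toy subscriber tracking transactions with open streaming state."""
    open_streams: set = field(default_factory=set)

    def apply(self, msg, xid):
        if msg == STREAM_START:
            # Stands in for a parallel apply worker or a stream file.
            self.open_streams.add(xid)
        elif msg == STREAM_ABORT:
            # Only an end-of-stream message releases the state.
            self.open_streams.discard(xid)
        # STREAM_STOP ends the current chunk but keeps the state around.


def leftover_streams(send_abort_on_cleanup, xid=1000):
    """Return the xids still holding subscriber state after the sequence."""
    sub = Subscriber()
    for msg, msg_xid in publisher_after_restart(xid, send_abort_on_cleanup):
        sub.apply(msg, msg_xid)
    return sub.open_streams
```

In this model, leftover_streams(False) still contains the xid, which
corresponds to the waiting parallel apply worker (or the leftover stream
file in non-parallel mode), while leftover_streams(True) is empty.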

Best regards,
Hou zj

Attachment                                                       Content-Type              Size
v74-0005-Add-a-main_worker_pid-to-pg_stat_subscription.patch     application/octet-stream  9.5 KB
v74-0001-Perform-apply-of-large-transactions-by-parallel-.patch  application/octet-stream  264.6 KB
v74-0002-Add-GUC-stream_serialize_threshold-and-test-seri.patch  application/octet-stream  12.4 KB
v74-0003-Stop-extra-worker-if-GUC-was-changed.patch              application/octet-stream  4.1 KB
v74-0004-Retry-to-apply-streaming-xact-only-in-apply-work.patch  application/octet-stream  21.1 KB
