Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2020-07-12 16:26:34
Message-ID: CAFiTN-sx9O68Lcb7DP5SmPumsdgui3=_XSVqODJnYJpH45mdxw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 6, 2020 at 11:43 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Sun, Jul 5, 2020 at 4:47 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > >
> > > > 9.
> > > > +ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
> > > > {
> > > > ..
> > > > + ReorderBufferToastReset(rb, txn);
> > > > + if (specinsert != NULL)
> > > > + ReorderBufferReturnChange(rb, specinsert);
> > > > ..
> > > > }
> > > >
> > > > Why do we need to do these here when we wouldn't have been done for
> > > > any exception other than ERRCODE_TRANSACTION_ROLLBACK?
> > >
> > > Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK"
> > > gracefully and we are continuing with further decoding so we need to
> > > return this change back.
> > >
> >
> > Okay, then I suggest we should do these before calling stream_stop and
> > also move ReorderBufferResetTXN after calling stream_stop to follow a
> > pattern similar to try block unless there is a reason for not doing
> > so. Also, it would be good if we can initialize specinsert with NULL
> > after returning the change as we are doing at other places.
>
> Okay
>
> > > > 10. I have got the below failure once. I have not investigated this
> > > > in detail as the patch is still under progress. See, if you have any
> > > > idea?
> > > > # Failed test 'check extra columns contain local defaults'
> > > > # at t/013_stream_subxact_ddl_abort.pl line 81.
> > > > # got: '2|0'
> > > > # expected: '1000|500'
> > > > # Looks like you failed 1 test of 2.
> > > > make[2]: *** [check] Error 1
> > > > make[1]: *** [check-subscription-recurse] Error 2
> > > > make[1]: *** Waiting for unfinished jobs....
> > > > make: *** [check-world-src/test-recurse] Error 2
> > >
> > > I also got the failure once, and after that it did not reproduce. I
> > > have executed it multiple times but it did not reproduce again. Are
> > > you able to reproduce it consistently?
> > >
> >
> > No, I am also not able to reproduce it consistently but I think this
> > can fail if a subscriber sends the replay_location before actually
> > replaying the changes. First, I thought that extra send_feedback we
> > have in apply_handle_stream_commit might have caused this but I guess
> > that can't happen because we need the commit time location for that
> > and we are storing the same at the end of apply_handle_stream_commit
> > after applying all messages. I am not sure what is going on here. I
> > think we somehow need to reproduce this or some variant of this test
> > consistently to find the root cause.
>
> And I think it appeared for the first time for me, so maybe some
> changes in the last few versions have exposed it. I have noticed
> that I am able to reproduce it almost 50% of the time after a clean
> build, so I can trace back the version in which it started appearing;
> that way it will be easy to narrow down.

I think the reason for the failure is that we are not setting
remote_final_lsn in streaming mode. I added more logging and reran the
test, and the logs suggest that some of the logical WAL changes were
not replayed because of the check below in
should_apply_changes_for_rel.
    return (rel->state == SUBREL_STATE_READY ||
            (rel->state == SUBREL_STATE_SYNCDONE &&
             rel->statelsn <= remote_final_lsn));

I still need to do a detailed analysis of why this fails only in
some cases. Basically, most of the time rel->state is
SUBREL_STATE_READY, so this check passes, but whenever the state is
SUBREL_STATE_SYNCDONE it fails because we never update
remote_final_lsn. I will try setting this value in
apply_handle_stream_commit and see whether the test still fails.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
