Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2020-12-02 07:50:40
Message-ID: CAFiTN-sFPUKbT2EhPro1WVCzgR+MrcKZ3qeEMuJrmvPubWYCNg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Dec 1, 2020 at 11:38 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah(at)leadboat(dot)com> wrote:
> > >
> > > On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:
> > > > Thanks, I have pushed the last patch. Let's wait for a day or so to
> > > > see the buildfarm reports
> > >
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
> > > failed the new 015_stream.pl test with the subscriber looping like this:
> > >
> > > 2020-09-08 11:22:49.848 UTC [13959252:1] LOG: logical replication apply worker for subscription "tap_sub" has started
> > > 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> > > 2020-09-08 11:22:54.055 UTC [7602182:1] LOG: logical replication apply worker for subscription "tap_sub" has started
> > > 2020-09-08 11:22:54.101 UTC [31785284:4] LOG: background worker "logical replication worker" (PID 13959252) exited with exit code 1
> > > 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> > > ...
> > >
> > > What happened there?
> > >
> >
> > What is going on here is that the expected streaming file is missing.
> > Normally, the first time we send a stream of changes (some percentage
> > of transaction changes) we create the streaming file, and then in
> > respective streams we just keep on writing in that file the changes we
> > receive from the publisher, and on commit, we read that file and apply
> > all the changes.
> >
> > The above kind of error can happen due to the following reasons: (a)
> > the first time we sent the stream and created the file and that got
> > removed before the second stream reached the subscriber. (b) from the
> > publisher-side, we never sent the indication that it is the first
> > stream and the subscriber directly tries to open the file thinking it
> > is already there.
> >
> > Now, the publisher and subscriber log doesn't directly indicate any of
> > the above problems but I have some observations.
> >
> > The subscriber log indicates that before the apply worker exits due to
> > an error the new apply worker gets started. We delete the
> > streaming-related temporary files on proc_exit, so one possibility
> > could have been that the new apply worker has created the streaming
> > file which the old apply worker has removed but that is not possible
> > because we always create these temp-files by having procid in the
> > path.
>
> Yeah, and I have tried to test on this line, basically, after the
> streaming has started I have set the binary=on. Now using gdb I have
> made the worker wait before it deletes the temp file and meanwhile the
> new worker started and it worked properly as expected.
>
> > The other thing I observed in the code is that we can mark the
> > transaction as streamed (via ReorderBufferTruncateTxn) if we try to
> > stream a transaction that has no changes the first time we try to
> > stream the transaction. This would lead to symptom (b) because the
> > second-time when there are more changes we would stream the changes as
> > it is not the first time. However, this shouldn't happen because we
> > never pick-up a transaction to stream which has no changes. I can try
> > to fix the code here such that we don't mark the transaction as
> > streamed unless we have streamed at least one change but I don't see
> > how it is related to this particular test failure.
>
> Yeah, this can be improved but as you mentioned that we never select
> an empty transaction for streaming so this case should not occur. I
> will perform some testing/review around this and report.

I have executed "make check" in the loop with only this file. I have
repeated it 5000 times but no failure, I am wondering shall we try to
execute in the same machine in a loop where it failed once?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2020-12-02 08:20:25 Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Previous Message Michael Paquier 2020-12-02 07:42:16 Re: Commitfest 2020-11 is closed