Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2020-11-30 13:20:35
Message-ID: CAA4eK1L6OuysEW=63qY-qxAu01h-FK84tHw8+SUpX0VgM6CO4Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah(at)leadboat(dot)com> wrote:
>
> On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:
> > Thanks, I have pushed the last patch. Let's wait for a day or so to
> > see the buildfarm reports
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
> failed the new 015_stream.pl test with the subscriber looping like this:
>
> 2020-09-08 11:22:49.848 UTC [13959252:1] LOG: logical replication apply worker for subscription "tap_sub" has started
> 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> 2020-09-08 11:22:54.055 UTC [7602182:1] LOG: logical replication apply worker for subscription "tap_sub" has started
> 2020-09-08 11:22:54.101 UTC [31785284:4] LOG: background worker "logical replication worker" (PID 13959252) exited with exit code 1
> 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> ...
>
> What happened there?
>

What is going on here is that the expected streaming file is missing.
Normally, the first time we send a stream of changes (some percentage
of transaction changes) we create the streaming file, and then in
respective streams we just keep on writing in that file the changes we
receive from the publisher, and on commit, we read that file and apply
all the changes.

The above kind of error can happen due to the following reasons: (a)
the first time we sent the stream and created the file and that got
removed before the second stream reached the subscriber. (b) from the
publisher-side, we never sent the indication that it is the first
stream and the subscriber directly tries to open the file thinking it
is already there.

Now, the publisher and subscriber log doesn't directly indicate any of
the above problems but I have some observations.

The subscriber log indicates that before the apply worker exits due to
an error the new apply worker gets started. We delete the
streaming-related temporary files on proc_exit, so one possibility
could have been that the new apply worker has created the streaming
file which the old apply worker has removed but that is not possible
because we always create these temp-files by having procid in the
path.

The other thing I observed in the code is that we can mark the
transaction as streamed (via ReorderBufferTruncateTxn) if we try to
stream a transaction that has no changes the first time we try to
stream the transaction. This would lead to symptom (b) because the
second-time when there are more changes we would stream the changes as
it is not the first time. However, this shouldn't happen because we
never pick-up a transaction to stream which has no changes. I can try
to fix the code here such that we don't mark the transaction as
streamed unless we have streamed at least one change but I don't see
how it is related to this particular test failure.

I am not sure why this failure is not repeated since it occurred a few
months back, it's probably a timing issue. I have few timing issues in
the last month or so related to this feature but I am not able to come
up with a theory if any of those would have fixed this problem.

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dmitry Dolgov 2020-11-30 13:26:19 Re: [HACKERS] [PATCH] Generic type subscripting
Previous Message Ashutosh Bapat 2020-11-30 13:19:03 Re: Printing LSN made easy