Re: pg_recvlogical: Prevent flushed data from being re-sent after restarting replication

From: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Mircea Cadariu <cadariu(dot)mircea(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: pg_recvlogical: Prevent flushed data from being re-sent after restarting replication
Date: 2026-01-12 07:08:21
Message-ID: 2995625E-BF51-431B-BF17-BAF064005E74@gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Jan 7, 2026, at 11:36, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>
> On Mon, Dec 29, 2025 at 9:45 PM Mircea Cadariu <cadariu(dot)mircea(at)gmail(dot)com> wrote:
>>
>> Hi,
>>
>> Thanks for the patch updates.
>>
>> On 26/12/2025 10:28, Fujii Masao wrote:
>>
>> Maybe it's better to use slurp_file(). We already have wait_for_log() to
>> wait for a message in the cluster's log file, but there's no helper function
>> to wait for specific content to appear in an arbitrary file.
>>
>> To support waiting for output in pg_recvlogical's output file,
>> I added a new helper that uses slurp_file() (see the attached 0002 patch).
>> I also updated the 0003 patch (the pg_recvlogical reconnection test) to
>> use this helper instead of pg_read_file(). Thoughts?
>>
>> Agreed, nice addition.
>>
>> I applied the v3-000* patch set and it builds successfully and passes the tests on my laptop.
>>
>> However the CI seems not completely happy yet, with previous 2 runs not green for Windows. Could it be there's an issue with executing the test on Windows?
>
> Thanks for the report!
>
> The TAP test failed on Windows because it attempted to terminate
> pg_recvlogical using a TERM signal, which isn't available there.
> As a result, the test waited indefinitely for pg_recvlogical to exit
> and finally timed out.
>
> To address this, I updated the 0003 patch so that the test passes
> --endpos to pg_recvlogical on Windows only. This allows pg_recvlogical
> to terminate without signals, by generating WAL until the current
> position reaches the specified end position. OTOH, on non-Windows
> platforms, the test continues to use signals to terminate pg_recvlogical.
>
> This approach may be somewhat unstable. If there's a more robust
> way to terminate pg_recvlogical on Windows, I'd be happy to switch
> to it, but I couldn't come up with a better option.
>
> Updated patches are attached.
>
> Regards,
>
> --
> Fujii Masao
> <v4-0004-pg_recvlogical-remove-unnecessary-OutputFsync-ret.patch><v4-0003-Add-test-for-pg_recvlogical-reconnection-behavior.patch><v4-0001-pg_recvlogical-Prevent-flushed-data-from-being-re.patch><v4-0002-Add-a-new-helper-function-wait_for_file-to-Utils..patch>

Hi Fujii-san,

Thanks for the patch. Here are my comments on v4.

1 - 0001
```
+ /*
+ * Save the last flushed position as the replication start point. On
+ * reconnect, replication resumes from there to avoid re-sending flushed
+ * data.
+ */
+ startpos = output_fsync_lsn;
```

Looking at function OutputFsync(), fsync() may fail and there a few branches to return early without fsync(), so should we only update startpos after fsync()?

2 - 0001
```
+ if (outfd != -1 && strcmp(outfile, "-") != 0)
+ OutputFsync(feGetCurrentTimestamp());
```

Do we need to check return code of OutputFsync()? I checked this file, the only caller that doesn’t check return code of OutputFsync() has a comment for why:
```
/* no need to jump to error on failure here, we're finishing anyway */
OutputFsync(t);
```

I saw 0004 has changed OutputFsync to return nothing. So, it’s ok to not adding a comment here. But I just feel that, if we want to make the commit self-contained, it would be better to add a comment here, but that’s not a strong opinion.

3 - 0001

No a direct issue of this patch. I noticed that, everywhere that calls OutputFsync(), it checks outfd != -1 && strcmp(outfile, "-") != 0. However, two places miss the check:

```
if (outfd != -1 &&
feTimestampDifferenceExceeds(output_last_fsync, now,
fsync_interval))
{
if (!OutputFsync(now))
goto error;
}
```
This place doesn’t check strcmp(outfile, "-") != 0.

```
static bool
flushAndSendFeedback(PGconn *conn, TimestampTz *now)
{
/* flush data to disk, so that we send a recent flush pointer */
if (!OutputFsync(*now))
return false;
*now = feGetCurrentTimestamp();
if (!sendFeedback(conn, *now, true, false))
return false;

return true;
}
```
flushAndSendFeedback() doesn’t check the both conditions. I don’t why the checks can be skipped.

I want to hear your opinion. If you consider this is a problem, then you may address it in this patch; or you want me to address it in a separate patch?

4 - 0002
```
+ croak "timed out waiting for match: $regexp”;
```

Is it more helpful to include filename in the error message?

5 - 0003
```
+my ($stdout, $stderr);
+my $recv = IPC::Run::start(
+ [(at)pg_recvlogical_cmd],
+ '>' => \$stdout,
+ '2>' => \$stderr);
```

$stdout and $stderr are never used.

6 - 0004 LGTM. As OutputFsync() only returned true, making it as void makes the code neater.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2026-01-12 07:09:35 Re: Proposal: Conflict log history table for Logical Replication
Previous Message Pavel Stehule 2026-01-12 06:57:39 Re: global temporary table (GTT) - are there some ideas how to implement it?