| From: | Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> |
|---|---|
| To: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
| Cc: | Mircea Cadariu <cadariu(dot)mircea(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: pg_recvlogical: Prevent flushed data from being re-sent after restarting replication |
| Date: | 2026-01-12 07:08:21 |
| Message-ID: | 2995625E-BF51-431B-BF17-BAF064005E74@gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
> On Jan 7, 2026, at 11:36, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>
> On Mon, Dec 29, 2025 at 9:45 PM Mircea Cadariu <cadariu(dot)mircea(at)gmail(dot)com> wrote:
>>
>> Hi,
>>
>> Thanks for the patch updates.
>>
>> On 26/12/2025 10:28, Fujii Masao wrote:
>>
>> Maybe it's better to use slurp_file(). We already have wait_for_log() to
>> wait for a message in the cluster's log file, but there's no helper function
>> to wait for specific content to appear in an arbitrary file.
>>
>> To support waiting for output in pg_recvlogical's output file,
>> I added a new helper that uses slurp_file() (see the attached 0002 patch).
>> I also updated the 0003 patch (the pg_recvlogical reconnection test) to
>> use this helper instead of pg_read_file(). Thoughts?
>>
>> Agreed, nice addition.
>>
>> I applied the v3-000* patch set and it builds successfully and passes the tests on my laptop.
>>
>> However the CI seems not completely happy yet, with previous 2 runs not green for Windows. Could it be there's an issue with executing the test on Windows?
>
> Thanks for the report!
>
> The TAP test failed on Windows because it attempted to terminate
> pg_recvlogical using a TERM signal, which isn't available there.
> As a result, the test waited indefinitely for pg_recvlogical to exit
> and finally timed out.
>
> To address this, I updated the 0003 patch so that the test passes
> --endpos to pg_recvlogical on Windows only. This allows pg_recvlogical
> to terminate without signals, by generating WAL until the current
> position reaches the specified end position. OTOH, on non-Windows
> platforms, the test continues to use signals to terminate pg_recvlogical.
>
> This approach may be somewhat unstable. If there's a more robust
> way to terminate pg_recvlogical on Windows, I'd be happy to switch
> to it, but I couldn't come up with a better option.
>
> Updated patches are attached.
>
> Regards,
>
> --
> Fujii Masao
> <v4-0004-pg_recvlogical-remove-unnecessary-OutputFsync-ret.patch><v4-0003-Add-test-for-pg_recvlogical-reconnection-behavior.patch><v4-0001-pg_recvlogical-Prevent-flushed-data-from-being-re.patch><v4-0002-Add-a-new-helper-function-wait_for_file-to-Utils..patch>
Hi Fujii-san,
Thanks for the patch. Here are my comments on v4.
1 - 0001
```
+ /*
+ * Save the last flushed position as the replication start point. On
+ * reconnect, replication resumes from there to avoid re-sending flushed
+ * data.
+ */
+ startpos = output_fsync_lsn;
```
Looking at function OutputFsync(), fsync() may fail and there a few branches to return early without fsync(), so should we only update startpos after fsync()?
2 - 0001
```
+ if (outfd != -1 && strcmp(outfile, "-") != 0)
+ OutputFsync(feGetCurrentTimestamp());
```
Do we need to check return code of OutputFsync()? I checked this file, the only caller that doesn’t check return code of OutputFsync() has a comment for why:
```
/* no need to jump to error on failure here, we're finishing anyway */
OutputFsync(t);
```
I saw 0004 has changed OutputFsync to return nothing. So, it’s ok to not adding a comment here. But I just feel that, if we want to make the commit self-contained, it would be better to add a comment here, but that’s not a strong opinion.
3 - 0001
No a direct issue of this patch. I noticed that, everywhere that calls OutputFsync(), it checks outfd != -1 && strcmp(outfile, "-") != 0. However, two places miss the check:
```
if (outfd != -1 &&
feTimestampDifferenceExceeds(output_last_fsync, now,
fsync_interval))
{
if (!OutputFsync(now))
goto error;
}
```
This place doesn’t check strcmp(outfile, "-") != 0.
```
static bool
flushAndSendFeedback(PGconn *conn, TimestampTz *now)
{
/* flush data to disk, so that we send a recent flush pointer */
if (!OutputFsync(*now))
return false;
*now = feGetCurrentTimestamp();
if (!sendFeedback(conn, *now, true, false))
return false;
return true;
}
```
flushAndSendFeedback() doesn’t check the both conditions. I don’t why the checks can be skipped.
I want to hear your opinion. If you consider this is a problem, then you may address it in this patch; or you want me to address it in a separate patch?
4 - 0002
```
+ croak "timed out waiting for match: $regexp”;
```
Is it more helpful to include filename in the error message?
5 - 0003
```
+my ($stdout, $stderr);
+my $recv = IPC::Run::start(
+ [(at)pg_recvlogical_cmd],
+ '>' => \$stdout,
+ '2>' => \$stderr);
```
$stdout and $stderr are never used.
6 - 0004 LGTM. As OutputFsync() only returned true, making it as void makes the code neater.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Peter Smith | 2026-01-12 07:09:35 | Re: Proposal: Conflict log history table for Logical Replication |
| Previous Message | Pavel Stehule | 2026-01-12 06:57:39 | Re: global temporary table (GTT) - are there some ideas how to implement it? |