From: | vignesh C <vignesh21(at)gmail(dot)com> |
---|---|
To: | "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com> |
Subject: | Re: pg_logical_slot_get_changes waits continously for a partial WAL record spanning across 2 pages |
Date: | 2025-06-30 12:21:51 |
Message-ID: | CALDaNm0tw=1uGehCzN177RAQWfgqMZOxdB5SwYf_wTK=5sLqUA@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, 30 Jun 2025 at 17:41, Hayato Kuroda (Fujitsu)
<kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
>
> Dear Vignesh,
>
> > I was unable to reproduce the same test failure on the PG17 branch,
> > even after running the test around 500 times. However, on the master
> > branch, the failure consistently reproduces approximately once in
> > every 50 runs. I also noticed that while the buildfarm has reported
> > multiple failures for this test for the master branch, none of them
> > appear to be on the PG17 branch. I'm not yet sure why this discrepancy
> > exists.
>
> I was also not able to reproduce it as-is. After analyzing a bit more, I found
> that on PG17 the workload cannot generate an FPI_FOR_HINT record. That type of
> WAL record is longer than a page, so on HEAD there was a possibility that the
> record could be flushed partially. On PG17 that could not happen, so
> OVERWRITE_CONTRECORD would never appear.
>
> I modified the test code like [1] and confirmed that the same hang can happen
> on PG17. The change generates a long record that can span a page boundary and
> can be flushed partially.
>
> [1]:
> ```
> --- a/src/test/recovery/t/046_checkpoint_logical_slot.pl
> +++ b/src/test/recovery/t/046_checkpoint_logical_slot.pl
> @@ -123,6 +123,10 @@ $node->safe_psql('postgres',
> $node->safe_psql('postgres',
> q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});
>
> +# Generate a long WAL record
> +$node->safe_psql('postgres',
> + q{select pg_logical_emit_message(false, '', repeat('123456789', 1000))});
> ```
Thanks, Kuroda-san. I’ve prepared a similar test that doesn’t rely on
injection points. With it, the issue reproduces consistently on all
branches back to PG13. You can use the attached
049_slot_get_changes_wait_continously_pg17.pl script (found in the
049_slot_get_changes_wait_continously_pg17.zip file) to verify this.
Just copy the script to src/test/recovery and run the test to observe
the problem.
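For reference, here is a rough sketch of why the long message in the
quoted test change always spans a page boundary. It is only an
illustration in Python, not part of the attached script. It assumes the
default XLOG_BLCKSZ of 8192 bytes and a 24-byte short page header,
ignores the larger header on the first page of a segment, and treats the
9000-byte payload as a lower bound on the record length. The function
name pages_touched is hypothetical.

```python
# Assumed defaults; both are build-time values in PostgreSQL.
XLOG_BLCKSZ = 8192        # WAL page size in bytes (default build)
SHORT_PAGE_HEADER = 24    # short WAL page header size (assumption)


def pages_touched(start_offset: int, record_len: int) -> int:
    """Count the WAL pages touched by a record of record_len bytes
    that starts at start_offset within its first page."""
    room_on_first_page = XLOG_BLCKSZ - start_offset
    if record_len <= room_on_first_page:
        return 1
    remaining = record_len - room_on_first_page
    usable_per_page = XLOG_BLCKSZ - SHORT_PAGE_HEADER
    # Ceiling division for the continuation pages.
    return 1 + -(-remaining // usable_per_page)


# Payload built the same way as repeat('123456789', 1000) in the test.
payload_len = len("123456789" * 1000)

# Even in the best case (record starts right after the page header),
# the record cannot fit on a single page.
best_case = min(
    pages_touched(offset, payload_len)
    for offset in range(SHORT_PAGE_HEADER, XLOG_BLCKSZ)
)
print(payload_len, best_case)
```

Since the record crosses a page boundary wherever it starts, a flush
that stops at the boundary can leave only the first part of it on disk.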
The attached patches address the issue. Use
v3_PG17-0001-Fix-infinite-wait-when-reading-partially-written-.patch
for PG17, PG16, and PG15, and
v3_PG14-0001-Fix-infinite-wait-when-reading-partially-written-.patch
for PG14 and PG13.
Regards,
Vignesh
Attachment | Content-Type | Size |
---|---|---|
049_slot_get_changes_wait_continously_pg17.zip | application/x-zip-compressed | 1.5 KB |
v3_PG14-0001-Fix-infinite-wait-when-reading-partially-written-.patch | application/octet-stream | 2.4 KB |
v3-0001-Fix-infinite-wait-when-reading-partially-written-.patch | application/octet-stream | 7.6 KB |
v3_PG17-0001-Fix-infinite-wait-when-reading-partially-written-.patch | application/octet-stream | 2.5 KB |