From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, vignesh C <vignesh21(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject: Re: pg_logical_slot_get_changes waits continuously for a partial WAL record spanning across 2 pages
Date: 2025-07-19 22:30:49
Message-ID: CAPpHfdtguXBVnCF=oFsWeFGa7AdG0XnnofcLXLTBOiMHAOFyrQ@mail.gmail.com
Lists: pgsql-hackers
On Sat, Jul 19, 2025 at 10:49 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> > I went through the patchset. Everything looks good to me. I only did
> > some improvements to comments and commit messages. I'm going to push
> > this if no objections.
>
> There's apparently something wrong in the v17 branch, as three
> separate buildfarm members have now hit timeout failures in
> 046_checkpoint_logical_slot.pl [1][2][3]. I tried to reproduce
> this locally, and didn't have much luck initially. However,
> if I build with a configuration similar to grassquit's, it
> will hang up maybe one time in ten:
>
> export ASAN_OPTIONS='print_stacktrace=1:disable_coredump=0:abort_on_error=1:detect_leaks=0:detect_stack_use_after_return=0'
>
> export UBSAN_OPTIONS='print_stacktrace=1:disable_coredump=0:abort_on_error=1'
>
> ./configure ... usual flags plus ... CFLAGS='-O1 -ggdb -g3 -fno-omit-frame-pointer -Wall -Wextra -Wno-unused-parameter -Wno-sign-compare -Wno-missing-field-initializers -fsanitize=address -fno-sanitize-recover=all' --enable-injection-points
>
> The fact that 046_checkpoint_logical_slot.pl is skipped in
> non-injection-point builds is probably reducing the number
> of buildfarm failures, since only a minority of animals
> have that turned on yet.
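
(For reference, that skip is presumably just the standard injection-point
guard at the top of the TAP test; a minimal sketch, assuming the usual
enable_injection_points environment flag exported by the build system:)

use strict;
use warnings;
use PostgreSQL::Test::Utils;
use Test::More;

# Skip the whole test on builds configured without --enable-injection-points;
# the flag is exported into the TAP test environment by the build system.
if ($ENV{enable_injection_points} ne 'yes')
{
	plan skip_all => 'Injection points not supported by this build';
}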
>
> I don't see anything obviously wrong in the test changes, and the
> postmaster log from the failures looks pretty clearly like what is
> hanging up is the pg_logical_slot_get_changes call:
>
> 2025-07-19 16:10:07.276 CEST [3458309][client backend][0/2:0] LOG: statement: select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
> 2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] LOG: starting logical decoding for slot "slot_logical"
> 2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] DETAIL: Streaming transactions committing after 0/290000F8, reading WAL from 0/1540F40.
> 2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] STATEMENT: select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
> 2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] LOG: logical decoding found consistent point at 0/1540F40
> 2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] DETAIL: There are no running transactions.
> 2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] STATEMENT: select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
> 2025-07-19 16:59:56.828 CEST [3458140][postmaster][:0] LOG: received immediate shutdown request
> 2025-07-19 16:59:56.841 CEST [3458309][client backend][0/2:0] LOG: could not send data to client: Broken pipe
> 2025-07-19 16:59:56.841 CEST [3458309][client backend][0/2:0] STATEMENT: select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
> 2025-07-19 16:59:56.851 CEST [3458140][postmaster][:0] LOG: database system is shut down
>
> So my impression is that the bug is not reliably fixed in 17.
>
> One other interesting thing is that once it's hung, the test does
> not stop after PG_TEST_TIMEOUT_DEFAULT elapses. You can see
> above that olingo took nearly 50 minutes to give up, and in
> manual testing it doesn't seem to stop either (though I've not
> got the patience to wait 50 minutes...)
Thank you for pointing this out!
Apparently I backpatched d3917d8f13e7 everywhere except
REL_17_STABLE. This will be fixed now.
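
As for the test not stopping after PG_TEST_TIMEOUT_DEFAULT: perhaps the
decoding call in 046_checkpoint_logical_slot.pl should be run with an
explicit timeout, so a hang fails the test instead of stalling the
buildfarm member for an hour. A rough sketch, assuming the stock
timeout/timed_out parameters of PostgreSQL::Test::Cluster::psql() and the
slot name the test already uses:

my $timed_out = 0;
$node->psql(
	'postgres',
	q{select count(*) from pg_logical_slot_get_changes('slot_logical', null, null)},
	timeout   => $PostgreSQL::Test::Utils::timeout_default,
	timed_out => \$timed_out);
# Fail (rather than hang) if the call did not finish in time.
ok(!$timed_out, 'logical decoding finished before the timeout');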
------
Regards,
Alexander Korotkov
Supabase