From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, vignesh C <vignesh21(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject: Re: pg_logical_slot_get_changes waits continously for a partial WAL record spanning across 2 pages
Date: 2025-07-19 19:49:01
Message-ID: 120120.1752954541@sss.pgh.pa.us
Lists: pgsql-hackers
Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> I went through the patchset. Everything looks good to me. I only did
> some improvements to comments and commit messages. I'm going to push
> this if no objections.
There's apparently something wrong in the v17 branch, as three
separate buildfarm members have now hit timeout failures in
046_checkpoint_logical_slot.pl [1][2][3]. I tried to reproduce
this locally, and didn't have much luck initially. However,
if I build with a configuration similar to grassquit's, it
will hang up maybe one time in ten:
export ASAN_OPTIONS='print_stacktrace=1:disable_coredump=0:abort_on_error=1:detect_leaks=0:detect_stack_use_after_return=0'
export UBSAN_OPTIONS='print_stacktrace=1:disable_coredump=0:abort_on_error=1'
./configure ... usual flags plus ... CFLAGS='-O1 -ggdb -g3 -fno-omit-frame-pointer -Wall -Wextra -Wno-unused-parameter -Wno-sign-compare -Wno-missing-field-initializers -fsanitize=address -fno-sanitize-recover=all' --enable-injection-points
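For anyone who wants to retry this locally, a loop along these lines should do (just a sketch; it assumes the usual make-based TAP harness and that the test lives in src/test/recovery):
# rough repro loop: a failing run breaks out, a hung run simply leaves this iteration stuck
for i in $(seq 1 20); do
  make -C src/test/recovery check PROVE_TESTS='t/046_checkpoint_logical_slot.pl' || break
done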
The fact that 046_checkpoint_logical_slot.pl is skipped in
non-injection-point builds is probably reducing the number
of buildfarm failures, since only a minority of animals
have that turned on yet.
I don't see anything obviously wrong in the test changes, and the
postmaster log from the failures looks pretty clearly like what is
hanging up is the pg_logical_slot_get_changes call:
2025-07-19 16:10:07.276 CEST [3458309][client backend][0/2:0] LOG: statement: select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] LOG: starting logical decoding for slot "slot_logical"
2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] DETAIL: Streaming transactions committing after 0/290000F8, reading WAL from 0/1540F40.
2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] STATEMENT: select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] LOG: logical decoding found consistent point at 0/1540F40
2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] DETAIL: There are no running transactions.
2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] STATEMENT: select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
2025-07-19 16:59:56.828 CEST [3458140][postmaster][:0] LOG: received immediate shutdown request
2025-07-19 16:59:56.841 CEST [3458309][client backend][0/2:0] LOG: could not send data to client: Broken pipe
2025-07-19 16:59:56.841 CEST [3458309][client backend][0/2:0] STATEMENT: select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
2025-07-19 16:59:56.851 CEST [3458140][postmaster][:0] LOG: database system is shut down
So my impression is that the bug is not reliably fixed in 17.
One other interesting thing is that once it's hung, the test does
not stop after PG_TEST_TIMEOUT_DEFAULT elapses. You can see
above that olingo took nearly 50 minutes to give up, and in
manual testing it doesn't seem to stop either (though I've not
got the patience to wait 50 minutes...)
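For reference, the timeout can be forced down when re-running the test by hand, e.g. (illustrative only; the test is still skipped unless the build has --enable-injection-points):
PG_TEST_TIMEOUT_DEFAULT=60 make -C src/test/recovery check PROVE_TESTS='t/046_checkpoint_logical_slot.pl'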
regards, tom lane
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2025-07-19%2014%3A07%3A23
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2025-07-19%2014%3A07%3A56
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2025-07-19%2016%3A29%3A32