| From: | Glauber Batista <glauberrbatista(at)gmail(dot)com> |
|---|---|
| To: | pgsql-bugs(at)lists(dot)postgresql(dot)org |
| Subject: | Autoprewarm workers terminated due to a segmentation fault |
| Date: | 2026-06-09 18:37:24 |
| Message-ID: | CAO+_mTQgQyTYwDh=U8iTnsDmOGyWsZJjUV31SmEYwmw6_xY6Bw@mail.gmail.com |
| Lists: | pgsql-bugs |
Hello,
I have an issue with the autoprewarm workers segfaulting during service
restart. Sometimes the server restarts successfully after a few tries, but
usually I need to remove the autoprewarm.blocks file. My setup consists of a
primary server with two replicas, and all of them show the same issue. I
have been using this setup for several years with no problems, but the
crashes started after I upgraded to Postgres 18. This is a production database.
Details:
Postgres Version: PostgreSQL 18.3 on aarch64-unknown-linux-gnu, compiled by
gcc (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0, 64-bit
These are the related settings I'm using in postgresql.conf:
```
shared_preload_libraries = 'pg_stat_statements,pg_prewarm'
pg_prewarm.autoprewarm = True
pg_prewarm.autoprewarm_interval = 300s
```
I'm using systemd to manage Postgres, but the crash also happens if I start
postgres using `pg_ctl`, so I ruled out a systemd issue.
This is the error message I'm seeing:
```
LOG: restored log file "000000010000079200000015" from archive
LOG: consistent recovery state reached at 792/15A586C8
LOG: database system is ready to accept read-only connections
LOG: restored log file "000000010000079200000016" from archive
LOG: restored log file "000000010000079200000017" from archive
LOG: background worker "autoprewarm worker" (PID 2350) was terminated by signal 11: Segmentation fault
LOG: terminating any other active server processes
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted while in recovery at log time 2026-06-09 17:34:19 UTC
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
LOG: restored log file "00000001000007920000000D" from archive
LOG: restored log file "0000000100000791000000F8" from archive
LOG: entering standby mode
LOG: redo starts at 791/F804CB38
LOG: database system is ready to accept read-only connections
LOG: restored log file "0000000100000791000000F9" from archive
LOG: restored log file "0000000100000791000000FA" from archive
LOG: restored log file "0000000100000791000000FB" from archive
LOG: background worker "autoprewarm worker" (PID 2522) was terminated by signal 11: Segmentation fault
LOG: terminating any other active server processes
FATAL: could not restore file "0000000100000791000000FC" from archive: child process was terminated by signal 3: Quit
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted while in recovery at log time 2026-06-09 17:34:19 UTC
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
LOG: restored log file "00000001000007920000000D" from archive
LOG: restored log file "0000000100000791000000F8" from archive
LOG: entering standby mode
```
Changing the debug level to DEBUG1 didn't show anything useful, so I'm not
pasting it here.
At first, I thought it could be related to this bug:
https://www.datadoghq.com/blog/engineering/unraveling-a-postgres-segfault/,
but after some investigation that does not seem to be the case. Either way, I
recompiled Postgres using CFLAGS="-O0 -g -fno-strict-aliasing" to check
whether the crash was related to compiler optimization on ARM64, but the
issue persisted.
Then I used gdb to inspect the core file. Since it has been quite some time
since I last debugged anything written in C/C++, I used Claude to guide me.
Here's some info:
```
(gdb) frame 0
#0 0x0000f5a0c6003854 in autoprewarm_database_main (main_arg=0) at autoprewarm.c:649
649 blk = block_info[i];
(gdb) list
644
645 read_stream_end(stream);
646
647 /* Advance i past all the blocks just prewarmed. */
648 i = p.pos;
649 blk = block_info[i];
650 }
651
652 relation_close(rel, AccessShareLock);
653 CommitTransactionCommand();
(gdb) p *block_info
$1 = {database = 0, tablespace = 0, filenumber = 0, forknum = MAIN_FORKNUM,
blocknum = 0}
(gdb) p p
$2 = {block_info = 0xf5a07f600000, pos = 131072, tablespace = 1663,
filenumber = 28197, forknum = MAIN_FORKNUM, nblocks = 65329}
(gdb) p stream
$3 = (ReadStream *) 0xb82a946671f0
(gdb) p *(ReadStream *) stream
$4 = {max_ios = 0, io_combine_limit = 0, ios_in_progress = 0, queue_size =
0, max_pinned_buffers = 225, forwarded_buffers = 0, pinned_buffers = 0,
distance = 1,
initialized_buffers = 9, read_buffers_flags = 0, sync_mode = false,
batch_mode = true, advice_enabled = false, temporary = false,
buffered_blocknum = 4294967295,
callback = 0xf5a0c600330c <apw_read_stream_next_block>,
callback_private_data = 0xfffff758cea8, seq_blocknum = 2403,
seq_until_processed = 4294967295,
pending_read_blocknum = 2403, pending_read_nblocks = 0,
per_buffer_data_size = 0, per_buffer_data = 0x0, ios = 0xb82a94667618,
oldest_io_index = 8, next_io_index = 8,
fast_path = false, oldest_buffer_index = 9, next_buffer_index = 9,
buffers = 0xb82a94667254}
(gdb) p p.pos
$5 = 131072
(gdb) p p.nblocks
$6 = 65329
(gdb) p apw_state->prewarm_stop_idx
$7 = 0
(gdb) p apw_state->prewarm_start_idx
$8 = 0
(gdb) p block_info[131072]
Cannot access memory at address 0xf5a07f880000
(gdb) p &block_info[131072]
$9 = (BlockInfoRecord *) 0xf5a07f880000
(gdb) p block_info[131071]
$10 = {database = 0, tablespace = 0, filenumber = 0, forknum =
MAIN_FORKNUM, blocknum = 0}
(gdb) p block_info[131070]
$11 = {database = 0, tablespace = 0, filenumber = 0, forknum =
MAIN_FORKNUM, blocknum = 0}
(gdb) p block_info[1]
$12 = {database = 0, tablespace = 0, filenumber = 0, forknum =
MAIN_FORKNUM, blocknum = 0}
(gdb) p apw_state->prewarmed_blocks
$13 = 0
(gdb) p *apw_state
$14 = {lock = {tranche = 0, state = {value = 0}, waiters = {head = 0, tail
= 0}}, bgworker_pid = 0, pid_using_dumpfile = 0, block_info_handle = 0,
database = 0,
prewarm_start_idx = 0, prewarm_stop_idx = 0, prewarmed_blocks = 0}
(gdb) info line autoprewarm.c:649
Line 649 of "autoprewarm.c" starts at address 0xf5a0c600382c
<autoprewarm_database_main+936> and ends at 0xf5a0c600384c
<autoprewarm_database_main+968>.
(gdb) list autoprewarm.c:640,660
640 {
641 apw_state->prewarmed_blocks++;
642 ReleaseBuffer(buf);
643 }
644
645 read_stream_end(stream);
646
647 /* Advance i past all the blocks just prewarmed. */
648 i = p.pos;
649 blk = block_info[i];
650 }
651
652 relation_close(rel, AccessShareLock);
653 CommitTransactionCommand();
654 }
655
656 dsm_detach(seg);
657 }
658
659 /*
660 * Dump information on blocks in shared buffers. We use a text format here
```
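As a cross-check of the gdb output above, the two addresses it prints are enough to recover the byte offset of the faulting element and the element size. The following is only a small arithmetic sketch, not PostgreSQL code: the helper names `byte_offset` and `implied_record_size` are made up for illustration, and the inputs are the values gdb printed for `p.block_info`, `&block_info[131072]`, and `p.pos`.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: distance in bytes from the array base to the
 * address of one of its elements.
 */
static uint64_t
byte_offset(uint64_t base, uint64_t elem_addr)
{
    assert(elem_addr >= base);
    return elem_addr - base;
}

/*
 * Hypothetical helper: element size implied by the base address, the
 * address of element `index`, and the index itself.
 */
static uint64_t
implied_record_size(uint64_t base, uint64_t elem_addr, uint64_t index)
{
    assert(index > 0);
    return byte_offset(base, elem_addr) / index;
}

/*
 * Inputs taken from the gdb session:
 *   base      = 0xf5a07f600000  (p.block_info)
 *   elem_addr = 0xf5a07f880000  (&block_info[131072])
 *   index     = 131072          (p.pos)
 */
```

With those inputs, `byte_offset` gives 2621440 bytes (0x280000) and `implied_record_size` gives 20 bytes, which matches a record made of the five 4-byte fields shown in the gdb output (database, tablespace, filenumber, forknum, blocknum). The faulting read is therefore exactly 131072 records past the start of the array, and `block_info[131071]` is the last element gdb could still read.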
I found that some parts of autoprewarm were recently rewritten in this
thread:
https://www.postgresql.org/message-id/flat/CAN55FZ3n8Gd%2BhajbL%3D5UkGzu_aHGRqnn%2BxktXq2fuds%3D1AOR6Q%40mail.gmail.com
and I think that change could be related, given the data present in the dump.
Also, I inspected my data to ensure it was not the culprit. I took the
database and filenumber from the `bt full` output:
```
blk = {database = 23583, tablespace = 1663, filenumber = 28197, forknum =
MAIN_FORKNUM, blocknum = 49}
```
Then I looked up the database (23583) and the filenumber (28197) using
```
SELECT datname FROM pg_database WHERE oid = 23583;
\c <datname>
SELECT relname, relkind, relpages, pg_relation_filenode(oid) AS filenode
FROM pg_class
WHERE pg_relation_filenode(oid) = 28197;
```
and it returned
```
relname | relkind | relpages | filenode
--------------+---------+----------+----------
lccss_cc_idx | i | 64805 | 28197
```
So I think it's not stale data or a relation shrunk by VACUUM,
TRUNCATE, etc.
I'm only reporting with the latest data I collected, but this issue has
been happening since April, when I upgraded the database.
All that said, it seems there's a missing guard clause at line 649. I
didn't spend much time reading the code, but it's clearly reading an array
position (block_info[131072]) that is not accessible.
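A guard of this kind can be sketched in isolation. The following is a simplified standalone model, not the actual autoprewarm.c code and not a proposed patch: the struct only mirrors the fields visible in the gdb output, and `fetch_block_checked`, `count_same_filenumber`, and the `stop_idx` parameter are hypothetical names used to show the idea of checking the index before reading `block_info[i]`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the record type shown in the gdb output. */
typedef struct BlockInfoRecord
{
    uint32_t database;
    uint32_t tablespace;
    uint32_t filenumber;
    int32_t  forknum;
    uint32_t blocknum;
} BlockInfoRecord;

/*
 * Hypothetical helper: copy block_info[pos] into *out only when pos lies
 * below stop_idx. Returns 1 on success, 0 when pos is out of range.
 */
static int
fetch_block_checked(const BlockInfoRecord *block_info, size_t stop_idx,
                    size_t pos, BlockInfoRecord *out)
{
    assert(block_info != NULL && out != NULL);
    if (pos >= stop_idx)
        return 0;
    *out = block_info[pos];
    return 1;
}

/*
 * Hypothetical model of the outer loop: count the consecutive records,
 * starting at `start`, that share one filenumber. Every advance goes
 * through the guarded fetch, so the loop stops cleanly at stop_idx
 * instead of reading one element past the end.
 */
static size_t
count_same_filenumber(const BlockInfoRecord *block_info, size_t start,
                      size_t stop_idx)
{
    BlockInfoRecord blk;
    size_t   i = start;
    uint32_t filenumber;

    if (!fetch_block_checked(block_info, stop_idx, i, &blk))
        return 0;
    filenumber = blk.filenumber;

    while (blk.filenumber == filenumber)
    {
        i++;
        if (!fetch_block_checked(block_info, stop_idx, i, &blk))
            break;              /* reached the end of the array */
    }
    return i - start;
}
```

In this model, a position equal to `stop_idx` ends the loop instead of being dereferenced. Whether the real code should clamp, break, or treat such a position as an error is a question for the code's maintainers.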
Let me know if any extra information is needed.
Best,
Glauber Cassiano Batista
| Attachment | Content-Type | Size |
|---|---|---|
| bt_full.txt | text/plain | 8.1 KB |