| From: | Timur Magomedov <t(dot)magomedov(at)postgrespro(dot)ru> |
|---|---|
| To: | Peter Smith <smithpb2250(at)gmail(dot)com> |
| Cc: | "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(at)vondra(dot)me>, "Aya Iwata (Fujitsu)" <iwata(dot)aya(at)fujitsu(dot)com>, Japin Li <japinli(at)hotmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com> |
| Subject: | Re: [WIP]Vertical Clustered Index (columnar store extension) - take2 |
| Date: | 2025-11-12 16:11:51 |
| Message-ID: | 36cedffdfcac437afd692442cf9c1d16d7f28b01.camel@postgrespro.ru |
| Lists: | pgsql-hackers |
Hello Peter!
I've succeeded in making a reproducer for an infrequent bug I've seen
several times with the ROS control daemon enabled.
It looks like the WAL records produced by the ROS control daemon while
processing the "vci_rc_update_del_vec" command are not compatible with
what the heap_xlog_prune_freeze() function expects to read from WAL.
Those records are produced in cleanUpWos(); the specific line looks like
"recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE_ON_ACCESS);".
The reproduction steps may look tricky; any ideas for improving them are
welcome. This would ideally be a TAP test. For now I just patch the code
so that the ROS daemon terminates right after the "update delete vector"
command, then kill all postgres processes; the next time PostgreSQL is
started, it hits an assertion inside the heap_xlog_prune_freeze() function.
This is the reproduction routine in four steps:
1. Patch VCI using vci_always_fail_update_delete_vector.patch and build
it.
2. Set up VCI in the config file (ros_control_daemon enabled):
shared_preload_libraries = 'vci'
max_worker_processes = 20
vci.table_rows_threshold = 0
vci.cost_threshold = 0
vci.enable_ros_control_daemon = true
3. Run the reproducer.sh script, which runs pgbench on a VCI-enabled
table and, once pgbench has failed, immediately terminates all postgres
processes using "killall -s 9 postgres". "pg_ctl stop" can't terminate
PostgreSQL here. The "update delete vector" command is usually executed
within ten minutes on my system, so some waiting is needed.
4. At this point there are some WAL records on storage that (at least on
my machine) PostgreSQL is unable to apply; recovery fails the assertion:
$ pg_ctl start
...
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/017689D8
..TRAP: failed Assert("do_prune || nplans > 0 || vmflags & VISIBILITYMAP_VALID_BITS"), File: "heapam_xlog.c", Line: 117, PID: 841207
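For readers without the source at hand, the failed condition can be modelled outside the server. This is only a minimal sketch: it assumes that do_prune is derived from the redirected/dead/unused item counts, and that VISIBILITYMAP_VALID_BITS is 0x03 (all-visible | all-frozen). The function name record_does_something is made up for illustration and does not exist in PostgreSQL or VCI.

```python
# Value of VISIBILITYMAP_VALID_BITS:
# VISIBILITYMAP_ALL_VISIBLE (0x01) | VISIBILITYMAP_ALL_FROZEN (0x02).
VISIBILITYMAP_VALID_BITS = 0x03


def record_does_something(nredirected, ndead, nunused, nplans, vmflags):
    """Hypothetical model of the Assert in heap_xlog_prune_freeze()."""
    # Assumption: do_prune is true when any prune item array is non-empty.
    do_prune = nredirected > 0 or ndead > 0 or nunused > 0
    return do_prune or nplans > 0 or bool(vmflags & VISIBILITYMAP_VALID_BITS)


# A record that announces no sub-records leaves every count at zero.
print(record_does_something(0, 0, 0, 0, 0))    # False: the Assert fires
# The same record with 185 items announced as unused would pass.
print(record_does_something(0, 0, 185, 0, 0))  # True
```

In other words, the assertion fires whenever the record announces no prune items, no freeze plans and no visibility map bits, regardless of how many bytes of payload follow.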
The debugger shows that the data actually contains some offsets, in
order, but the combination of format and flags is unexpected:
#6 0x000055fad895d1f7 in heap_xlog_prune_freeze (record=0x55faf23f0ce0) at heapam_xlog.c:117
117 Assert(do_prune || nplans > 0 || vmflags & VISIBILITYMAP_VALID_BITS);
(gdb) print dataptr
$1 = 0x7072b5e18248 "\001"
(gdb) print datalen
$2 = 370
(gdb) print frz_offsets
$3 = (OffsetNumber *) 0x7072b5e18248
(gdb) print *frz_offsets@185
$4 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71,
72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88,
89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133,
134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147,
148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161,
162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175,
176, 177, 178, 179, 180, 181, 182, 183, 184, 185}
(gdb) print frz_offsets==dataptr
$5 = 1
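The numbers in this session are consistent with the payload being nothing but a bare offset array: OffsetNumber is a 2-byte type, so datalen = 370 corresponds to exactly 185 entries, and frz_offsets == dataptr indicates that no bytes were consumed ahead of the array. A small sketch of that arithmetic, using a synthetic payload in place of the real WAL data (decode_offsets is a made-up helper, not a PostgreSQL function):

```python
import struct

SIZEOF_OFFSET_NUMBER = 2  # OffsetNumber is a uint16 in PostgreSQL


def decode_offsets(payload):
    """Interpret the whole payload as a bare array of uint16 offsets."""
    count, remainder = divmod(len(payload), SIZEOF_OFFSET_NUMBER)
    if remainder:
        raise ValueError("payload is not a whole number of offsets")
    # "=" selects native byte order with standard sizes; "H" is a uint16.
    return list(struct.unpack(f"={count}H", payload))


# Rebuild a payload shaped like the one in the gdb session: offsets 1..185.
payload = struct.pack("=185H", *range(1, 186))
offsets = decode_offsets(payload)
print(len(payload), len(offsets), offsets[0], offsets[-1])  # 370 185 1 185
```

If the record carried counted sub-records in front of the offsets, the payload length would not divide this cleanly into the 185 values seen above.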
I've also attached a backtrace from GDB.
I don't yet understand how to fix this, and the reproduction is clunky,
so any ideas are welcome.
Does this reproduce on your system too? Is it a known problem?
--
Regards,
Timur Magomedov
| Attachment | Content-Type | Size |
|---|---|---|
| vci_always_fail_update_delete_vector.patch | text/x-patch | 559 bytes |
| reproducer.sh | application/x-shellscript | 277 bytes |
| gdb_backtrace.txt | text/plain | 1.8 KB |