Re: [WIP]Vertical Clustered Index (columnar store extension) - take2

From: Timur Magomedov <t(dot)magomedov(at)postgrespro(dot)ru>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(at)vondra(dot)me>, "Aya Iwata (Fujitsu)" <iwata(dot)aya(at)fujitsu(dot)com>, Japin Li <japinli(at)hotmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: [WIP]Vertical Clustered Index (columnar store extension) - take2
Date: 2025-11-12 16:11:51
Message-ID: 36cedffdfcac437afd692442cf9c1d16d7f28b01.camel@postgrespro.ru
Lists: pgsql-hackers

Hello Peter!
I've succeeded in making a reproducer for an infrequent bug I've seen
several times with the ROS control daemon enabled.
It looks like the WAL records produced by the ROS control daemon while
processing the "vci_rc_update_del_vec" command are not compatible with
what heap_xlog_prune_freeze() expects to read from WAL. Those records
are produced in cleanUpWos(); the specific line looks like "recptr
= XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE_ON_ACCESS);".
The reproduction steps may look tricky; any ideas for improving them
are welcome. Ideally this would be a TAP test. For now I just patch the
code so that the ROS daemon terminates right after the "update delete
vector" command, then kill all postgres processes. The next time
PostgreSQL is started, it hits an assertion inside
heap_xlog_prune_freeze().

This is the reproduction routine in four steps:

1. Patch VCI using vci_always_fail_update_delete_vector.patch and build
it.

2. Set up VCI in the config file (ROS control daemon enabled):
shared_preload_libraries = 'vci'
max_worker_processes = 20
vci.table_rows_threshold = 0
vci.cost_threshold = 0
vci.enable_ros_control_daemon = true

3. Run the reproducer.sh script, which runs pgbench on a VCI-enabled
table and, once pgbench has failed, immediately terminates all postgres
processes using "killall -s 9 postgres". "pg_ctl stop" can't terminate
PostgreSQL here. The "update delete vector" command is usually executed
within ten minutes on my system, so some waiting is needed.
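
The flow of step 3 could be sketched roughly as below. This is not the
attached reproducer.sh: the function name, the injectable "run"
parameter, and the pgbench arguments ("-T 600", database "postgres")
are assumptions for illustration, and the VCI table setup is left out.
Only the "killall -s 9 postgres" command is taken from the step above.

```python
import subprocess

# The pgbench arguments are assumptions for illustration; only the
# killall command is quoted from the reproduction steps.
PGBENCH_CMD = ["pgbench", "-T", "600", "postgres"]
KILL_CMD = ["killall", "-s", "9", "postgres"]


def crash_after_pgbench_failure(run=subprocess.run):
    """Run pgbench; if it fails, SIGKILL every postgres process.

    Returns True when the kill was issued, False when pgbench succeeded.
    `run` is injectable so the flow can be exercised without a server.
    """
    result = run(PGBENCH_CMD)
    if result.returncode == 0:
        # pgbench finished normally, so the failure was not triggered.
        return False
    # pgbench failed (presumably because the patched daemon terminated),
    # so kill everything at once and leave the WAL for crash recovery.
    run(KILL_CMD)
    return True
```

Passing a stub as "run" lets the control flow be checked without
touching a real cluster.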

4. At this point there are WAL records on storage that (at least on my
machine) PostgreSQL is unable to apply, and recovery fails the assertion:

$ pg_ctl start
...
LOG: database system was not properly shut down; automatic recovery in
progress
LOG: redo starts at 0/017689D8
..TRAP: failed Assert("do_prune || nplans > 0 || vmflags &
VISIBILITYMAP_VALID_BITS"), File: "heapam_xlog.c", Line: 117, PID:
841207
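
For readers who want the condition spelled out, here is a minimal
Python model of the expression inside the failing Assert. The function
name is hypothetical. The flag values are the ones I believe
PostgreSQL's visibilitymapdefs.h defines. How do_prune and nplans are
derived from the record is not modelled here.

```python
# Visibility-map flag bits (values as in PostgreSQL's visibilitymapdefs.h).
VISIBILITYMAP_ALL_VISIBLE = 0x01
VISIBILITYMAP_ALL_FROZEN = 0x02
VISIBILITYMAP_VALID_BITS = 0x03


def prune_record_does_something(do_prune, nplans, vmflags):
    """Model of: do_prune || nplans > 0 || vmflags & VISIBILITYMAP_VALID_BITS."""
    return bool(do_prune or nplans > 0 or (vmflags & VISIBILITYMAP_VALID_BITS))
```

In this model a record with no pruning, no freeze plans, and no valid
VM bits evaluates to False, which is the state the TRAP reports.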

The debugger shows that the data actually contains some offsets, in
order, but the format and flags combination is unexpected:
#6 0x000055fad895d1f7 in heap_xlog_prune_freeze
(record=0x55faf23f0ce0) at heapam_xlog.c:117
117 Assert(do_prune || nplans > 0 || vmflags &
VISIBILITYMAP_VALID_BITS);
(gdb) print dataptr
$1 = 0x7072b5e18248 "\001"
(gdb) print datalen
$2 = 370
(gdb) print frz_offsets
$3 = (OffsetNumber *) 0x7072b5e18248
(gdb) print *frz_offsets@185
$4 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71,
72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88,
89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105,
106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133,
134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147,
148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161,
162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175,
176, 177, 178, 179, 180, 181, 182, 183, 184, 185}
(gdb) print frz_offsets==dataptr
$5 = 1
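
The numbers in the gdb session are consistent with the whole payload
being one packed OffsetNumber array: OffsetNumber is a uint16, and
370 / 2 = 185. The sketch below rebuilds a buffer of that shape and
decodes it. The sample bytes are reconstructed, not the actual WAL
data. Little-endian byte order is an assumption, though it matches gdb
printing dataptr as "\001" (byte 0x01 followed by 0x00). The helper
name is hypothetical.

```python
import struct

OFFSET_NUMBER_SIZE = 2  # PostgreSQL's OffsetNumber is a uint16


def decode_offsets(data, byteorder="<"):
    """Interpret raw block data as a packed array of OffsetNumber values."""
    if len(data) % OFFSET_NUMBER_SIZE != 0:
        raise ValueError("length is not a multiple of sizeof(OffsetNumber)")
    count = len(data) // OFFSET_NUMBER_SIZE
    return list(struct.unpack(f"{byteorder}{count}H", data))


# Rebuild a buffer shaped like the one in the gdb session: offsets 1..185.
sample = struct.pack("<185H", *range(1, 186))
offsets = decode_offsets(sample)
```

With this buffer, len(sample) is 370 and offsets is the ascending run
1..185, the same as the values gdb prints for frz_offsets.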

I've also attached a backtrace from GDB.

I don't yet understand how to fix this, and the reproduction is clunky,
so any ideas are welcome.
Does this reproduce on your system too? Is it some known problem?

--
Regards,
Timur Magomedov

Attachment                                   Content-Type               Size
vci_always_fail_update_delete_vector.patch   text/x-patch               559 bytes
reproducer.sh                                application/x-shellscript  277 bytes
gdb_backtrace.txt                            text/plain                 1.8 KB
