From: | Jim Nasby <jnasby(at)upgrade(dot)com> |
---|---|
To: | Tomas Vondra <tomas(at)vondra(dot)me> |
Cc: | "Aya Iwata (Fujitsu)" <iwata(dot)aya(at)fujitsu(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [WIP]Vertical Clustered Index (columnar store extension) - take2 |
Date: | 2025-06-04 22:51:16 |
Message-ID: | CAMFBP2piL84fv6cxyd7z+OgKjkBhNW1UDx=NFZamtJ2tc4ik8g@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Wed, Jun 4, 2025 at 5:19 PM Jim Nasby <jnasby(at)upgrade(dot)com> wrote:
> What I can imagine is "VCI" as a "proxy" TAM on top of heap, keeping the
>> columnar format in a separate fork. And using either that from custom
>> scans, or the heap as a fallback for cases not supported by VCI.
>>
>
> Yeah, there'd definitely need to be some kind of proxy... I'm just
> suggesting that we don't *have* to do that as a separate fork...
>
(tl;dr: there are some key things that can only be implemented in the
engine that would enable much more complex features to be added at the SQL
level, without requiring tons of C code to implement the larger feature)
Oh, one other thing worth mentioning... it's actually not terribly hard to
build a column-store in userspace today: just turn every column of a table
into an array and set the TOAST target low enough so that it all gets
toasted. I tested that many years ago, and even though I couldn't set the
toast target back then saw some really encouraging results... provided that
I constructed my queries carefully (I also had range fields for each column
that stored the min/max of each array, so the planner could completely skip
de-toasting anything that would not contain values of interest.)
The reason I never went anywhere with this concept is it'd be very hard for
most folks to write queries that performed well. The transform from column
back to row-based was actually hidden behind a view (a bunch of unnest()'s)
- but if you didn't make use of the range fields in your query you lost a
lot (but not all[1]) of the performance gain. I know that I could have used
the hooks to teach the planner how to do this, but it would have been a
huge amount of work (at least for me) to do so.
It did occur to me recently that a generic system for teaching the
optimizer additional transforms it could make would be generally useful. By
far the biggest example would be a way to teach it that
WHERE timestamp_field :: date = '2025-6-4'
is the same thing as
WHERE timestamp_field >= '2025-6-4' AND timestamp_field < '2025-6-5'
That would be extremely helpful in a lot of environments. There are
definitely other cases where you can apply the same kinds of logic. In
particular, such a feature (if generic enough) would make it possible to
write simple queries against a view that transformed columnar data (stored
as arrays) back into a row format *and* apply additional predicates that
would make those queries highly efficient - all done via pure SQL.
[1] In my testing (which used the taxicab database) there was still a
performance gain from storing the data as arrays, even if the queries to
access it took no special efforts to eliminate unnecessary data. The reason
is that TOAST meant that the base data was being compressed. In fact,
testing showed that there was a win even if you didn't treat each
individual column as an array; you could simply store an array of a
composite type and still see a win.
From | Date | Subject | |
---|---|---|---|
Next Message | Thomas Munro | 2025-06-04 23:35:23 | Re: Custom Glibc collation version strings under LOCPATH |
Previous Message | Bruce Momjian | 2025-06-04 22:40:11 | Re: PG 18 release notes draft committed |