| From: | David Rowley <dgrowleyml(at)gmail(dot)com> |
|---|---|
| To: | Andres Freund <andres(at)anarazel(dot)de> |
| Cc: | John Naylor <johncnaylorls(at)gmail(dot)com>, Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: More speedups for tuple deformation |
| Date: | 2026-03-06 04:09:41 |
| Message-ID: | CAApHDvpdB1t7LCgH8=KOKC6VBb2rsEbaas0FiXo5awsRgCsDxQ@mail.gmail.com |
| Lists: | pgsql-hackers |
One of my goals for proactively populating
CompactAttribute.attcacheoff is to make it so we're able to support
deforming only a subset of columns. If we only need a small number of
columns from the tuple and all those columns have a known attcacheoff
and no NULLs come prior, then we can quite efficiently just go to
those cached offsets and fetch only the attributes that we need. To
do this, we'll need an extra array to store which attnums we're
interested in, rather than deforming all attrs up to the highest
attnum that we need, as we do today. I expect that looking at this
new array will slow things down a bit when we're accessing either most
or all columns in, say, a SELECT * query. So, IMO, it'd be bad to
*replace* the current deforming code with code which does this.
Instead, I propose we add an additional deform operator and have some
heuristic which decides which one is best to use. I expect
ExecPushExprSetupSteps() could make that choice fairly easily. Perhaps
something cheap, such as checking whether bms_num_members(scan_attrs)
is less than half of bms_prev_member(scan_attrs, -1) (the highest member).
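As a rough illustration of that heuristic, here is a minimal standalone C sketch. It assumes a 64-bit mask in place of a real Bitmapset and 1-based attnums; toy_attrset, toy_add_member, toy_num_members, toy_highest_member and toy_use_selective_deform are all hypothetical names and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a Bitmapset: bit (attnum - 1) is set when the
 * 1-based attribute number is needed.  Limited to 64 attributes.
 */
typedef uint64_t toy_attrset;

/* Add a 1-based attnum to the set. */
static toy_attrset
toy_add_member(toy_attrset s, int attnum)
{
    assert(attnum >= 1 && attnum <= 64);
    return s | ((toy_attrset) 1 << (attnum - 1));
}

/* Count the set bits; plays the role of bms_num_members(). */
static int
toy_num_members(toy_attrset s)
{
    int n = 0;

    while (s != 0)
    {
        s &= s - 1;             /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Highest needed attnum, or 0 when the set is empty. */
static int
toy_highest_member(toy_attrset s)
{
    int attnum = 0;

    while (s != 0)
    {
        s >>= 1;
        attnum++;
    }
    return attnum;
}

/*
 * Assumed heuristic: prefer selective deforming when fewer than half of
 * the attributes up to the highest needed one are actually needed.
 */
static bool
toy_use_selective_deform(toy_attrset needed)
{
    if (needed == 0)
        return false;
    return toy_num_members(needed) * 2 < toy_highest_member(needed);
}
```

With only attnum 43 in the set, this picks selective deforming; with attnums 1 to 40 all present, it falls back to the regular path.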
There are going to be many cases where the attcacheoff isn't known for
the attributes being selected. So that we still get some gains when
that's the case, I've coded it up so that we start walking the tuple
at the last attribute that has an attcacheoff. In many cases, that'll
mean we don't need to walk the entire tuple. Often, leading columns
are fixed-width, so this means that there's likely some benefit to
most cases. There might need to be a bit more education or
documentation about best column ordering practices.
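The idea of starting the walk at the last attribute with a known offset can be sketched in standalone C as below. This is a toy model only: it ignores alignment and NULLs, and it uses a made-up one-byte length prefix for variable-width values rather than the real varlena format. ToyAttr, toy_last_cached and toy_attr_offset are hypothetical names, not code from the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy attribute descriptor.  attlen > 0 is a fixed width in bytes;
 * attlen == -1 marks a variable-width value whose first byte stores its
 * total length (a simplification, not the real varlena format).
 * attcacheoff is the known byte offset, or -1 when it is not known.
 */
typedef struct ToyAttr
{
    int attlen;
    int attcacheoff;
} ToyAttr;

/*
 * Index of the last attribute at or before 'target' with a known offset.
 * Attribute 0 always starts at offset 0, so the search stops there.
 */
static int
toy_last_cached(const ToyAttr *attrs, int target)
{
    int i = target;

    while (i > 0 && attrs[i].attcacheoff < 0)
        i--;
    return i;
}

/*
 * Byte offset of attribute 'target' within 'data'.  '*steps' receives the
 * number of attributes that had to be walked over to get there.
 */
static size_t
toy_attr_offset(const ToyAttr *attrs, const unsigned char *data,
                int target, int *steps)
{
    int    start;
    size_t off;

    assert(target >= 0);
    start = toy_last_cached(attrs, target);
    off = (start == 0) ? 0 : (size_t) attrs[start].attcacheoff;
    *steps = 0;

    for (int i = start; i < target; i++)
    {
        /* fixed width: add attlen; variable width: read the length byte */
        off += (attrs[i].attlen > 0) ? (size_t) attrs[i].attlen
                                     : (size_t) data[off];
        (*steps)++;
    }
    return off;
}
```

For a layout of two 4-byte columns, one variable-width column and two more 4-byte columns, fetching the last column starts at the variable-width one and walks two attributes instead of four.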
There are a few hurdles to make this work, and one is the physical
tlist optimization. If the planner replaces the targetlist with a
physical tlist, the executor is going to think we need all columns,
which would have it likely choose not to do the selective deforming.
To make this work, I've added some code in createplan.c to extract the
attnums we need from the qual and tlist before the physical tlist is
installed. That's recorded in a Bitmapset and passed down to the
executor and to the code which sets up the ExprStates. Currently,
mostly to exercise this code as much as possible, I've coded it to
always do the selective deforming when the Bitmapset isn't empty. So
far, I've only done this for Seq Scan, but I expect all the scans that
deform tuples could use this.
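To show why the needed attnums have to be recorded before the physical tlist is installed, here is a small standalone C sketch. It assumes a 64-bit mask instead of a Bitmapset and plain int arrays instead of the planner's qual and tlist expressions; needed_mask, toy_collect_attnums and toy_physical_tlist_attrs are hypothetical names, not functions from createplan.c or the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a Bitmapset of 1-based attnums (max 64). */
typedef uint64_t needed_mask;

/* Add every attnum referenced by one expression list to 'set'. */
static needed_mask
toy_collect_attnums(const int *attnums, int n, needed_mask set)
{
    for (int i = 0; i < n; i++)
    {
        assert(attnums[i] >= 1 && attnums[i] <= 64);
        set |= (needed_mask) 1 << (attnums[i] - 1);
    }
    return set;
}

/*
 * What the executor would see if it only looked at a physical tlist:
 * every column of the table, attnums 1..natts.
 */
static needed_mask
toy_physical_tlist_attrs(int natts)
{
    assert(natts >= 1 && natts <= 64);
    if (natts == 64)
        return ~(needed_mask) 0;
    return ((needed_mask) 1 << natts) - 1;
}
```

Collecting from a qual and a tlist that both reference only attnum 43 yields a single-bit mask, whereas the physical tlist for a 43-column table would report all 43 attributes as needed.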
I've attached the code which does all this in the 0006 patch.
Ideally, I'd have had this at least to the current state about 2-3
months ago, so I don't intend 0006 to be v19 material, but I wanted
to share it to show where I intend this work to go.
Performance:
Using the t_1_40 table from the deform_test_setup.sh script I sent in
[1], running "select a from t_1_40 where a = 0;" ("a" is the 43rd
column in that table), on my Zen2 machine, I get the following from
perf top and pgbench:
master:
75.57% postgres [.] tts_buffer_heap_getsomeattrs
4.70% postgres [.] ExecInterpExpr
2.85% postgres [.] ExecSeqScanWithQualProject
1.94% postgres [.] heapgettup_pagemode
1.21% postgres [.] UnlockBuffer
1.15% postgres [.] slot_getsomeattrs_int
$ for i in {1..3}; do pgbench -n -f bench.sql -M prepared -T 10 postgres | grep latency; done
latency average = 154.175 ms
latency average = 156.780 ms
latency average = 157.599 ms
0001-0005:
64.24% postgres [.] tts_buffer_heap_getsomeattrs
15.01% postgres [.] ExecInterpExpr
3.22% postgres [.] ExecSeqScanWithQualProject
3.01% postgres [.] heapgettup_pagemode
1.57% postgres [.] ExecStoreBufferHeapTuple
1.53% postgres [.] heap_prepare_pagescan
$ for i in {1..3}; do pgbench -n -f bench.sql -M prepared -T 10 postgres | grep latency; done
latency average = 130.981 ms
latency average = 134.700 ms
latency average = 134.898 ms
0001-0006:
42.28% postgres [.] heapgettup_pagemode
11.38% postgres [.] ExecInterpExpr
7.13% postgres [.] ExecSeqScanWithQualProject
5.92% postgres [.] tts_buffer_heap_selectattrs <-- it's down here.
5.69% postgres [.] ExecStoreBufferHeapTuple
5.11% postgres [.] heap_getnextslot
3.87% postgres [.] heap_prepare_pagescan
$ for i in {1..3}; do pgbench -n -f bench.sql -M prepared -T 10 postgres | grep latency; done
latency average = 71.689 ms
latency average = 75.638 ms
latency average = 75.149 ms
Keep in mind that this is one of the best cases as t_1_40 has no NULLs
and only has fixed-width columns. The only slightly better case would
be to add more columns and fetch only the final one. 40 doesn't seem
excessively unrealistic, to get an idea of the gains that someone
*could* see.
You can see that perf top reports that tts_buffer_heap_getsomeattrs
dropped from taking 75.57% down to 64.24% with 0001-0005. Adding 0006
sees that replaced with tts_buffer_heap_selectattrs which takes less
than 6% of the CPU time. It also highlights the next most interesting
thing we should probably make faster, heapgettup_pagemode().
I've attached v12 of the patch. There are a few changes in 0001-0005
that should help make things a bit faster than v11. I've also attached
the new selective deforming code in 0006. There's no JIT support for
0006 yet, I don't need to be told about that :-)
I'm planning on starting to go through 0002-0005 in much more detail
from mid next week with my committer hat on. If anyone wants to relook
at any of the 0002-0005 patches, there's still time. I'm also happy to
receive feedback on 0006, but I will address concerns with that at a
lower priority. One thing that's still left to do in the 0004 patch is
to enable the TTS_FLAG_OBEYS_NOT_NULL_CONSTRAINTS optimisation for a few
other scan types.
Thanks for reading
David
[1] https://postgr.es/m/CAApHDvo1i-ycAcWnK3L7ZASTuM8mW46kvRqMaUHD46HSuJmx7A@mail.gmail.com
| Attachment | Content-Type | Size |
|---|---|---|
| v12-0001-Introduce-deform_bench-test-module.patch | text/plain | 7.3 KB |
| v12-0002-Allow-sibling-call-optimization-in-slot_getsomea.patch | text/plain | 7.3 KB |
| v12-0003-Add-empty-TupleDescFinalize-function.patch | text/plain | 29.0 KB |
| v12-0004-Optimize-tuple-deformation.patch | text/plain | 81.3 KB |
| v12-0005-Reduce-size-of-CompactAttribute-struct-to-8-byte.patch | text/plain | 5.7 KB |
| v12-0006-WIP-Introduce-selective-tuple-deforming.patch | text/plain | 42.0 KB |