Re: More speedups for tuple deformation

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: More speedups for tuple deformation
Date: 2026-01-27 13:34:26
Message-ID: CAApHDvo1i-ycAcWnK3L7ZASTuM8mW46kvRqMaUHD46HSuJmx7A@mail.gmail.com
Lists: pgsql-hackers

On Sat, 24 Jan 2026 at 05:33, Andres Freund <andres(at)anarazel(dot)de> wrote:
> I wonder if it's worth writing a C helper to test deformation in a bit more
> targeted way.

Good idea. I've written a test module called "deform_bench". You can
run: "select deform_bench('tablename'::regclass, '{10,20}');", which
deforms up to attnum=10 and then, in a second pass, deforms up to
attnum=20. This is in the 0003 patch (it requires "ninja
install-test-files"). 0003 is intended for testing, not for commit.

There are also 2 scripts attached, one which sets up all the tables
for the benchmark, and one to run it. This saves creating the same
tables again when trying other branches or compilers.

I've also included a slightly revised patch. I made a small change to
first_null_attr() to get rid of the masking of higher attnums, and it
now uses __builtin_ctz to find the first NULL attnum in the byte. For
compilers that don't support that, I've included a
pg_rightmost_*zero*_pos table. I didn't want to use the pg_bitutils
table for the rightmost *one* pos, as that would mean special-casing
index 255, which would return 0 when I want 8. I'll make the MSVC
version use _BitScanForward() in the next patch. Using __builtin_ctz()
seems to help reduce the small regression I was seeing with the 0
extra column test. It's still there, but it is very small. It's more
pronounced with the deform_bench module because the other execution
overheads are reduced.
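
As a rough illustration only (this is not the code from the patch),
the "rightmost zero" lookup described above could be sketched as
follows. The function name rightmost_zero_pos_sketch is hypothetical,
and the portable branch uses a simple loop as a stand-in for the
lookup table mentioned above. ORing in bit 8 is an assumption made
here so that 0xFF yields 8 without calling __builtin_ctz(0), which is
undefined.

```c
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: return the position (0..7)
 * of the lowest 0 bit in "b", or 8 when every bit is set (b == 0xFF).
 */
static inline int
rightmost_zero_pos_sketch(uint8_t b)
{
#if defined(__GNUC__) || defined(__clang__)
	/*
	 * Invert the byte so that 0 bits become 1 bits, then count trailing
	 * zeros.  Setting bit 8 keeps the argument non-zero and makes the
	 * all-ones input return 8.
	 */
	return __builtin_ctz((unsigned int) (uint8_t) ~b | 0x100u);
#else
	/* Portable stand-in for a 256-entry lookup table. */
	int			pos = 0;

	while (pos < 8 && (b & (1u << pos)) != 0)
		pos++;
	return pos;
#endif
}
```

With this sketch, a byte of 0x07 (lowest three bits set) gives 3, and
0xFF gives 8, which is the behaviour wanted for a byte with no NULLs.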

Technically, first_null_attr() *could* contain slightly fewer checks.
It should be guaranteed that we'll find a byte not set to 255, as
there wouldn't be a bitmask there if there were no 0s. So the first
for loop could be a while (byte[bytenum] == 0xFF) bytenum++;. I just
felt that would be too dangerous, as the code would walk off the end
of the bitmask if the tuple was corrupted in the right way.
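
The trade-off above can be sketched like this. It is a minimal
illustration under stated assumptions, not the patch's
first_null_attr(): the name first_null_attr_sketch, its signature, and
the clamp to natts are all hypothetical. The point is only that the
byte scan is bounded by the bitmap length, so an all-0xFF bitmap in a
corrupted tuple cannot make it read past the end.

```c
#include <stdint.h>

/*
 * Hypothetical sketch: return the 0-based index of the first attribute
 * whose bit is 0 (NULL) in "bits", or "natts" if no such bit is found
 * within the first natts bits.
 */
static int
first_null_attr_sketch(const uint8_t *bits, int natts)
{
	int			nbytes = (natts + 7) / 8;
	int			bytenum;
	int			pos;

	/*
	 * Bounded scan.  The unbounded form, while (bits[bytenum] == 0xFF)
	 * bytenum++;, would skip the bytenum < nbytes check but could run
	 * off the end of the bitmap if it contained no 0 bits.
	 */
	for (bytenum = 0; bytenum < nbytes; bytenum++)
	{
		if (bits[bytenum] != 0xFF)
			break;
	}

	if (bytenum == nbytes)
		return natts;			/* no 0 bit found at all */

	/* Find the lowest 0 bit in this byte (at least one exists here). */
	pos = 0;
	while ((bits[bytenum] & (1u << pos)) != 0)
		pos++;

	pos += bytenum * 8;

	/* Assumption of this sketch: ignore padding bits beyond natts. */
	return (pos < natts) ? pos : natts;
}
```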

With the reduced overhead from using deform_bench, the Apple M2
results are looking quite good. Test 5 with 20 extra columns is 128%
faster than master, and the average over all tests is ~25% faster than
master. My results are in the attached spreadsheet.

David

Attachment Content-Type Size
deform_test_setup.sh.txt text/plain 1.4 KB
deform_test_run.sh.txt text/plain 1.4 KB
v6-0001-Add-empty-TupleDescFinalize-function.patch text/plain 29.0 KB
v6-0002-Precalculate-CompactAttribute-s-attcacheoff.patch text/plain 49.9 KB
v6-0003-Introduce-deform_bench-test-module.patch text/plain 7.2 KB
Deform_bench_test_module_results_2026-01-28.xlsx application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 29.8 KB
