Re: More speedups for tuple deformation

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: More speedups for tuple deformation
Date: 2026-01-30 11:10:42
Message-ID: CAApHDvpbntG7V3_EsZ+w-V=jU-y8rFmv9RB1EDJm4sxKno-4UA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thank you for looking at this again.

On Thu, 29 Jan 2026 at 05:26, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2026-01-28 02:34:26 +1300, David Rowley wrote:
> > + firstNonCacheOffsetAttr = Min(tupleDesc->firstNonCachedOffAttr, natts);
>
> FWIW, in a few experiments on my cascade lake systems, this branch (well, it
> ends up as a cmov) ends up causing a surprisingly large performance
> bottleneck. I don't really see a way around that, but I thought I'd mention it.

Yeah, I believe this is the primary reason that I'm fighting the small
regression on the 0 extra column test. I thought it might be because
the mov has a dependency and wait on natts being calculated, which
needs to access fields in the tuple header. I wonder if there's some
reason the compiler to CPU can't defer calculating
firstNonCacheOffsetAttr until later. Maybe I should try moving it
later in the code to see if that helps.

> On the topic of tupleDesc->firstNonCachedOffAttr - shouldn't that be an
> AttrNumber? Not that it'll make a difference perf or space wise, just for
> clarity.
>
> Hm, I guess natts isn't an AttrNumber either. Not sure why?

I noticed that too, but took the path of least resistance and made
firstNonCachedOffAttr an int too. I did wonder why natts wasn't an
AttrNumber. If they both were AttrNumbers, I wouldn't need to make the
TupleDesc struct bigger. Right now, I've enlarged it by 8 bytes by
adding firstNonCachedOffAttr.

One problem is that a bunch of functions that accept int;
CreateTemplateTupleDesc(int natts), CreateTupleDesc(int natts,
Form_pg_attribute *attrs). Then BuildDescFromLists() sets natts based
on list_length(). Maybe CreateTemplateTupleDesc() could Assert or
throw an error if natts does not fit in 16-bits.

>
> > + if (hasnulls)
> > + {
> > + bp = tup->t_bits;
> > + firstNullAttr = first_null_attr(bp, natts);
> > + firstNonCacheOffsetAttr = Min(firstNonCacheOffsetAttr, firstNullAttr);
> > + }
> > + else
> > + {
> > + bp = NULL;
> > + firstNullAttr = natts;
> > + }
> > +
> > + values = slot->tts_values;
> > + isnull = slot->tts_isnull;
> > + tp = (char *) tup + tup->t_hoff;
>
> Another stall I see is due to the t_hoff computation - which makes sense, it's
> in the tuple header and none of the deforming can happen without knowing the
> address. I think in the !hasnulls case, the only influence on it is
> MAXALIGN(offsetof(HeapTupleHeaderData, t_bits)), so we could just hardcode
> that?

hmm. I wonder why it even needs to exist. If the null bitmap is there,
you can calculate how many bytes from natts. I tried doing "tp = (char
*) tup + MAXALIGN(offsetof(HeapTupleHeaderData, t_bits));" for the
!hasnulls case and it's hard to tell if it helps. See the 0004 patch.
I'm somewhat hesitant to go against the grain here on how to calculate
where the tuple data starts.

> Separately, sometimes - I haven't figured out when - gcc seems to think it's
> smart to actually compute the `tp + cattr->attcacheoff` below using tup and
> tup->t_hoff stored in registers (i.e. doing multiple adds). When the code is
> generated that way, I see substantially worse performance. Have you seen
> that?

I've not noticed that.

> > + else if (attnum == 0)
> > {
> > /* Start from the first attribute */
> > off = 0;
> > - slow = false;
> > }
> > else
> > {
> > /* Restore state from previous execution */
> > off = *offp;
> > - slow = TTS_SLOW(slot);
> > }
>
> Do we actually need both of these branches? Shouldn't *offp be set to 0 in the
> attnum == 0 case?

I see tts_heap_clear() zeros it, so I think it should be ok. Doesn't
feel quite as robust, however.

> A few thoughts / suggestions:
>
> 1) We should not update values[]/isnull[] between fetchatt() and
> att_addlength_pointer(). The compiler can't figure out that no fields in
> cattr or *(tp + off) are being affected by those stores.
>
> Changing this on master improves performance quite noticeably. I see a 13%
> improvement in a test with deforming 5 not-null byval columns.

That's easy enough. I've moved it up in the v7 patch before the cattr
assignment.

> 2) I sometime see performance benefits due to moving the isnull[attnum] =
> false; to the beginning of the loop. Which makes some sense, starting the
> store earlier allows it to complete earlier, and it doesn't depend on
> fetching cattr, aligning the pointer, fetching the attribute and adjusting
> the offset.

> 3) I briefly experimented with this code, and I think we may be able to
> optimize the combination of att_pointer_alignby(), fetch_att() and
> att_addlength_pointer(). They all do quite related work, and for byvalue
> types, we know at compile time what the alignment requirement for each of
> the supported attlen is.

Is this true? Isn't there some nearby discussion about AIX having
4-byte double alignment?

I've taken a go at implementing a function called
align_fetch_then_add(), which rolls all the macros into one (See
0004). I just can't see any improvements with it. Maybe I've missed
something that could be more optimal. I did even ditch one of the
cases from the switch(attlen). It might be ok to do that now as we can
check for invalid attlens for byval types when we populate the
CompactAttribute.

> Have you experimented setting isnull[] in a dedicated loop if there are nulls
> and then in this loop just checking isnull[attnum]? Seems like that could
> perhaps be combined with the work in first_null_attr() and be more efficient
> than doing an att_isnull() separately for each column.

Yes. I experiment with that quite a bit. I wasn't able to make it any
faster than setting the isnull element in the same loop as the
tts_values element. What I did try was having a dedicated tight loop
like; for (int i = attnum; i < firstNullAttr; i++) isnull[i] = false;,
but the compiler would always try to optimise that into an inlined
memset which would result in poorly performing code in cases with a
small number of columns due to the size and alignment prechecks. I had
given up on it as I was already fighting some performance regressions
for the 0 extra column test and this made those worse. However...

In the attached 0004 patch I've experimented with this again. This
time, I wrote a function that converts the null bitmap into the isnull
array using a lookup table. I spent a bit of time trying to figure out
a way to do this without the lookup table and only came up with a
method that requires AVX512 instructions. I coded that up, but it
requires building with -march=x86-64-v4, which will likely cause many
other reasons for the performance to vary.

The machine that likes 0004 the most (using the lookup table method of
setting the isnull array) is the Apple M2. All the tests apart from
the 0 extra column test became 30-90% faster. Previously the tests
that had to do att_isnull didn't improve very much. The 0 extra column
test regressed quite a bit. 50% slower on all but test 1 and 5 (the
ones without NULLs). See the attached graph. The Zen2 machine also
perhaps quite likes it, but not for the 0 extra column test. I'm
struggling to get stable performance results from that machine right
now. My Zen 4 laptop isn't a fan of it, but also not getting very
stable performance results from that either.

I'm curious to see what your Intel machines think of 0004 vs not having it.

Right now, I'm really only getting stable performance out of my Apple
M2 machine, so I'm not too sure what parts of 0004 I should include in
0002 and which ones I should throw away.

David

Attachment Content-Type Size
v7-0001-Add-empty-TupleDescFinalize-function.patch text/plain 29.0 KB
m2_on_v7_with_0004.gif image/gif 80.0 KB
v7-0002-Precalculate-CompactAttribute-s-attcacheoff.patch text/plain 50.0 KB
v7-0003-Introduce-deform_bench-test-module.patch text/plain 7.3 KB
v7-0004-Various-experimental-changes.patch text/plain 13.3 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Bertrand Drouvot 2026-01-30 11:16:26 Re: Consistently use the XLogRecPtrIsInvalid() macro
Previous Message Andrey Borodin 2026-01-30 11:09:34 Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)