Re: More speedups for tuple deformation

From: Andres Freund <andres(at)anarazel(dot)de>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: More speedups for tuple deformation
Date: 2026-01-30 17:11:44
Message-ID: e7sto7tk5dk5hfyvoocaddnxcngemcmfvbuh23l32w5cssaizy@znuphjqug7qe
Lists: pgsql-hackers

Hi,

On 2026-01-31 00:10:42 +1300, David Rowley wrote:
> Thank you for looking at this again.
>
> On Thu, 29 Jan 2026 at 05:26, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On 2026-01-28 02:34:26 +1300, David Rowley wrote:
> > > + firstNonCacheOffsetAttr = Min(tupleDesc->firstNonCachedOffAttr, natts);
> >
> > FWIW, in a few experiments on my Cascade Lake systems, this branch (well, it
> > ends up as a cmov) causes a surprisingly large performance bottleneck. I
> > don't really see a way around that, but I thought I'd mention it.
>
> Yeah, I believe this is the primary reason that I'm fighting the small
> regression on the 0 extra column test. I thought it might be because
> the mov depends on, and must wait for, natts being calculated, which
> needs to access fields in the tuple header.

I agree that it's related to that. You have a chain of multiple computations
that lead to a higher aggregate latency.

natts depends on HeapTupleHeaderGetNatts(tup), firstNonCachedOffsetAttr
depends on both tupleDesc->firstNonCachedOffAttr and natts.

Afaict the dependencies are like the following:

0) a memory load (for tuple->t_data), likely to be cached
1) a memory load (for HeapTupleHeaderGetNatts(tup)), likely to miss memory,
depends on 0)
2) some bit-shiftery to compute natts, depends on 1)
3) a conditional move (the Min() clamping natts), depends on 2)
4) a memory load (for slot->tupdesc), no dependencies, likely cached
5) a memory load (for tupdesc->firstNonCachedOffAttr), depends on 4), cached
6) a conditional move (the Min() computing firstNonCacheOffsetAttr), depends on 3) and 5)
7) a memory load (for HeapTupleHasNulls()), likely to miss memory, but can
piggyback on the cacheline load from 1), depends on 0)
8) a conditional branch (if (hasnulls)), depends on 7)

And that's without even taking the hasnulls == true case into account.

Before all of those are completely executed ("retired"), none of the
speculatively executed instructions, e.g. the contents of the if (attnum <
firstNonCacheOffsetAttr) branch, can be retired.
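For illustration, here is a toy C sketch of that prologue with the numbered steps from the list annotated inline. All names here (ToyHeader, toy_deform_prologue, the TOY_* macros) are invented stand-ins, not the real slot_deform_heap_tuple code:

```c
#include <stdint.h>

#define TOY_NATTS_MASK 0x07FF
#define TOY_HASNULL    0x0001
#define TOY_MIN(a, b)  ((a) < (b) ? (a) : (b))

/* toy stand-ins for HeapTupleHeaderData / HeapTupleData / TupleDesc */
typedef struct ToyHeader
{
    uint16_t    t_infomask;     /* has-nulls flag lives here */
    uint16_t    t_infomask2;    /* natts lives in the low bits */
} ToyHeader;

typedef struct ToyTuple
{
    ToyHeader  *t_data;
} ToyTuple;

typedef struct ToyDesc
{
    int         natts;
    int         firstNonCachedOffAttr;
} ToyDesc;

/*
 * The numbered comments correspond to the dependency list above; the
 * deforming loop cannot retire any work until this chain has resolved.
 */
static int
toy_deform_prologue(ToyTuple *tuple, ToyDesc *desc, int *hasnulls)
{
    ToyHeader  *hdr = tuple->t_data;                /* 0) load, cached */
    uint16_t    mask2 = hdr->t_infomask2;           /* 1) load, may miss */
    int         raw_natts = mask2 & TOY_NATTS_MASK; /* 2) bit-shiftery */
    int         natts = TOY_MIN(raw_natts, desc->natts); /* 3) cmov */
    int         first = desc->firstNonCachedOffAttr;     /* 4) + 5) loads */
    int         firstNonCacheOffsetAttr = TOY_MIN(first, natts); /* 6) cmov */

    *hasnulls = (hdr->t_infomask & TOY_HASNULL) != 0;    /* 7) + 8) */
    return firstNonCacheOffsetAttr;
}
```

The two loads at 4) and 5) are off the critical path, but the cmov at 6) joins them with the header-derived chain, which is where the latency adds up.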

This is why I like the idea of keeping track of whether we can rely on NOT
NULL columns to be present (I think that means we're evaluating expressions
other than constraint checks for new rows). It allows the leading NOT NULL
fixed-width columns to be decoded without having to wait for a good chunk of
the computations above. That's a performance boon even if we later have
nullable or varlength columns.
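As a rough illustration of why that helps (toy code with invented names, not the actual patch): leading NOT NULL fixed-width columns sit at offsets known from the descriptor alone, so they can be copied out before natts or the null bitmap has even been loaded from the tuple header.

```c
#include <stdint.h>
#include <string.h>

/*
 * Toy attribute descriptor: cached_off is a precomputed byte offset,
 * valid because every earlier column is NOT NULL and fixed width.
 */
typedef struct ToyAttr
{
    int         cached_off;     /* precomputed offset of this column */
    int         attlen;         /* 4 or 8 in this sketch */
} ToyAttr;

static void
deform_leading_notnull(const char *tp, const ToyAttr *attrs, int nleading,
                       int64_t *values)
{
    for (int i = 0; i < nleading; i++)
    {
        const char *src = tp + attrs[i].cached_off;

        /* no natts clamp, no null-bitmap probe needed on this path */
        if (attrs[i].attlen == 4)
        {
            int32_t     v;

            memcpy(&v, src, sizeof(v));
            values[i] = v;
        }
        else
        {
            int64_t     v;

            memcpy(&v, src, sizeof(v));
            values[i] = v;
        }
    }
}
```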

> I wonder if there's some reason the compiler or CPU can't defer calculating
> firstNonCacheOffsetAttr until later. Maybe I should try moving it later in
> the code to see if that helps.

I think it's the opposite: you want to compute it as early as possible, so that
the first elements of the dependency chain are started before the rest and the
latency impact is lowest by the time the value is required.

> > On the topic of tupleDesc->firstNonCachedOffAttr - shouldn't that be an
> > AttrNumber? Not that it'll make a difference perf or space wise, just for
> > clarity.
> >
> > Hm, I guess natts isn't an AttrNumber either. Not sure why?
>
> I noticed that too, but took the path of least resistance and made
> firstNonCachedOffAttr an int too.

Probably reasonable.

> I did wonder why natts wasn't an AttrNumber. If they both were AttrNumbers,
> I wouldn't need to make the TupleDesc struct bigger. Right now, I've
> enlarged it by 8 bytes by adding firstNonCachedOffAttr.

One related thing: I noticed a speedup a few days ago when making sure that
both the slot and the descriptor are aligned to a cacheline boundary,
presumably because it reduces the number of cachelines that need to be in L1
for good performance. But the results were somewhat inconsistent.

Separately, I was seeing performance changes in cases where I shouldn't really
see them on the Cascade Lake system. After a while I noticed that the
differences I was seeing were due to differences in how effective the uop
cache is. A bunch of searching led to information about the "JCC erratum",
which in turn can be mitigated via "-Wa,-mbranches-within-32B-boundaries". I
am not actually clear whether my CPU is directly affected by that erratum, but
nonetheless, it improved performance quite substantially.

> One problem is that a bunch of functions accept int:
> CreateTemplateTupleDesc(int natts), CreateTupleDesc(int natts,
> Form_pg_attribute *attrs). Then BuildDescFromLists() sets natts based on
> list_length(). Maybe CreateTemplateTupleDesc() could Assert or throw an
> error if natts does not fit in 16-bits.

We'd probably fail with larger values anyway, so it might be a good idea to
detect that regardless of changing the argument type.
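Such a guard could be as simple as the following (hypothetical helper, sketched as a standalone function; the real check would live in the tupdesc creation paths):

```c
#include <stdint.h>

/*
 * Hypothetical range guard: if natts-like fields were narrowed to 16
 * bits, creation paths such as CreateTemplateTupleDesc() could reject
 * out-of-range attribute counts up front instead of silently
 * truncating.
 */
static int
natts_fits_in_int16(int natts)
{
    return natts >= 0 && natts <= INT16_MAX;
}
```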

> >
> > > + if (hasnulls)
> > > + {
> > > + bp = tup->t_bits;
> > > + firstNullAttr = first_null_attr(bp, natts);
> > > + firstNonCacheOffsetAttr = Min(firstNonCacheOffsetAttr, firstNullAttr);
> > > + }
> > > + else
> > > + {
> > > + bp = NULL;
> > > + firstNullAttr = natts;
> > > + }
> > > +
> > > + values = slot->tts_values;
> > > + isnull = slot->tts_isnull;
> > > + tp = (char *) tup + tup->t_hoff;
> >
> > Another stall I see is due to the t_hoff computation - which makes sense, it's
> > in the tuple header and none of the deforming can happen without knowing the
> > address. I think in the !hasnulls case, the only influence on it is
> > MAXALIGN(offsetof(HeapTupleHeaderData, t_bits)), so we could just hardcode
> > that?
>
> hmm. I wonder why it even needs to exist.

There are a lot of questions like that in our on-disk format. There are a lot
of decisions in there that make fast decoding way harder than it has to be...

> If the null bitmap is there,
> you can calculate how many bytes from natts. I tried doing "tp = (char
> *) tup + MAXALIGN(offsetof(HeapTupleHeaderData, t_bits));" for the
> !hasnulls case and it's hard to tell if it helps. See the 0004 patch.

Interestingly, it helps quite measurably here.

> I'm somewhat hesitant to go against the grain here on how to calculate where
> the tuple data starts.

Hm. I don't think it's particularly risky. We could even add an error branch
for that assumption being violated - as long as the other computations don't
depend on the result of the conditional branch, the impact on performance
should be quite minimal.
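To make that concrete, here is a toy header mimicking the real field layout (Toy/MY_ names are stand-ins; the real macros live in c.h and htup_details.h). In the !hasnulls case the header ends right before t_bits, so the offset is a compile-time constant:

```c
#include <stddef.h>
#include <stdint.h>

#define MY_MAXIMUM_ALIGNOF 8
#define MY_TYPEALIGN(ALIGNVAL, LEN) \
    (((uintptr_t) (LEN) + ((ALIGNVAL) - 1)) & ~((uintptr_t) ((ALIGNVAL) - 1)))
#define MY_MAXALIGN(LEN) MY_TYPEALIGN(MY_MAXIMUM_ALIGNOF, (LEN))

/* toy header with the same field layout as HeapTupleHeaderData */
typedef struct ToyHeapHeader
{
    uint32_t    t_xmin;         /* stands in for the t_choice union */
    uint32_t    t_xmax;
    uint32_t    t_field3;
    uint16_t    t_ctid[3];      /* stands in for ItemPointerData */
    uint16_t    t_infomask2;
    uint16_t    t_infomask;
    uint8_t     t_hoff;
    uint8_t     t_bits[];       /* null bitmap, absent when !hasnulls */
} ToyHeapHeader;

/* with no null bitmap, the tuple data must start at this constant */
enum { TOY_NO_NULLS_HOFF = MY_MAXALIGN(offsetof(ToyHeapHeader, t_bits)) };
```

An error branch checking tup->t_hoff == TOY_NO_NULLS_HOFF in the !hasnulls case would catch the assumption being violated without hurting the critical path, since no later computation consumes the comparison result.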

> > > + else if (attnum == 0)
> > > {
> > > /* Start from the first attribute */
> > > off = 0;
> > > - slow = false;
> > > }
> > > else
> > > {
> > > /* Restore state from previous execution */
> > > off = *offp;
> > > - slow = TTS_SLOW(slot);
> > > }
> >
> > Do we actually need both of these branches? Shouldn't *offp be set to 0 in the
> > attnum == 0 case?
>
> I see tts_heap_clear() zeros it, so I think it should be ok. Doesn't
> feel quite as robust, however.

Should be easy enough to assert that it's correct. I don't think it's very
likely we'd just forget resetting it. It's already catastrophic if we set it
wrongly.

> > 3) I briefly experimented with this code, and I think we may be able to
> > optimize the combination of att_pointer_alignby(), fetch_att() and
> > att_addlength_pointer(). They all do quite related work, and for byvalue
> > types, we know at compile time what the alignment requirement for each of
> > the supported attlen is.
>
> Is this true? Isn't there some nearby discussion about AIX having
> 4-byte double alignment?

The AIX stuff is just bonkers. Having a double's alignment depend on where it
appears in an aggregate is just insane.

We could probably just accommodate that with a compile-time ifdef.

> I've taken a go at implementing a function called align_fetch_then_add(),
> which rolls all the macros into one (See 0004). I just can't see any
> improvements with it. Maybe I've missed something that could be more
> optimal. I did even ditch one of the cases from the switch(attlen). It might
> be ok to do that now as we can check for invalid attlens for byval types
> when we populate the CompactAttribute.

You could move the TYPEALIGN(attalignby, *off) calls into the attlen switch
and call them with constant arguments.

> > Have you experimented setting isnull[] in a dedicated loop if there are nulls
> > and then in this loop just checking isnull[attnum]? Seems like that could
> > perhaps be combined with the work in first_null_attr() and be more efficient
> > than doing an att_isnull() separately for each column.
>
> Yes. I experimented with that quite a bit. I wasn't able to make it any
> faster than setting the isnull element in the same loop as the
> tts_values element. What I did try was having a dedicated tight loop
> like: for (int i = attnum; i < firstNullAttr; i++) isnull[i] = false;,
> but the compiler would always try to optimise that into an inlined
> memset which would result in poorly performing code in cases with a
> small number of columns due to the size and alignment prechecks.

Yea, that kind of transformation is pretty annoying and makes little sense
here :(.

I was thinking of actually computing the value of isnull[] based on the null
bitmap (as you also try below).

> In the attached 0004 patch I've experimented with this again. This
> time, I wrote a function that converts the null bitmap into the isnull
> array using a lookup table.

Oh. I was just thinking of something roughly like

    int     nullbyte_i = attnum >> 3;

    for (int nullcol = attnum; nullcol < natts;)
    {
        bits8   nullbyte = bp[nullbyte_i++];

        for (int onebit = 0; onebit < 8; onebit++)
        {
            if (nullcol < natts)
                tts_isnull[nullcol] = !(nullbyte & 0x01);   /* set bit = not null */
            nullbyte >>= 1;
            nullcol++;
        }
    }

This isn't quite right, as we'd need to deal with starting to deform at an
attribute that's not % 8 = 0 (I'd probably just do the whole byte even if we'd
redo a few columns). And probably lots of other stuff.

With a bit of care the inner loop should be unrollable with all the moves as
conditional moves depending on nullcol < natts. Or such.

Or we could just make sure tts_isnull is always sized to be divisible by 8,
that'd presumably allow considerably better code to be generated.
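With that padding assumption, the inner loop needs no bounds check at all and each bitmap byte expands unconditionally. A sketch (expand_null_bitmap is a made-up name; it assumes isnull[] is sized to a multiple of 8 so writing a few slots past natts is safe, and starts at column 0 for simplicity):

```c
#include <stdint.h>

typedef uint8_t bits8;

/*
 * Expand the heap null bitmap into one byte per column.  A set bit in
 * the bitmap means "not null", hence the negation.  Relies on isnull[]
 * being padded to a multiple of 8 entries.
 */
static void
expand_null_bitmap(const bits8 *bp, int natts, uint8_t *isnull)
{
    for (int base = 0; base < natts; base += 8)
    {
        bits8       nullbyte = bp[base >> 3];

        for (int bit = 0; bit < 8; bit++)   /* unconditional: padding */
        {
            isnull[base + bit] = !(nullbyte & 0x01);
            nullbyte >>= 1;
        }
    }
}
```

The fixed-trip-count inner loop is exactly the shape compilers unroll into branch-free code.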

I'd hope this would be more efficient than doing

static inline bool
att_isnull(int ATT, const bits8 *BITS)
{
    return !(BITS[ATT >> 3] & (1 << (ATT & 0x07)));
}

for each column.

> I spent a bit of time trying to figure out a way to do this without the
> lookup table and only came up with a method that requires AVX512
> instructions. I coded that up, but it requires building with
> -march=x86-64-v4, which will likely cause many other reasons for the
> performance to vary.

Yea, I doubt we want that... Too many tuples will be too short to benefit.

> The machine that likes 0004 the most (using the lookup table method of
> setting the isnull array) is the Apple M2. All the tests apart from
> the 0 extra column test became 30-90% faster. Previously the tests
> that had to do att_isnull didn't improve very much. The 0 extra column
> test regressed quite a bit. 50% slower on all but test 1 and 5 (the
> ones without NULLs). See the attached graph. The Zen2 machine also
> perhaps quite likes it, but not for the 0 extra column test. I'm
> struggling to get stable performance results from that machine right
> now. My Zen 4 laptop isn't a fan of it, but I'm also not getting very
> stable performance results from that either.

Hm. A 2kB lookup table in the middle of deforming is probably not great for
L1...
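Presumably the 2kB comes from a 256 x 8 byte table: one precomputed row of isnull values per possible bitmap byte. Roughly like this (my reconstruction for illustration, not David's actual 0004 code; it assumes isnull[] is padded so copying whole 8-entry rows is safe):

```c
#include <stdint.h>
#include <string.h>

typedef uint8_t bits8;

/* 256 rows x 8 bytes = 2048 bytes, one row per possible bitmap byte */
static uint8_t isnull_lut[256][8];

static void
build_isnull_lut(void)
{
    for (int b = 0; b < 256; b++)
        for (int bit = 0; bit < 8; bit++)
            isnull_lut[b][bit] = !(b & (1 << bit)); /* set bit = not null */
}

/* copy one precomputed 8-entry row per bitmap byte */
static void
expand_with_lut(const bits8 *bp, int natts, uint8_t *isnull)
{
    for (int base = 0; base < natts; base += 8)
        memcpy(isnull + base, isnull_lut[bp[base >> 3]], 8);
}
```

That trades the per-column bit tests for one 8-byte copy per bitmap byte, at the cost of keeping 2kB of table warm in L1.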

> I'm curious to see what your Intel machines think of 0004 vs not having it.

Scheduling an experiment with it.

Greetings,

Andres Freund
