Re: More speedups for tuple deformation

From: Andres Freund <andres(at)anarazel(dot)de>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: More speedups for tuple deformation
Date: 2026-01-28 16:26:51
Message-ID: p2oo6g2fg5nxbsspbyeauxhbyyi3zsx3tt4prkgju3j5qjo7lf@rjlhxusxzam2
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2026-01-28 02:34:26 +1300, David Rowley wrote:
> On Sat, 24 Jan 2026 at 05:33, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I wonder if it's worth writing a C helper to test deformation in a bit more
> > targeted way.
>
> Good idea. I've written a test module called "deform_bench". You can
> do: "select deform_bench('tablename'::regclass, '{10,20}');" which
> will deform up to attnum=10, then in a 2nd pass deform up to
> attnum=20. This is in the 0003 patch. (Requires "ninja
> install-test-files"). 0003 is intended for testing, not commit.

Nice! I am trying very hard to restrain myself from playing with it right
now, because I really need to get some other things done first...

> /*
> * slot_deform_heap_tuple
> * Given a TupleTableSlot, extract data from the slot's physical tuple
> @@ -1122,78 +1010,140 @@ static pg_attribute_always_inline void
> slot_deform_heap_tuple(TupleTableSlot *slot, HeapTuple tuple, uint32 *offp,
> int natts)
> {
> + CompactAttribute *cattr;
> + TupleDesc tupleDesc = slot->tts_tupleDescriptor;
> bool hasnulls = HeapTupleHasNulls(tuple);
> + HeapTupleHeader tup = tuple->t_data;
> + bits8 *bp; /* ptr to null bitmap in tuple */
> int attnum;
> + int firstNonCacheOffsetAttr;
> + int firstNullAttr;
> + Datum *values;
> + bool *isnull;
> + char *tp; /* ptr to tuple data */
> uint32 off; /* offset in tuple data */
> - bool slow; /* can we use/set attcacheoff? */
> +
> + /* Did someone forget to call TupleDescFinalize()? */
> + Assert(tupleDesc->firstNonCachedOffAttr >= 0);
>
> /* We can only fetch as many attributes as the tuple has. */
> - natts = Min(HeapTupleHeaderGetNatts(tuple->t_data), natts);
> + natts = Min(HeapTupleHeaderGetNatts(tup), natts);
> + attnum = slot->tts_nvalid;
> + firstNonCacheOffsetAttr = Min(tupleDesc->firstNonCachedOffAttr, natts);

FWIW, in a few experiments on my cascade lake systems, this branch (well, it
ends up as a cmov) ends up causing a surprisingly large performance
bottleneck. I don't really see a way around that, but I thought I'd mention it.

On the topic of tupleDesc->firstNonCachedOffAttr - shouldn't that be an
AttrNumber? Not that it'll make a difference perf or space wise, just for
clarity.

Hm, I guess natts isn't an AttrNumber either. Not sure why?

> + if (hasnulls)
> + {
> + bp = tup->t_bits;
> + firstNullAttr = first_null_attr(bp, natts);
> + firstNonCacheOffsetAttr = Min(firstNonCacheOffsetAttr, firstNullAttr);
> + }
> + else
> + {
> + bp = NULL;
> + firstNullAttr = natts;
> + }
> +
> + values = slot->tts_values;
> + isnull = slot->tts_isnull;
> + tp = (char *) tup + tup->t_hoff;

Another stall I see is due to the t_hoff computation - which makes sense, it's
in the tuple header and none of the deforming can happen without knowing the
address. I think in the !hasnulls case, the only influence on it is
MAXALIGN(offsetof(HeapTupleHeaderData, t_bits)), so we could just hardcode
that?

Separately, sometimes - I haven't figured out when - gcc seems to think it's
smart to actually compute the `tp + cattr->attcacheoff` below using tup and
tup->t_hoff stored in registers (i.e. doing multiple adds). When the code is
generated that way, I see substantially worse performance. Have you seen
that?

> + else if (attnum == 0)
> {
> /* Start from the first attribute */
> off = 0;
> - slow = false;
> }
> else
> {
> /* Restore state from previous execution */
> off = *offp;
> - slow = TTS_SLOW(slot);
> }

Do we actually need both of these branches? Shouldn't *offp be set to 0 in the
attnum == 0 case?

> - if (!slow)
> + for (; attnum < firstNullAttr; attnum++)
> {
> [...]
> + cattr = TupleDescCompactAttr(tupleDesc, attnum);
> +
> + /* align the offset for this attribute */
> + off = att_pointer_alignby(off,
> + cattr->attalignby,
> + cattr->attlen,
> + tp + off);
> +
> + values[attnum] = fetchatt(cattr, tp + off);
> + isnull[attnum] = false;
> +
> + /* move the offset beyond this attribute */
> + off = att_addlength_pointer(off, cattr->attlen, tp + off);
> }

A few thoughts / suggestions:

1) We should not update values[]/isnull[] between fetchatt() and
att_addlength_pointer(). The compiler can't figure out that no fields in
cattr or *(tp + off) are being affected by those stores.

Changing this on master improves performance quite noticeably. I see a 13%
improvement in a test with deforming 5 not-null byval columns.

2) I sometime see performance benefits due to moving the isnull[attnum] =
false; to the beginning of the loop. Which makes some sense, starting the
store earlier allows it to complete earlier, and it doesn't depend on
fetching cattr, aligning the pointer, fetching the attribute and adjusting
the offset.

3) I briefly experimented with this code, and I think we may be able to
optimize the combination of att_pointer_alignby(), fetch_att() and
att_addlength_pointer(). They all do quite related work, and for byvalue
types, we know at compile time what the alignment requirement for each of
the supported attlen is.

> + /*
> + * Now handle any remaining tuples, this time include NULL checks as we're
> + * now at the first NULL attribute.
> + */
> + for (; attnum < natts; attnum++)
> {
> - /* XXX is it worth adding a separate call when hasnulls is false? */
> - attnum = slot_deform_heap_tuple_internal(slot,
> - tuple,
> - attnum,
> - natts,
> - true, /* slow */
> - hasnulls,
> - &off,
> - &slow);
> + if (att_isnull(attnum, bp))
> + {
> + values[attnum] = (Datum) 0;
> + isnull[attnum] = true;
> + continue;
> + }

Have you experimented setting isnull[] in a dedicated loop if there are nulls
and then in this loop just checking isnull[attnum]? Seems like that could
perhaps be combined with the work in first_null_attr() and be more efficient
than doing an att_isnull() separately for each column.

Greetings,

Andres Freund

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2026-01-28 16:53:59 Re: pgsql: Prevent invalidation of newly synced replication slots.
Previous Message Álvaro Herrera 2026-01-28 16:23:39 Re: Custom oauth validator options