Re: More speedups for tuple deformation

From: Andres Freund <andres(at)anarazel(dot)de>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: More speedups for tuple deformation
Date: 2026-02-03 00:33:27
Message-ID: v6z545yozjtywghn5glujemu72z4i4ynadsc2xks4ejotdg7yl@4rry7ixwr4us
Lists: pgsql-hackers

Hi,

On 2026-02-01 00:27:02 +1300, David Rowley wrote:
> On Sat, 31 Jan 2026 at 06:11, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > This is why I like the idea of keeping track of whether we can rely on NOT
> > NULL columns to be present (I think that means we're evaluating expressions
> > other than constraint checks for new rows). It allows the leading NOT NULL
> > fixed-width columns to be decoded without having to wait for a good chunk of
> > the computations above. That's a performance boon even if we later have
> > nullable or varlength columns.
>
> I can look into this. As we both know, we can't apply this
> optimisation in every case as there are places in the code which form
> then deform tuples before NOT NULL constraints are checked.

Right.

> Perhaps the slot can store a flag to mention if the optimisation is valid to
> apply or not. It doesn't look like the flag can be part of the TupleDesc
> since we cache those in relcache.

I wonder if we should do it the other way round - use a special flag (and
perhaps tuple descriptor) iff we are evaluating "unsanitized" tuples,
i.e. ones where the NOT NULLness might not yet be correct.
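
Roughly along these lines (flag name and bit value are just made up here):

/*
 * Set on slots holding tuples whose NOT NULL constraints haven't been
 * verified yet; deforming would skip the fast path that assumes
 * NOT NULLness whenever this is set.
 */
#define TTS_FLAG_UNSANITIZED	(1 << 5)
#define TTS_UNSANITIZED(slot)	(((slot)->tts_flags & TTS_FLAG_UNSANITIZED) != 0)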

> I'm imagining that TupleDescFinalize() calculates another field which could
> be the max cached offset that's got a NOT NULL constraint and isn't
> attmissing. I think this will need another dedicated loop in
> slot_deform_heap_tuple() to loop up to that attribute before doing the
> firstNonCacheOffsetAttr loop.

I was imagining that we'd use the new value to enter the
firstNonCacheOffsetAttr loop without having to depend on
HeapTupleHeaderGetNatts() & HeapTupleHasNulls(), i.e. just use it to avoid
depending on the memory fetch for the header having completed.
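
Something along these lines, very roughly - firstNotNullFixedAttr is a
made-up name for whatever TupleDescFinalize() would compute, the rest
follows slot_deform_heap_tuple()'s existing locals:

/*
 * firstNotNullFixedAttr (hypothetical field): number of leading attributes
 * that are NOT NULL, fixed width and have no missing value, i.e. that are
 * guaranteed to be physically present and non-null in every tuple.
 */
for (attnum = 0; attnum < tupleDesc->firstNotNullFixedAttr; attnum++)
{
    CompactAttribute *thisatt = TupleDescCompactAttr(tupleDesc, attnum);

    values[attnum] = fetchatt(thisatt, tp + thisatt->attcacheoff);
    isnull[attnum] = false;
}

/*
 * Only from here on do HeapTupleHeaderGetNatts() and HeapTupleHasNulls()
 * need to have completed, for the remaining attributes.
 */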

> > > > Have you experimented setting isnull[] in a dedicated loop if there are nulls
> > > > and then in this loop just checking isnull[attnum]? Seems like that could
> > > > perhaps be combined with the work in first_null_attr() and be more efficient
> > > > than doing an att_isnull() separately for each column.
> > >
> > > Yes. I experimented with that quite a bit. I wasn't able to make it any
> > > faster than setting the isnull element in the same loop as the
> > > tts_values element. What I did try was having a dedicated tight loop
> > > like: for (int i = attnum; i < firstNullAttr; i++) isnull[i] = false;,
> > > but the compiler would always try to optimise that into an inlined
> > > memset which would result in poorly performing code in cases with a
> > > small number of columns due to the size and alignment prechecks.
> >
> > Yea, that kind of transformation is pretty annoying and makes little sense
> > here :(.
> >
> > I was thinking of actually computing the value of isnull[] based on the null
> > bitmap (as you also try below).
>
> I've taken the code you posted in [1] to do this. Thanks for that. It
> works very well.

Nice!

> I made it so the tts_isnull array size is rounded up to the next multiple of
> 8.

Right, that's what I assumed we'd need.
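
I.e. roughly this, e.g. so an expansion loop that handles a whole bitmap
byte (8 isnull entries) per iteration doesn't need a special case for the
last, partially used byte (macro name invented):

#define NULLS_ARRAY_SIZE(natts)	(((natts) + 7) & ~7)	/* round up to a multiple of 8 */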

> I've attached 3 graphs, which are now looking a bit better. The gcc
> results are not quite as good. There's still a small regression with 0
> extra column test, and overall, the results are not as impressive as
> clang's. I've not yet studied why.

I suspect it's due to gcc thinking it'd be a good idea to vectorize the
loop. I saw that happening on godbolt.

Are your results better if you use

#if defined(__clang__)
#define pg_nounroll _Pragma("clang loop unroll(disable)")
#define pg_novector _Pragma("clang loop vectorize(disable)")
#elif defined(__GNUC__)
#define pg_nounroll _Pragma("GCC unroll 0")
#define pg_novector _Pragma("GCC novector")
#else
/* other compilers: MSVC-style loop pragma, no per-loop unroll control */
#define pg_nounroll
#define pg_novector _Pragma("loop( no_vector )")
#endif

and put "pg_nounroll pg_novector" before the loop in populate_isnull_array()?
That improves both gcc and clang code generation substantially for me, but
with a considerably bigger improvement for gcc.

Compiler  Opt  Insns
gcc       -O2    165
clang     -O2    135
gcc       -O3    532
clang     -O3    135

Preventing vectorization & unrolling:
gcc       -O2     26
clang     -O2     25
gcc       -O3     26
clang     -O3     25

It's somewhat scary to see this level of code size increase in a case where
the compiler really has no information suggesting that vectorizing is
beneficial...
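
To be concrete about the placement, something like the below - the body of
populate_isnull_array() is only my guess at what your function does
(presumably the real one expands a whole bitmap byte at a time, hence the
padded isnull array), but it shows where the pragmas would go; assumes the
pg_nounroll/pg_novector macros above and the usual postgres.h types:

static inline void
populate_isnull_array(bool *isnull, const bits8 *bp, int natts)
{
    pg_nounroll pg_novector
    for (int i = 0; i < natts; i++)
    {
        /* in the heap null bitmap, a set bit means the attribute is not null */
        isnull[i] = (bp[i >> 3] & (1 << (i & 7))) == 0;
    }
}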

I'd expect it makes sense to combine the loop for first_null_attr() with the
one for populate_isnull_array(). It might also prevent gcc from trying to
vectorize the loop...
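
Extending the sketch above (again only illustrative, the signature is made
up): fill isnull[] and remember the first attribute that actually is null in
the same pass, so a separate first_null_attr() scan isn't needed:

static inline int
populate_isnull_array(bool *isnull, const bits8 *bp, int natts)
{
    int     first_null = natts;

    pg_nounroll pg_novector
    for (int i = 0; i < natts; i++)
    {
        bool    att_is_null = (bp[i >> 3] & (1 << (i & 7))) == 0;

        isnull[i] = att_is_null;
        /* hopefully compiled to a conditional move rather than a branch */
        first_null = (att_is_null && first_null == natts) ? i : first_null;
    }

    return first_null;
}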

Greetings,

Andres Freund
