| From: | Andres Freund <andres(at)anarazel(dot)de> |
|---|---|
| To: | David Rowley <dgrowleyml(at)gmail(dot)com> |
| Cc: | Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: More speedups for tuple deformation |
| Date: | 2026-02-03 00:33:27 |
| Message-ID: | v6z545yozjtywghn5glujemu72z4i4ynadsc2xks4ejotdg7yl@4rry7ixwr4us |
| Lists: | pgsql-hackers |
Hi,
On 2026-02-01 00:27:02 +1300, David Rowley wrote:
> On Sat, 31 Jan 2026 at 06:11, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > This is why I like the idea of keeping track of whether we can rely on NOT
> > NULL columns to be present (I think that means we're evaluating expressions
> > other than constraint checks for new rows). It allows the leading NOT NULL
> > fixed-width columns to be decoded without having to wait for a good chunk of
> > the computations above. That's a performance boon even if we later have
> > nullable or varlength columns.
>
> I can look into this. As we both know, we can't apply this
> optimisation in every case as there are places in the code which form
> then deform tuples before NOT NULL constraints are checked.
Right.
> Perhaps the slot can store a flag to mention if the optimisation is valid to
> apply or not. It doesn't look like the flag can be part of the TupleDesc
> since we cache those in relcache.
I wonder if we should do it the other way round - use a special flag (and
perhaps tuple descriptor) iff we are evaluating "unsanitized" tuples,
i.e. ones where the NOT NULLness might not yet be correct.
> I'm imagining that TupleDescFinalize() calculates another field which could
> be the max cached offset that's got a NOT NULL constraint and isn't
> attmissing. I think this will need another dedicated loop in
> slot_deform_heap_tuple() to loop up to that attribute before doing the
> firstNonCacheOffsetAttr loop.
I was imagining that we'd use the new value to enter the
firstNonCacheOffsetAttr loop without having to depend on
HeapTupleHeaderGetNatts() & HeapTupleHasNulls(). I.e. just use it to avoid
waiting on the memory fetch for the tuple header to complete.
> > > > Have you experimented setting isnull[] in a dedicated loop if there are nulls
> > > > and then in this loop just checking isnull[attnum]? Seems like that could
> > > > perhaps be combined with the work in first_null_attr() and be more efficient
> > > > than doing an att_isnull() separately for each column.
> > >
> > > Yes. I experimented with that quite a bit. I wasn't able to make it any
> > > faster than setting the isnull element in the same loop as the
> > > tts_values element. What I did try was having a dedicated tight loop
> > > like: for (int i = attnum; i < firstNullAttr; i++) isnull[i] = false;,
> > > but the compiler would always try to optimise that into an inlined
> > > memset which would result in poorly performing code in cases with a
> > > small number of columns due to the size and alignment prechecks.
> >
> > Yea, that kind of transformation is pretty annoying and makes little sense
> > here :(.
> >
> > I was thinking of actually computing the value of isnull[] based on the null
> > bitmap (as you also try below).
>
> I've taken the code you posted in [1] to do this. Thanks for that. It
> works very well.
Nice!
> I made it so the tts_isnull array size is rounded up to the next multiple of
> 8.
Right, that's what I assumed we'd need.
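Just to illustrate why the rounding matters (a minimal sketch, not the actual patch code; the function name and layout are assumptions): with tts_isnull padded to a multiple of 8, each null-bitmap byte can be expanded into 8 bool entries without a tail bounds check. Remember that in heap tuples a *set* bit means the attribute is NOT null, so we invert while expanding:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch: expand a heap-tuple style null bitmap into a bool
 * array.  Because the output array is sized to a multiple of 8, the last
 * bitmap byte can be expanded in full, with no per-bit bounds check.
 */
static void
expand_null_bitmap(const uint8_t *bits, bool *isnull, int natts)
{
	int			nbytes = (natts + 7) / 8;

	for (int byteno = 0; byteno < nbytes; byteno++)
	{
		uint8_t		b = bits[byteno];

		/* a clear bit means the attribute is null */
		for (int bitno = 0; bitno < 8; bitno++)
			isnull[byteno * 8 + bitno] = ((b >> bitno) & 1) == 0;
	}
}
```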
> I've attached 3 graphs, which are now looking a bit better. The gcc
> results are not quite as good. There's still a small regression with 0
> extra column test, and overall, the results are not as impressive as
> clang's. I've not yet studied why.
I suspect it's due to gcc thinking it'd be a good idea to vectorize the
loop. I saw that happening on godbolt.
Are your results better if you use

```c
#if defined(__clang__)
#define pg_nounroll _Pragma("clang loop unroll(disable)")
#define pg_novector _Pragma("clang loop vectorize(disable)")
#elif defined(__GNUC__)
#define pg_nounroll _Pragma("GCC unroll 0")
#define pg_novector _Pragma("GCC novector")
#elif defined(_MSC_VER)
#define pg_nounroll
#define pg_novector _Pragma("loop( no_vector )")
#else
#define pg_nounroll
#define pg_novector
#endif
```

and put "pg_nounroll pg_novector" before the loop in populate_isnull_array()?
That improves both gcc and clang code generation substantially for me, but
with a considerably bigger improvement for gcc.
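Applied to a loop, that looks like this (the macros are repeated so the sketch compiles on its own; fill_false() is an illustrative stand-in, not the actual populate_isnull_array()):

```c
#include <stdbool.h>

/* Compiler-specific pragmas to suppress unrolling and vectorization. */
#if defined(__clang__)
#define pg_nounroll _Pragma("clang loop unroll(disable)")
#define pg_novector _Pragma("clang loop vectorize(disable)")
#elif defined(__GNUC__)
#define pg_nounroll _Pragma("GCC unroll 0")
#define pg_novector _Pragma("GCC novector")
#else
#define pg_nounroll
#define pg_novector
#endif

/* Stand-in loop: clear isnull[] over a small range.  With the pragmas
 * the compiler emits a plain byte loop instead of an inlined memset with
 * size/alignment prechecks. */
static void
fill_false(bool *isnull, int from, int to)
{
	pg_nounroll pg_novector
	for (int i = from; i < to; i++)
		isnull[i] = false;
}
```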
| Compiler | Opt | Insns |
|---|---|---|
| gcc | -O2 | 165 |
| clang | -O2 | 135 |
| gcc | -O3 | 532 |
| clang | -O3 | 135 |

Preventing vectorization & unrolling:

| Compiler | Opt | Insns |
|---|---|---|
| gcc | -O2 | 26 |
| clang | -O2 | 25 |
| gcc | -O3 | 26 |
| clang | -O3 | 25 |
It's somewhat scary to see this level of code size increase in a case where
the compiler has no information suggesting that vectorizing is actually
beneficial...
I'd expect it makes sense to combine the loop for first_null_attr() with the
one for populate_isnull_array(). It might also prevent gcc from trying to
vectorize the loop...
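As a sketch of that combination (the function name, signature, and body are assumptions, not patch code): walk the bitmap once, filling isnull[] and recording the first attribute that is actually null, so a separate first_null_attr() pass isn't needed:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical combined loop: populate isnull[] from a heap-tuple style
 * null bitmap (clear bit == null) and return the index of the first null
 * attribute, or natts if none is null.
 */
static int
fill_isnull_and_find_first_null(const uint8_t *bits, bool *isnull, int natts)
{
	int			first_null = natts;

	for (int i = 0; i < natts; i++)
	{
		bool		nul = ((bits[i / 8] >> (i % 8)) & 1) == 0;

		isnull[i] = nul;
		if (nul && first_null == natts)
			first_null = i;
	}
	return first_null;
}
```

The data dependence on first_null may also discourage gcc from vectorizing, though that would need to be checked against the generated code.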
Greetings,
Andres Freund