Quick Links

Re: unnecessary executor overheads around seqscans

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc:	pgsql-hackers(at)postgresql(dot)org, Amit Langote <amitlangote09(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>
Subject:	Re: unnecessary executor overheads around seqscans
Date:	2026-01-27 17:02:59
Message-ID:	lp4e5zipwolqcmnxvvlpx2ylmkjgip7cvbaud55dqdydfobupx@itmbp24b4ue2
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

On 2026-01-27 12:16:15 +0200, Heikki Linnakangas wrote:
> On 27/01/2026 01:25, Andres Freund wrote:
> > On 2026-01-26 17:23:16 +0200, Heikki Linnakangas wrote:
> > > Perhaps we should turn the ExecScanExtended() function inside out. Instead
> > > of passing SeqNext as a callback to ExecScanExtended(), we would have a
> > > function like this (for illustration purposes only, doesn't compile):
> >
> > That would be one approach, would require structural changes in a fair number
> > of places though :/.
>
> ExecScanExtended is only used in execScan.c and nodeSeqScan.c, so not that
> many places. Even replacing ExecScan() completely would be isolated to
> src/backend/executor.

I was assuming we'd not want a totally different structure for nodeSeqScan.c
than for the rest. I think we really ought to use ExecScanExtended
everywhere. Probably not with multiple callsites (for no qual, no projection
etc), but to get rid of the indirect function calls to "accessMtd".

> > What shows up more visibly afterwards is that we set ExecScanExtended() sets
> > econtext->ecxt_scantuple in every iteration, despite that not changing in
> > almost all cases (I think for FDWs it could, fdwhandler.sgml just says that
> > the scanslot "should" be used).
>
> Huh, that's just a single store instruction. This really is a hot path. Or
> is it just that the first instruction after the function call causes some
> kind of a stall or data dependency, i.e. if that was removed, it'd just move
> to the next instruction instead?

I was a bit surprised too. It might be a case of some store forwarding issue,
because very shortly afterwards we end up reading ecxt_scantuple (at the start
of expression evaluation). Or we're just hitting limits on the number of
in-flight stores.

> I wonder about the access to (econtext)->ecxt_per_tuple_memory. That never
> changes, but we're re-fetching that too on every iteration.

The load itself didn't show up crazily, but there's something related that
does show up quite prominently: ResetExprContext(). I briefly hacked upan
inline version of MemoryContextReset(), and that does help noticeably.

Another context related thing that does show up is all the switcheroo around
CurrentMemoryContext.

> InstrCountFiltered1() also fetches node->instrument, with a NULL check, on
> every iteration. Maybe keep a counter in a local variable and only call
> InstrCountFiltered1() when exiting the loop.

> I guess these don't show up in a profiler or you would've latched on them
> already..

It does show up, but only really after addressing the other things... There
are plenty more things that show up, fwiw:

- Expression evaluation checks for NULL function args, even though we could
know that they should not be NULL (this is complicated by the fact that
sometimes we build "illegal" tuples, before constraint checking). That shows
up surprisingly high.

I think if we added information about whether NOT NULL is reliable in
tupledescs it could solve this.

While it'd allow optimizing some cases, it'd still be limited due strict
function calls not being guaranteed to return NOT NULL.

- heap_getnextslot()'s calls to heapgettup_pagemode() should be inlined

- we probably should eventually inline ExecStoreBufferHeapTuple(),
that way we would only need to handle the "buffer has changed" paths when
heapgettup_pagemode() switches pages

- Expression evaluation has a too high startup overhead. This is partially due
to it preparing to access all types of slots, which most of the time aren't
needed.

This would be a lot easier if we didn't need to go through the indirection
via ExprContext to find the right scan/outer/... slot during every
evaluation. But unfortunately there are some callsites that do change the
slot between executions. Perhaps we could make it a per-expression
operation to change the slot of an expression, which would update state
inside the ExprState?

- The access to columns in slots is too expensive during expression
evaluation. We go from the slot, via an indirect memory access, to the
values array, then have to fetch the right element there. Then do the same
for nulls.

I think to really do better we'd need to do two things:

- Use one NullableDatum array instead of two separate arrays for values and
isnull.

- Store the array in place in the slot, as a flexible array member). That
makes storing of per-slot-type data harder though. We probably would need
to embed the "common" part of the slot at the end of the custom part of
the slot, which is more complicated...

That's a lot of work, I tried it before... There's a fair bit of fallout due
to all the places that do stuff like forming tuples based on
tts_values/tts_isnull.

I do wonder if it could already help to just store the offset to the values
and isnull arrays inside the slot, instead of having to read pointers and
then do address calculation based on those.

- The function call interface is way too expensive for simple operators

We need a simplified function call interface that doesn't shuffle everything
through pointers...

Greetings,

Andres Freund

In response to

Re: unnecessary executor overheads around seqscans at 2026-01-27 10:16:15 from Heikki Linnakangas

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Robert Haas	2026-01-27 17:11:23	Re: pgsql: Prevent invalidation of newly synced replication slots.
Previous Message	Sami Imseih	2026-01-27 16:53:41	Re: Flush some statistics within running transactions