Re: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Radim Marek <radim(at)boringsql(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
Date: 2026-02-03 10:57:01
Message-ID: f035dbd6-e8a7-4660-a54d-75a01f43fc32@vondra.me
Lists: pgsql-hackers

On 2/2/26 23:54, Radim Marek wrote:
> Hello,
>
> First, let me say this is my first bug report and first attempt at
> patching PostgreSQL, fresh after the hacking webinar. Please let me
> know if I need to clarify anything.
>
> TL;DR: EXPLAIN ANALYZE BUFFERS reports different buffer counts for the
> Sort node between the first and subsequent executions in a session.
>
>          Buffers: shared hit=4
>          ->  Seq Scan on sort_buffer_test  (cost=0.00..2.50 rows=1
> width=32) (actual time=0.042..0.126 rows=1.00 loops=1)
>                Filter: (lower(email) = 'email_1(at)example(dot)com'::text)
>                Rows Removed by Filter: 99
>                Buffers: shared hit=1
>
> vs
>
>          Buffers: shared hit=1
>          ->  Seq Scan on sort_buffer_test  (cost=0.00..2.50 rows=1
> width=32) (actual time=0.021..0.057 rows=1.00 loops=1)
>                Filter: (lower(email) = 'email_1(at)example(dot)com'::text)
>                Rows Removed by Filter: 99
>                Buffers: shared hit=1
>
> I found it while working on an SQL regression tool (regresql) that
> uses the total buffers reported by EXPLAIN ANALYZE BUFFERS to detect
> deviations in query performance.
>
> The issue surfaced on smaller queries, where test cases showed a
> consistent "improvement" in performance (fewer buffers) when comparing
> baseline and actual test runs. I had not noticed it on larger data
> sets because the difference in buffers was negligible relative to the
> larger operations.
>
> The example above is the test case I managed to isolate.
>
> --- how to reproduce
>
>   CREATE TABLE sort_buffer_test (id serial PRIMARY KEY, val int);
>   INSERT INTO sort_buffer_test (val) SELECT generate_series(1, 100);
>   ANALYZE sort_buffer_test;
>
>   -- run twice in a new psql session and observe the reported buffer counts change
>   EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF)
>   SELECT * FROM sort_buffer_test ORDER BY val DESC LIMIT 1;
>
> ---
>
> This behaviour is session-specific; tested on 17.x, 18.x and the
> current development version (just FYI, all verified on both amd64 and
> aarch64).
>
> For ORDER BY, I traced the issue down to pg_amop/pg_amproc lookups.
> The specific lookups vary by operation, but the pattern is the same:
> the first execution incurs catalog reads that get cached for
> subsequent runs. The same applies to DISTINCT (sorted, not hashed),
> GROUP BY, window functions, etc. - which seems to isolate it to the
> Sort node.
>
> All of these, I believe, are really just metadata resolution (i.e.
> the planner already knows what to sort by).
>

Correct. AFAICS this happens because the execution may need additional
metadata that was not needed for planning. In this particular case the
planning checks that the operators are valid/compatible, but does not
look up all the associated pg_amproc entries. That only happens during
execution, when actually constructing the "tuplesort" object.

But there are probably many similar cases.
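
The session-level effect can be mimicked with a toy cache - purely
illustrative names, not PostgreSQL's actual syscache code: the first
lookup of an entry pays for a catalog read, later lookups in the same
"session" are served from the cache.

```python
# Toy model of the session-local caching described above: the first
# lookup of a pg_amproc-style entry triggers a catalog read (which
# surfaces as extra Buffers in EXPLAIN), later lookups do not.
class ToySyscache:
    def __init__(self):
        self._cache = {}
        self.catalog_reads = 0

    def lookup(self, key):
        if key not in self._cache:
            self.catalog_reads += 1            # cold: read the catalog
            self._cache[key] = ("entry", key)  # cached for this session
        return self._cache[key]

session = ToySyscache()
session.lookup(("pg_amproc", "btint4sortsupport"))  # first execution
reads_first = session.catalog_reads                 # -> 1
session.lookup(("pg_amproc", "btint4sortsupport"))  # second execution
reads_second = session.catalog_reads - reads_first  # -> 0
```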

> Having said that - I'm happy to try to fix it. I'm just wondering
> what the right direction is:
>
> 1. Pre-warm the syscache during planning (I tried this locally and it
> works - but it does not feel conceptually right and is rather 'hacky')

Yeah, not going to happen. We'd have to preload the whole syscache,
because during planning we don't even know if that particular path/plan
will win. So we only do the bare minimum, to minimize the cost of
planning. And then we initialize the additional fields later, once we
have the cheapest plan.

> 2. Pre-compute and store the metadata, modifying each plan node
> struct for each node type (lots of work). I believe that's what
> MergeJoin already does.

I suppose MergeJoin may simply need more metadata during planning. If it
does more than that, it might count as a bug (and it should do less).

> 3. Or this is "works as expected" - which I don't think is the case,
> and which would force me to do pre-warming on the regresql side
>

I think it mostly works as expected/designed. That however does not mean
the issue you pointed out does not exist. IMHO what regresql is aiming
to do would be quite handy - e.g. in the index prefetching patch we have
regression tests checking that we're not accessing an unexpected number
of buffers (or more buffers than master).

I'm planning to check if regresql would be a more elegant way to do
that. For now we're relying on plain regression tests (SQL + expected
output), and that happens to not have this issue because it executes the
queries in a session with a "deterministic" state.
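
FWIW a measuring tool could sidestep the cold-cache noise by discarding
a warm-up execution and then comparing a single inclusive counter - in
text output the top plan node's Buffers line already includes all child
nodes. A rough sketch with hypothetical helper names, not regresql's
actual code:

```python
import re

def top_node_shared_buffers(plan_text: str) -> int:
    """Extract shared hit+read from the first 'Buffers:' line of an
    EXPLAIN (ANALYZE, BUFFERS) text plan; the top node's counters
    include those of all child nodes."""
    for line in plan_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("Buffers: shared"):
            return sum(int(n) for n in re.findall(r"(?:hit|read)=(\d+)", stripped))
    raise ValueError("no Buffers line in plan output")

def measured_buffers(run_explain, warmup_runs=1):
    """Run the query warmup_runs times to prime the session caches,
    then return the buffer count of the next, cache-warm run."""
    for _ in range(warmup_runs):
        run_explain()                       # discard the cold run(s)
    return top_node_shared_buffers(run_explain())

# Miniature version of the first-vs-second run difference reported above:
plans = iter([
    "Sort\n  Buffers: shared hit=4\n  ->  Seq Scan\n        Buffers: shared hit=1",
    "Sort\n  Buffers: shared hit=1\n  ->  Seq Scan\n        Buffers: shared hit=1",
])
assert measured_buffers(lambda: next(plans)) == 1   # warm run, not the cold 4
```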

regards

--
Tomas Vondra
