Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS

From: Radim Marek <radim(at)boringsql(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
Date: 2026-02-02 22:54:24
Message-ID: CAJgoLkLCCKy1hAFpoUzwaRq91cA=m=-J03MBpUo-d7-oXy6MzA@mail.gmail.com
Lists: pgsql-hackers

Hello,

First, let me say this is my first bug report and first attempt at patching
PostgreSQL, fresh after the hacking webinar. Please let me know if I need
to clarify anything.

TL;DR: EXPLAIN (ANALYZE, BUFFERS) reports different buffer counts for the
Sort node between the first and subsequent executions in a session. First
execution:

Buffers: shared hit=4
-> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1
width=32) (actual time=0.042..0.126 rows=1.00 loops=1)
Filter: (lower(email) = 'email_1(at)example(dot)com'::text)
Rows Removed by Filter: 99
Buffers: shared hit=1

vs every subsequent execution in the same session, where the upper node's
count drops from shared hit=4 to shared hit=1 while the Seq Scan's own
count stays at 1:

Buffers: shared hit=1
-> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1
width=32) (actual time=0.021..0.057 rows=1.00 loops=1)
Filter: (lower(email) = 'email_1(at)example(dot)com'::text)
Rows Removed by Filter: 99
Buffers: shared hit=1

I found it while working on an SQL regression tool (regresql) that uses
the total buffer counts reported by EXPLAIN (ANALYZE, BUFFERS) to detect
deviations in query performance.

The issue showed up on smaller queries, where test cases reported a
consistent "improvement" in performance (fewer buffers) when comparing
baseline and actual test runs. I had not noticed it on larger data sets
because a difference of a few buffers is negligible next to the larger
operations.

The following is the test case I managed to isolate.

--- how to reproduce

CREATE TABLE sort_buffer_test (id serial PRIMARY KEY, val int);
INSERT INTO sort_buffer_test (val) SELECT generate_series(1, 100);
ANALYZE sort_buffer_test;

-- run twice in a new psql session and observe the reported buffer counts change
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF)
SELECT * FROM sort_buffer_test ORDER BY val DESC LIMIT 1;

---
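
For reference, tooling like regresql reads these totals programmatically
rather than parsing the text output; a minimal sketch of one way to do
that, using EXPLAIN's JSON format, whose per-node "Shared Hit Blocks"
field carries the cumulative counts (an illustration only, not necessarily
what regresql does internally):

EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF, FORMAT JSON)
SELECT * FROM sort_buffer_test ORDER BY val DESC LIMIT 1;
-- the top plan node's "Shared Hit Blocks" is the total that differs
-- between the first and second run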

This behaviour is session-specific; I tested it on 17.x, 18.x and the
current development version (FYI, all verified on both amd64 and aarch64).

For ORDER BY, I traced the likely cause to the pg_amop/pg_amproc lookups.
The specific lookups vary by operation, but the pattern is the same: the
first execution incurs catalog reads that get cached for subsequent runs.
The same applies to DISTINCT (sorted, not hashed), GROUP BY, window
functions, etc., which seems to isolate it to the Sort node.

All of those are, I believe, pure metadata resolution (i.e. the planner
already knows what to sort by).
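
To illustrate what gets resolved (the executor does this via the syscache,
not via SQL; the queries below are only an approximation of the lookups
for an int4 btree sort):

-- ordering operators for int4 in the default btree opclass
SELECT amopstrategy, amopopr::regoperator
FROM pg_amop
WHERE amopfamily = (SELECT opcfamily FROM pg_opclass
                    WHERE opcname = 'int4_ops'
                      AND opcmethod = (SELECT oid FROM pg_am
                                       WHERE amname = 'btree'))
  AND amoplefttype = 'int4'::regtype
  AND amoprighttype = 'int4'::regtype;

-- support (comparison) functions from the same operator family
SELECT amprocnum, amproc::regproc
FROM pg_amproc
WHERE amprocfamily = (SELECT opcfamily FROM pg_opclass
                      WHERE opcname = 'int4_ops'
                        AND opcmethod = (SELECT oid FROM pg_am
                                         WHERE amname = 'btree'))
  AND amproclefttype = 'int4'::regtype
  AND amprocrighttype = 'int4'::regtype;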

Having said that, I'm happy to try to fix it. I'm just wondering which
direction is the right one:

1. Pre-warm the syscache during planning (tried locally and it works, but
it does not feel conceptually right and is rather hacky)
2. Pre-compute and store the metadata, modifying each plan node struct for
each type (lots of work). I believe that is what MergeJoin already does.
3. Or this is 'works as expected', which I don't think is the case; it
would force me to do pre-warming on the regresql side (roughly as sketched
below)
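
For completeness, the tool-side workaround would be something like one
discarded warm-up execution per session (a sketch of the idea, not what
regresql currently does):

-- warm-up run: populates the syscache, output discarded
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF)
SELECT * FROM sort_buffer_test ORDER BY val DESC LIMIT 1;

-- measured run: buffer counts are now stable across executions
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF)
SELECT * FROM sort_buffer_test ORDER BY val DESC LIMIT 1;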

Kind regards,

Radim
