Re: pg_buffercache: Add per-relation summary stats

From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, chaturvedipalak1911(at)gmail(dot)com, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Lukas Fittl <lukas(at)fittl(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Paul A Jungwirth <pj(at)illuminatedcomputing(dot)com>, Khoa Nguyen <khoaduynguyen(at)gmail(dot)com>
Subject: Re: pg_buffercache: Add per-relation summary stats
Date: 2026-04-07 13:47:26
Message-ID: bdljrq3k5wm2jzxntpbenzbyozjx4syf43taodgxasc6qbqooo@nvnstsdlhboe
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2026-04-07 16:07:45 +0300, Heikki Linnakangas wrote:
> On 28/03/2026 06:18, Ashutosh Bapat wrote:
> > Parallely myself and Palak Chaturvedi developed a quick patch to
> > modernise pg_buffercache_pages() and use tuplestore so that it doesn't
> > have to rely on NBuffers being the same between start of the scan,
> > when memory allocated, when the scan ends - a condition possible with
> > resizing buffer cache. It seems to improve the timings by about 10-30%
> > on my laptop for 128MB buffercache size. Without this patch the time
> > taken to execute Lukas's query varies between 10-15ms on my laptop.
> > With this patch it varies between 8-9ms. So the timing is more stable
> > as a side effect. It's not a 10x improvement that we are looking for
> > but it looks like a step in the right direction. That improvement
> > seems to come purely because we avoid creating a heap tuple. I wonder
> > if there are some places up in the execution tree where full
> > heaptuples get formed again instead of continuing to use minimal
> > tuples or places where we perform some extra actions that are not
> > required.

I don't think that's the reason for the improvement - tuplestore_putvalues()
forms a minimal tuple, and the cost to form a minimal tuple and a heap tuple
aren't meaningfully different.

I think the problem is that we materialize rowmode SRFs as a tuplestore if
they are in the from list. You can easily see this even with just
generate_series():

postgres[1520825][1]=# SELECT count(*) FROM generate_series(1, 1000000);
┌─────────┐
│ count │
├─────────┤
│ 1000000 │
└─────────┘
(1 row)

Time: 117.939 ms
postgres[1520825][1]=# SELECT count(*) FROM (SELECT generate_series(1, 1000000));
┌─────────┐
│ count │
├─────────┤
│ 1000000 │
└─────────┘
(1 row)

Time: 58.914 ms

Of course, because pg_buffercache_pages() is archaicially defined without
defininig its output columns, you can't actually use it in the select list.

But that can be fixed:

CREATE FUNCTION pg_buffercache_pages_fast(OUT bufferid integer, OUT relfilenode oid, OUT reltablespace oid, OUT reldatabase oid,
OUT relforknumber int2, OUT relblocknumber int8, OUT isdirty bool, OUT usagecount int2,
OUT pinning_backends int4)
RETURNS SETOF RECORD
AS '$libdir/pg_buffercache', 'pg_buffercache_pages'
LANGUAGE C PARALLEL SAFE;

60GB of s_b, mostly filled, with 257c8231bf97a77378f6fedb826b1243f0a41612
reverted.

SELECT count(*) FROM (SELECT pg_buffercache_pages_fast());
Time: 1518.704 ms (00:01.519)

SELECT count(*) FROM pg_buffercache_pages_fast();
Time: 2008.101 ms (00:02.008)

> > I didn't dig into the history to find out why we didn't modernize
> > pg_buffercache_pages(). I don't see any hazard though.
>
> Committed this modernization patch, thanks!
>
> It would be nice to have a proper row-at-a-time mode that would avoid
> materializing the result, but collecting all the data in a temporary array
> is clearly worse than just putting them to the tuplestore directly. The only
> reason I can think of why we'd prefer to use a temporary array like that is
> to get a more consistent snapshot of all the buffers, by keeping the time
> spent scanning the buffers as short as possible. But we're not getting a
> consistent view anyway, it's just a matter of degree.

Seems like a reasonably large difference in degree whether you have a snapshot
collected in one loop, or you do things like spilling a tuplestore to disk in
between.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Matheus Alcantara 2026-04-07 13:52:14 Re: Asynchronous MergeAppend
Previous Message Alexander Korotkov 2026-04-07 13:46:29 Re: Implement waiting for wal lsn replay: reloaded