Re: pg_buffercache: Add per-relation summary stats

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Lukas Fittl <lukas(at)fittl(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Paul A Jungwirth <pj(at)illuminatedcomputing(dot)com>, Khoa Nguyen <khoaduynguyen(at)gmail(dot)com>
Subject: Re: pg_buffercache: Add per-relation summary stats
Date: 2026-06-18 14:09:43
Message-ID: CA+Tgmob_ooei61owb21CbLmAz=CTAFRrTSmXkwT1eg1=CDP_UQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Mar 27, 2026 at 6:59 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> The main argument here seems to be the performance, and the initial
> message demonstrates a 10x speedup (2ms vs. 20ms) on a cluster with
> 128MB shared buffers. Unless I misunderstood what config it uses.

So, my opinion on this point is that the results Lukas shows later in
the thread are compelling. The query takes 13s and writes 1.2GB for
what should be a trivial monitoring query. That seems like it's pretty
clearly enough overhead to be a problem for a monitoring query. The
problem isn't even just that you can't afford to wait 13s for a query
you run every 10m -- it's that the query itself is consuming enough
system resources to skew your other monitoring. For example, if you're
monitoring your system load average or CPU usage or disk usage over
time, you're going to see spikes when this query runs. That's not the
worst thing that has ever happened to anyone, but it's definitely not
great, and I can totally understand someone not being willing to incur
that much overhead. I don't believe we should accept the argument that
this patch doesn't save enough to matter; I think it does.

> Let's assume it's worth it. I wonder what similar summaries might be
> interesting for users. I'd probably want to see a per-database summary,
> especially on a shared / multi-tenant cluster. But AFAICS I can
> calculate that from the pg_buffercache_relations() result, except that
> I'll have to recalculate the usagecount.

I think it would be better to return the total usagecount and let the
caller divide if they want, rather the average.

But more generally, I agree that we don't want something that is
overly specific to one person's use case. Thinking about how to make a
function like this useful to as many people as possible is a
worthwhile activity. I don't know what more we can do that makes
sense. For instance, we could add a database OID argument that can be
NULL or the OID of a database and it filters out everything else. I'm
not sure that would pull its weight -- the big gains are probably
coming from doing the aggregation using bespoke code, rather than
filtering out rows beforehand -- but maybe.

On the whole, I'm inclined to think we should accept this. There's
plenty of cases where it won't save much, and it is also true that it
would be nice to improve the core infrastructure so that queries like
this can be better-optimized. But I don't think that's going to happen
right away, and even when it does happen I bet the savings from a
patch like this will still be pretty significant. I also believe that
aggregating the pg_buffercache results by relation is probably a very
common use case, so it's doesn't seem to me as though there would be
ten other equally-compelling versions of this.

> One thing we lose by doing ad hoc aggregation (instead of just relying
> on the regular SQL aggregation operators) is lack of memory limit.
> There's a simple in-memory hash table, no spilling to disk etc. The
> simple pg_buffercache view does not have this issue, because the
> tuplestore will spill to disk after hitting work_mem. Simplehash won't.
>
> The entries are ~48B, so there would need to be buffers for ~100k
> (relfilenode,forknum) combinations to overflow 4MB. It's not very
> common, but I've seen systems with more relations that this. Would be
> good to show some numbers showing it's not an issue.

This is a good point, but I'm not sure I believe there's a real issue
here. It seems as though the kinds of systems where this function is
likely to be important for performance are probably those with 100GB+
of shared_buffers, so 12m+ buffers, so ... half a gigabyte? Maybe
somewhat more with memory allocation overheads and so forth? I feel
like if you have 100GB of shared_buffers, you probably have work_mem
set to 1GB+, or at least have that much memory free. And even then you
only need that if every single shared buffer belongs to a different
relation, which seems like a thing that will not occur in practice.

Obviously, there are things we could do to limit memory consumption
here. I think the easiest thing might be to just write out the entire
hash table in hash value order to a temporary file every time we
exhaust work_mem, and then do a merge pass over all those temporary
files at the end. I don't think we can plausibly need more than one
merge pass to keep memory usage within acceptable limits, and I don't
think this would need to be a crazy amount of code. But I'm also not
sure I believe we really need it. The concern about code maintenance
that has been raised is valid here as it is for all patches, so we
shouldn't bloat the patch with code that it doesn't really need, and I
think it's worth considering whether spill-to-disk code falls into
that category.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2026-06-18 14:24:03 Re: fix pg_mkdir_p to tolerate concurrent directory creation
Previous Message Matthias van de Meent 2026-06-18 14:08:19 Re: Commit Sequence Numbers and Visibility