Re: pg_buffercache: Add per-relation summary stats

From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>, chaturvedipalak1911(at)gmail(dot)com
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Lukas Fittl <lukas(at)fittl(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Paul A Jungwirth <pj(at)illuminatedcomputing(dot)com>, Khoa Nguyen <khoaduynguyen(at)gmail(dot)com>
Subject: Re: pg_buffercache: Add per-relation summary stats
Date: 2026-03-28 04:18:26
Message-ID: CAExHW5sMsaz1j+hrdhyo-DJp7JCgJx87=q2iJfOc_9mwYWyvmw@mail.gmail.com
Lists: pgsql-hackers

On Sat, Mar 28, 2026 at 4:28 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> On 3/26/26 05:21, Ashutosh Bapat wrote:
> > On Wed, Mar 25, 2026 at 10:19 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >>
> >> On Tue, Mar 24, 2026 at 11:47 PM Lukas Fittl <lukas(at)fittl(dot)com> wrote:
> >>>
> >>> Hi Ashutosh,
> >>>
> >>> On Tue, Mar 24, 2026 at 11:24 PM Ashutosh Bapat
> >>> <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
> >>>> I know we already have a couple of hand-aggregation functions but I am
> >>>> hesitant to add more of these. Question is where do we stop? For
> >>>> example, the current function is useless if someone wants to find the
> >>>> parts of a relation which are hot since it doesn't include page
> >>>> numbers. Do we write another function for the same? Or we add page
> >>>> numbers to this function and then there's hardly any aggregation
> >>>> happening. What if somebody wanted to perform an aggregation more
> >>>> complex than just count() like average number of buffers per relation
> >>>> or distribution of relation buffers in the cache, do they write
> >>>> separate functions?
> >>>
> >>> I think the problem this solves for, which is a very common question I
> >>> hear from end users, is "how much of this table/index is in cache" and
> >>> "was our query slow because the cache contents changed?".
> >>>
> >>> It can't provide a perfect answer to all questions regarding what's in
> >>> the cache (i.e. it won't tell you which part of the table is cached),
> >>> but it's in line with other statistics we already provide in
> >>> pg_stat_user_tables etc., which are all aggregate counts, not further
> >>> breakdowns.
> >>>
> >>> It's also a reasonable compromise on providing something usable that
> >>> can be shown on dashboards, as I've seen in collecting this
> >>> information using the existing methods from small production systems
> >>> in practice over the last ~1.5 years.
> >>
> >> Regarding the proposed statistics, I find them reasonably useful for
> >> many users. I'm not sure we need to draw a strict line on what belongs
> >> in the module. If a proposed function does exactly what most
> >> pg_buffercache users want or are already writing themselves, that is
> >> good enough motivation to include it.
> >>
> >> I think pg_visibility is a good precedent here. In that module, we
> >> have both pg_visibility_map() and pg_visibility_map_summary(), even
> >> though we can retrieve the exact same results as the latter by simply
> >> using the former:
> >>
> >> select sum(all_visible::int), sum(all_frozen::int)
> >> from pg_visibility_map('test');
> >>
> >
> > A summary may still be OK, but this proposal goes a bit farther:
> > it's grouping by one particular subset, which should really be done
> > with GROUP BY in SQL.
> >
> > I am afraid that at some point we will start finding all of these to
> > be a maintenance burden, and at that point removing them will become
> > a real pain for backward-compatibility reasons. For example:
> > 1. The proposed function is going to add one more test to an already
> > huge testing exercise for shared-buffers resizing.
> > 2. If we change the way we manage the buffer cache, e.g. use a
> > tree-based cache instead of hash + array, each function that
> > traverses the buffer cache array is going to add work - adjusting it
> > to the new data structure - and make a hard project even harder. In
> > this case we have other ways to get the summary, so the code-level
> > scan of the buffer cache is entirely avoidable.
> >
> > If I am the only one opposing it, and there are more senior
> > contributors in favour of adding this function, we can accept it.
> >
>
> I understand this argument - we have SQL, which allows us to process the
> data in a flexible way, without hard-coding all interesting groupings.
> The question is whether this particular grouping is special enough to
> warrant a custom *faster* function.
>
> The main argument here seems to be the performance, and the initial
> message demonstrates a 10x speedup (2ms vs. 20ms) on a cluster with
> 128MB shared buffers. Unless I misunderstood what config it uses.
>
> I gave it a try on an Azure VM with 32GB shared buffers, to make it a
> bit more realistic, and my timings are 10ms vs. 700ms. But I also wonder
> whether the original timings really were from a cluster with 128MB,
> because for me that config shows 0.3ms vs. 3ms (an order of magnitude
> faster than what was reported). But I suppose that's also
> hardware-specific.
>
> Nevertheless, it is much faster. I haven't profiled this but I assume
> it's thanks to not having to write the entries into a tuplestore (and
> possibly into a tempfile).

In parallel, Palak Chaturvedi and I developed a quick patch to
modernise pg_buffercache_pages() to use a tuplestore, so that it no
longer has to rely on NBuffers being the same at the start of the
scan, when memory is allocated, and at the end of the scan - a
condition that can be violated when the buffer cache is resized. It
seems to improve the timings by about 10-30% on my laptop with a 128MB
buffer cache. Without this patch, the time taken to execute Lukas's
query varies between 10-15ms on my laptop; with it, the time varies
between 8-9ms, so the timing is also more stable as a side effect.
It's not the 10x improvement we are looking for, but it looks like a
step in the right direction. The improvement seems to come purely from
avoiding the creation of a heap tuple per buffer. I wonder whether
there are places higher up in the execution tree where full heap
tuples get formed again instead of continuing to use minimal tuples,
or where we perform extra actions that are not required.
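
For context, the kind of per-relation aggregation being discussed can
be written in plain SQL over pg_buffercache. A sketch of such a
hand-aggregation (not necessarily Lukas's exact query; the filtering
to the current database is my assumption):

```sql
-- Count cached buffers per relation in the current database.
-- Sketch only; join conditions simplified for illustration.
SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase = (SELECT oid FROM pg_database
                       WHERE datname = current_database())
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;
```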

I didn't dig into the history to find out why pg_buffercache_pages()
was never modernized, but I don't see any hazard in doing so.

Lukas's patch allocates the hash table entirely in memory, whereas a
tuplestore restricts its memory usage to work_mem. So that patch might
make the function use more memory than the user expects once the hash
table grows beyond work_mem.
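
Since the tuplestore path is bounded by work_mem, its memory use can be
steered from SQL. A minimal sketch, assuming the patched
pg_buffercache_pages():

```sql
-- With a tuplestore-based pg_buffercache_pages(), memory for the
-- materialized result set is capped at work_mem; beyond that it
-- spills to a temporary file rather than growing without bound.
SET work_mem = '4MB';
SELECT count(*) FROM pg_buffercache;
RESET work_mem;
```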

--
Best Wishes,
Ashutosh Bapat

Attachment Content-Type Size
v20260328-0001-pg_buffercache_pages-modernization-and-opt.patch text/x-patch 11.3 KB
