Re: [Patch] Optimize dropping of relation buffers using dlist

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: "k(dot)jamison(at)fujitsu(dot)com" <k(dot)jamison(at)fujitsu(dot)com>
Cc: 'Andres Freund' <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, 'Konstantin Knizhnik' <knizhnik(at)garret(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [Patch] Optimize dropping of relation buffers using dlist
Date: 2020-08-06 21:33:34
Message-ID: 20200806213334.3bzadeirly3mdtzl@development
Lists: pgsql-hackers

On Thu, Aug 06, 2020 at 01:23:31AM +0000, k(dot)jamison(at)fujitsu(dot)com wrote:
>On Saturday, August 1, 2020 5:24 AM, Andres Freund wrote:
>Thank you for your constructive review and comments.
>Sorry for the late reply.
>> Hi,
>> On 2020-07-31 15:50:04 -0400, Tom Lane wrote:
>> > Andres Freund <andres(at)anarazel(dot)de> writes:
>> > > Indeed. The buffer mapping hashtable already is visible as a major
>> > > bottleneck in a number of workloads. Even in readonly pgbench if s_b
>> > > is large enough (so the hashtable is larger than the cache). Not to
>> > > speak of things like a cached sequential scan with a cheap qual and wide
>> rows.
>> >
>> > To be fair, the added overhead is in buffer allocation not buffer lookup.
>> > So it shouldn't add cost to fully-cached cases. As Tomas noted
>> > upthread, the potential trouble spot is where the working set is
>> > bigger than shared buffers but still fits in RAM (so there's no actual
>> > I/O needed, but we do still have to shuffle buffers a lot).
>> Oh, right, not sure what I was thinking.
>> > > Wonder if the temporary fix is just to do explicit hashtable probes
>> > > for all pages iff the size of the relation is < s_b / 500 or so.
>> > > That'll address the case where small tables are frequently dropped -
>> > > and dropping large relations is more expensive from the OS and data
>> > > loading perspective, so it's not gonna happen as often.
>> >
>> > Oooh, interesting idea. We'd need a reliable idea of how long the
>> > relation is (preferably without adding an lseek call), but maybe
>> > that's do-able.
>> IIRC we already do smgrnblocks nearby, when doing the truncation (to figure out
>> which segments we need to remove). Perhaps we can arrange to combine the
>> two? The layering probably makes that somewhat ugly :(
>> We could also just use pg_class.relpages. It'll probably mostly be accurate
>> enough?
>> Or we could just cache the result of the last smgrnblocks call...
>> One of the cases where this type of strategy is most interesting to me is the partial
>> truncations that autovacuum does... There we even know the range of tables
>> ahead of time.
>Konstantin tested it on various workloads and saw no regression.

Unfortunately Konstantin did not share any details about which workloads
he tested, which config, etc. But I find the "no regression" hypothesis
rather hard to believe, because we're adding a non-trivial amount of code
to a place that can be quite hot.

And I can trivially reproduce a measurable (and significant) regression
using a very simple pgbench read-only test, with an amount of data that
exceeds shared buffers but fits into RAM.

The following numbers are from an x86_64 machine with 16 cores (32 w HT),
64GB of RAM, and 8GB shared buffers, using pgbench scale 1000 (so 16GB,
i.e. twice the SB size).

With simple "pgbench -S" tests (warmup and then 15 x 1-minute runs with
1, 8 and 16 clients - see the attached script for details) I see this:

              1 client    8 clients   16 clients
   master        38249       236336       368591
   patched       35853       217259       349248
                   -6%          -8%          -5%

This is the average of the runs, but the conclusions for the medians are
almost exactly the same.

>But I understand the sentiment on the added overhead on BufferAlloc.
>Regarding the case where the patch would potentially affect workloads
>that fit into RAM but not into shared buffers, could one of Andres'
>suggested ideas above address that, in addition to this patch's
>possible shared invalidation fix? Could that settle the added overhead
>in BufferAlloc() as a temporary fix?

Not sure.

>Thomas Munro is also working on caching relation sizes [1], maybe that
>way we could get the latest known relation size. Currently, it's
>possible only during recovery in smgrnblocks.

It's not clear to me how knowing the relation size would help reduce
the overhead of this patch.

Can't we somehow identify cases when this optimization might help, and
only actually enable it in those cases? Like during recovery, with a lot
of truncates, or something like that.


Tomas Vondra
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment     Content-Type        Size
               application/x-sh    602 bytes
master.csv     text/csv            645 bytes
patched.csv    text/csv            645 bytes
