Re: mosbench revisited

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-04 01:59:57
Message-ID: CA+TgmobWi_tFQAFX13VryaW3ZoSxRxVQOebOPLb0SGNEeLZhuw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 3, 2011 at 6:21 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
> On Aug 3, 2011, at 1:21 PM, Robert Haas wrote:
>> 1. "We configure PostgreSQL to use a 2 Gbyte application-level cache
>> because PostgreSQL protects its free-list with a single lock and thus
>> scales poorly with smaller caches."  This is a complaint about
>> BufFreeList lock which, in fact, I've seen as a huge point of
>> contention on some workloads.  In fact, on read-only workloads, with
>> my lazy vxid lock patch applied, this is, I believe, the only
>> remaining unpartitioned LWLock that is ever taken in exclusive mode;
>> or at least the only one that's taken anywhere near often enough to
>> matter.  I think we're going to do something about this, although I
>> don't have a specific idea in mind at the moment.
>
> This has been discussed before: http://archives.postgresql.org/pgsql-hackers/2011-03/msg01406.php (which itself references 2 other threads).
>
> The basic idea is: have a background process that proactively moves buffers onto the free list so that backends should normally never have to run the clock sweep (which is rather expensive). The challenge there is figuring out how to get stuff onto the free list with minimal locking impact. I think one possible option would be to put the freelist under its own lock (IIRC we currently use it to protect the clock sweep as well). Of course, that still means the free list lock could be a point of contention, but presumably it's far faster to add or remove something from the list than it is to run the clock sweep.

Based on recent benchmarking, I'm going to say "no". It doesn't seem
to matter how short you make the critical section: a single
program-wide mutex is a loser. Furthermore, the "free list" is a
joke, because it's nearly always going to be completely empty. We
could probably just rip that out and use the clock sweep and never
miss it, but I doubt it would improve performance much.

I think what we probably need to do is have multiple clock sweeps in
progress at the same time. So, for example, if you have 8GB of
shared_buffers, you might have 8 mutexes, one for each GB. When a
process wants a buffer, it locks one of the mutexes and sweeps through
that 1GB partition. If it finds a buffer before returning to the
point at which it started the scan, it's done. Otherwise, it releases
its mutex, grabs the next one, and continues on until it finds a free
buffer.
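
To make that concrete, here is a toy sketch in Python. It is not
PostgreSQL code: the class, its methods, and the cap of 5 on the usage
count are all made up for illustration, and a real version would also
have to deal with buffer header locks and pinning races.

```python
import threading

MAX_USAGE = 5  # assumed cap on the per-buffer usage count (toy value)


class PartitionedClockSweep:
    """Toy buffer pool split into partitions, each with its own lock and hand."""

    def __init__(self, n_buffers, n_partitions):
        if n_buffers % n_partitions != 0:
            raise ValueError("n_buffers must be a multiple of n_partitions")
        self.n_partitions = n_partitions
        self.part_size = n_buffers // n_partitions
        self.usage = [0] * n_buffers        # usage count per buffer
        self.pinned = [False] * n_buffers   # pinned buffers are never evicted
        self.locks = [threading.Lock() for _ in range(n_partitions)]
        self.hands = [0] * n_partitions     # clock hand, as offset in partition

    def touch(self, idx):
        """Record an access to buffer idx (unsynchronized in this toy)."""
        self.usage[idx] = min(self.usage[idx] + 1, MAX_USAGE)

    def _sweep_partition(self, p):
        """One full revolution of partition p. Caller must hold locks[p]."""
        base = p * self.part_size
        for _ in range(self.part_size):
            idx = base + self.hands[p]
            self.hands[p] = (self.hands[p] + 1) % self.part_size
            if self.pinned[idx]:
                continue
            if self.usage[idx] == 0:
                self.usage[idx] = 1  # toy choice: mark the victim as just used
                return idx
            self.usage[idx] -= 1
        return None  # back at the starting point without finding a victim

    def get_victim(self, start_partition):
        """Sweep one partition at a time, moving on when a full pass fails."""
        # Enough passes for every usage count to decay to zero.
        attempts = self.n_partitions * (MAX_USAGE + 1)
        for step in range(attempts):
            p = (start_partition + step) % self.n_partitions
            with self.locks[p]:  # only this partition's mutex is held
                victim = self._sweep_partition(p)
            if victim is not None:
                return victim
        raise RuntimeError("no unpinned buffer available")
```

Each backend would pass its own starting partition (say, derived from
its process number) so that concurrent sweeps mostly take different
mutexes. The retry bound is an addition of the sketch, there only so
that the toy cannot spin forever when everything is pinned.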

The difficulty with any modification in this area is that pretty much
any increase in parallelism is potentially going to reduce the quality
of buffer replacement to some degree. So the trick will be to squeeze
out as much concurrency as possible while minimizing that degradation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
