Reducing spinlock acquisition within clock sweep loop

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Reducing spinlock acquisition within clock sweep loop
Date: 2015-04-21 19:46:04
Message-ID: CAHyXU0x_cRPcBXqNeMye_tEZW2Dsy1T0cCDLW+EWQjH_nH6KnQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Background:
The main sources of contention, buffer_strategy_lock, has been removed
FWICT via 5d7962c6 and d72731a7. However, during the sweep each tick
locks the buffer header via spinlock in order to to adjust
usage_count. I believe that removing this lock is possible
optimization fruit. I doubt that this is a major contention point for
most workloads (over the years, I've only seen one or two cases where
I suspected clock sweep was getting gummed up by buffer locks) but for
systems that are already under lock duress it's hard to argue that
spamming spinlocks in a tight loop makes things easier for the
hardware that is protecting cachelines. In short, clock sweep heavy
workloads are issuing a lot of atomic ops. Any savings here is good.

Right now usage_count ranges across a small number (currently 0-5),
incrementing upon allocation and decrementing on sweep. Because of
this, each tick of the clock sweep spinlocks the buffer in preparation
of A. the buffer being returned or B. the buffer having a non zero
usage count so that it can be safely decremented. B is potential
optimization fruit, I think.

The idea:
Instead, why not let usage_count rotate around on it's own in a
clock-like fashion? the idea is that, instead of defining usage count
as a fixed offset from zero, it is based on the number of times the
buffer has been touched since the last sweep. In other words, each
time the buffer clock crosses midnight, 'usage count' for the whole
pool drops by one, implicitly. Each time a buffer is touched, it's
usage_count is set to 'completePasses' + 1, but, just like today, is
never allowed to increment past completePasses + 5. So, what used to
be usage_count becomes defined as the distance between the usage_count
and completePasses.

Because of usage_count tracking rotation, calculating what used to be
usage_count requires a bit of logic to arrive at the right number as
rotation happens. I'm pretty sure all the rotation edge cases can be
handled but I'm glossing over them for now.

Clock sweep buffer acquisition currently looks like this:
loop:
tick();
spinlock();
in use? goto loop
usage_count > 0? yes: decrement, goto loop
return buffer;

Proposed buffer acquisition would look like:
loop:
tick();
victim < last_victim? yes: atomic re-fetch completePasses (see below)
usage_count > completePasses? yes: goto loop
try_spinlock(); no lock? goto loop
usage_count > completePasses? yes: goto loop
in use? goto loop
return buffer;

(try_spinlock is an adorned TAS() as a opposed to a full spin). Usage
count is double checked after lock acquisition in case the local copy
is stale.

Bad stuff:
1. Complexity. Is it worth it? Calculating proper 'usage_count'
distance in the face of two rotating numbers requires some clever
coding.

2. Unfortunately, since we've (for very good reasons) got a clock
sweep that can be concurrently engaged by many actors, it's no longer
safe to assume that completePasses can be safely examined outside of
the sweep loop but must intead be checked with every tick. Doing this
with another atomic read would defeat the entire purpose of the patch,
but perhaps we can check it based on a heuristic: it's examined when
victim is less than the last_seen victim (which is locally saved off).

...which would add up to a tiny (zero?) risk of misjudging
usage_distance. The fallout, if it were to happen, would be to
possibly to return a buffer that would have a usage_count = 1 with the
old method.

3. Since proper usage_count, aka distance is managed by being relative
to completePasses, which slightly changes the semantics due to the
fact the entire buffer pool drop in terms of usage_count at once. I'm
doubtful this matters though.

Just spitballing here -- wondering if the idea has legs. I think this
should compatible with Robert's usage_count tweaking with flags FWICT.
I'm inclined to work up a POC patch unless this is unworkable for some
reason. The payoff, for workloads with high sweep activity, would be
drastically reduced spinlock activity and associated cache line
stress.

merlin

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2015-04-21 19:50:07 Re: Row security violation error is misleading
Previous Message Robert Haas 2015-04-21 19:18:03 Re: PATCH: Add 'pid' column to pg_replication_slots