Re: measuring lwlock-related latency spikes

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: measuring lwlock-related latency spikes
Date: 2012-04-01 11:07:05
Message-ID: CA+U5nMJsZXPYfjEq-uJipkDq8vE7M0gQneNLa5AVfxtwwg0vBQ@mail.gmail.com
Lists: pgsql-hackers

On Sun, Apr 1, 2012 at 4:05 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> If I filter for waits greater than 8s, a somewhat different picture emerges:
>
>      2 waited at indexam.c:521 blocked by bufmgr.c:2475
>    212 waited at slru.c:310 blocked by slru.c:526
>
> In other words, some of the waits for SLRU pages to be written are...
> really long.  There were 126 that exceeded 10 seconds and 56 that
> exceeded 12 seconds.  "Painful" is putting it mildly.

Interesting. The total wait contribution from those two factors
exceeds the WALInsertLock wait.

> I suppose one interesting question is to figure out if there's a way I
> can optimize the disk configuration in this machine, or the Linux I/O
> scheduler, or something, so as to reduce the amount of time it spends
> waiting for the disk.  But the other thing is why we're waiting for
> SLRU page writes to begin with.

First, we need to determine that it is the clog where this is happening.

Also, you're assuming this is an I/O issue. I think it's more likely
that this is a lock starvation issue. Shared lock requests continually
queue-jump ahead of a waiting exclusive lock, blocking it for long
periods.
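
For reference, the grant test in LWLockAcquire() (pre-9.2 lwlock.c)
goes roughly like this, simplified from memory. Note that the shared
branch never looks at the wait queue, so new shared acquirers go
straight past a sleeping exclusive waiter:

    /* Simplified sketch; runs while holding the lock's spinlock.
     * "mustwait" decides whether we join the wait queue and sleep.
     */
    if (mode == LW_EXCLUSIVE)
    {
        if (lock->exclusive == 0 && lock->shared == 0)
        {
            lock->exclusive++;      /* granted */
            mustwait = false;
        }
        else
            mustwait = true;        /* queue behind current holders */
    }
    else
    {
        if (lock->exclusive == 0)   /* no check of the wait queue */
        {
            lock->shared++;         /* granted even if an exclusive
                                     * waiter is already queued */
            mustwait = false;
        }
        else
            mustwait = true;
    }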

I would guess the same is true of the index wait: a near-root block
needs an exclusive lock, but is held up by continual index tree
descents taking shared locks on it.

My (fairly old) observation is that the shared lock semantics only
work well when exclusive locks are fairly common. When they are rare,
the semantics work against us.

We should either 1) implement non-queue-jump semantics for certain
cases, or 2) put a limit on the number of queue jumps that can occur
before we let the next exclusive lock proceed. (2) sounds better, but
keeping track might well cause greater overhead; a sketch follows.
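
Replacing the shared branch of the sketch above gives a minimal
version of (2); jumpCount and MAX_QUEUE_JUMPS are invented names
here, not existing fields:

    else    /* mode == LW_SHARED */
    {
        /* Grant only if there is no exclusive holder, and either
         * nobody is queued or we haven't jumped the queue too often.
         */
        if (lock->exclusive == 0 &&
            (lock->head == NULL || lock->jumpCount < MAX_QUEUE_JUMPS))
        {
            if (lock->head != NULL)
                lock->jumpCount++;  /* jumped ahead of a waiter */
            lock->shared++;
            mustwait = false;
        }
        else
            mustwait = true;        /* let the exclusive waiter go */
    }

LWLockRelease() would reset jumpCount whenever it wakes a queued
exclusive waiter. The bookkeeping is one extra test and increment
inside the spinlock, which is exactly the overhead worry above.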

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
