Re: measuring lwlock-related latency spikes

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: measuring lwlock-related latency spikes
Date: 2012-04-01 03:05:27
Message-ID: CA+TgmoY073r4-Hp7C3ULat2oLPFZ=udueC+EPj4T+xS_ZWwiTA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Mar 31, 2012 at 9:29 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I've discovered a bug in my code that was causing it to print at most
> 2 histogram buckets per lwlock, which obviously means that my previous
> results were totally inaccurate.  Ah, the joys of benchmarking.  I
> found the problem when I added code to log a message any time an
> lwlock wait exceeded a certain time threshold, and it fired far more
> often than the previous results would have indicated.  In particular,
> it turns out that long waits for WALInsertLock are extremely common
> and not, as the previous results appeared to indicate, unheard-of.
> I'm rerunning my tests now and will post the updated,
> hopefully-accurate results when that's done.
>
> /me attempts to remove egg from face.

All right, so with the aforementioned bug fixed (see attached, revised
patch), there are now massive latency spikes popping out all over the
place: during my latest 30-minute run, 377 distinct lwlocks took >= 1s
to acquire on at least one occasion.
Some of those locks, of course, had more than one problem event; in
total, somebody waited >= 1s for a lock 4897 times during this test. The
breakdown by wait site and blocking site is below. Note that "blocked by"
identifies whoever had most recently acquired the lock as of the start of
the wait, and is not necessarily solely responsible for the full duration
of the wait, because of shared locks and queue jumping.
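
To picture what is being counted in that breakdown: broadly, the patch
times each blocking lock acquisition, drops the elapsed time into a
histogram bucket kept per lwlock, and remembers the file:line of the most
recent acquirer so that it can be reported as the "blocked by" site.
Here's a throwaway, self-contained sketch of that style of bookkeeping;
the clock, the power-of-two bucketing, and every name in it are invented
for illustration, not lifted from the patch:

#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define NBUCKETS 32

typedef struct DemoLockStats
{
    const char *last_acquirer;              /* file:line that last took the lock */
    uint64_t    wait_histogram[NBUCKETS];   /* bucket n: waits in [2^n, 2^(n+1)) usec */
} DemoLockStats;

/* Drop one observed wait (in microseconds) into its log2 bucket. */
static void
record_wait(DemoLockStats *stats, uint64_t usec)
{
    int         bucket = 0;

    while (usec > 1 && bucket < NBUCKETS - 1)
    {
        usec >>= 1;
        bucket++;
    }
    stats->wait_histogram[bucket]++;
}

int
main(void)
{
    DemoLockStats stats = {0};
    struct timespec t0, t1;
    uint64_t    usec;

    stats.last_acquirer = "xlog.c:909";     /* noted by the previous lock holder */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* ... the blocking lock acquisition would happen here ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    usec = (uint64_t) (t1.tv_sec - t0.tv_sec) * 1000000 +
        (t1.tv_nsec - t0.tv_nsec) / 1000;
    record_wait(&stats, usec);

    printf("waited %llu usec, blocked by %s\n",
           (unsigned long long) usec, stats.last_acquirer);
    return 0;
}

Anyway, the numbers: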

1 waited at heapam.c:1651 blocked by bufmgr.c:2475
1 waited at heapam.c:2844 blocked by heapam.c:2758
1 waited at hio.c:335 blocked by heapam.c:1651
1 waited at hio.c:336 blocked by hio.c:336
1 waited at indexam.c:521 blocked by hio.c:345
1 waited at xlog.c:2090 blocked by xlog.c:2090
2 waited at bufmgr.c:1671 blocked by bufmgr.c:2475
2 waited at indexam.c:521 blocked by heapam.c:3464
2 waited at nbtpage.c:650 blocked by nbtinsert.c:124
2 waited at xlog.c:1502 blocked by xlog.c:2090
2 waited at xlog.c:2241 blocked by xlog.c:1502
3 waited at slru.c:310 blocked by slru.c:404
4 waited at indexam.c:521 blocked by hio.c:335
4 waited at indexam.c:521 blocked by hio.c:336
4 waited at xlog.c:2241 blocked by xlog.c:2090
6 waited at hio.c:336 blocked by heapam.c:2758
12 waited at indexam.c:521 blocked by bufmgr.c:2475
20 waited at xlog.c:2090 blocked by xlog.c:2241
26 waited at heapam.c:2758 blocked by indexam.c:521
29 waited at heapam.c:2758 blocked by heapam.c:2758
80 waited at xlog.c:1502 blocked by xlog.c:2241
89 waited at indexam.c:521 blocked by heapam.c:2758
115 waited at varsup.c:65 blocked by varsup.c:65
1540 waited at slru.c:310 blocked by slru.c:526
2948 waited at xlog.c:909 blocked by xlog.c:909

xlog.c:909 is the LWLockAcquire of WALInsertLock from within
XLogInsert. slru.c:310 is in SimpleLruWaitIO(), where we attempt to
grab the SLRU buffer lock in shared mode after releasing the control
lock. slru.c:526 is in SlruInternalWritePage(), where we hold the
buffer lock while writing the page. This is commit
194b5ea3d0722f94e8a6ba9cec03b858cc8c9370, if you want to look up any of
the other line numbers.
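
To make the slru.c interaction a bit more concrete, here is a toy pthread
model of it; this is a stand-in for LWLocks, not PostgreSQL code, and all
of the names are invented. One thread plays the SlruInternalWritePage()
role and holds the per-buffer lock exclusively for the whole physical
write, while the other plays the SimpleLruWaitIO() role: it lets go of the
control lock and then queues for the buffer lock in shared mode, so it
sleeps until the write completes.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t control_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_rwlock_t buffer_lock = PTHREAD_RWLOCK_INITIALIZER;

static void *
writer(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&control_lock);
    pthread_rwlock_wrlock(&buffer_lock);    /* "slru.c:526": held across the write */
    pthread_mutex_unlock(&control_lock);

    sleep(2);                               /* the slow physical page write */

    pthread_rwlock_unlock(&buffer_lock);
    return NULL;
}

static void *
waiter(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&control_lock);
    /* sees the buffer marked write-in-progress, so it must wait for the I/O */
    pthread_mutex_unlock(&control_lock);

    pthread_rwlock_rdlock(&buffer_lock);    /* "slru.c:310": blocks until the write ends */
    pthread_rwlock_unlock(&buffer_lock);

    pthread_mutex_lock(&control_lock);      /* then goes back and rechecks the buffer */
    pthread_mutex_unlock(&control_lock);
    printf("waiter: write finished, buffer usable again\n");
    return NULL;
}

int
main(void)
{
    pthread_t   w, r;

    pthread_create(&w, NULL, writer, NULL);
    sleep(1);                               /* let the write get underway */
    pthread_create(&r, NULL, waiter, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}

Build it with -lpthread; the waiter's printf only comes out once the
two-second "write" finishes, which is exactly the shape of the slru.c:310
waits above, just with the sleep replaced by real disk I/O.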

If I filter for waits greater than 8s, a somewhat different picture emerges:

2 waited at indexam.c:521 blocked by bufmgr.c:2475
212 waited at slru.c:310 blocked by slru.c:526

In other words, some of the waits for SLRU pages to be written are...
really long. There were 126 that exceeded 10 seconds and 56 that
exceeded 12 seconds. "Painful" is putting it mildly.

I suppose one interesting question is whether there's a way to optimize
the disk configuration on this machine, or the Linux I/O scheduler, or
something else, so as to reduce the amount of time spent waiting for the
disk. But the other question is why we're waiting for
SLRU page writes to begin with. My guess based on previous testing is
that what's happening here is (1) we examine a tuple on an old page
and decide we must look up its XID, (2) the relevant CLOG page isn't
in cache so we decide to read it, but (3) the page we decide to evict
happens to be dirty, so we have to write it first. That sure seems
like something that a smart background writer ought to be able to fix
for us. Simon previously posted a patch for that:

http://archives.postgresql.org/pgsql-hackers/2012-01/msg00571.php

...but the testing I did at the time didn't seem to show a real clear benefit:

http://wiki.postgresql.org/wiki/Robert_Haas_9.2CF4_Benchmark_Results
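
To spell out the eviction sequence I have in mind, and why it is the sort
of work a background writer could take off the foreground path, here is a
toy, self-contained model of it. This is not SLRU code and every name in
it is invented for illustration; the point is just that a read miss whose
LRU victim is dirty has to write that victim out synchronously before it
can bring in the page it actually wants.

#include <stdio.h>
#include <stdbool.h>

#define NSLOTS 4

typedef struct Slot
{
    int         pageno;         /* which page is cached here, -1 if empty */
    bool        dirty;
    int         last_used;      /* crude LRU counter */
} Slot;

static Slot cache[NSLOTS];
static int  usage_clock;
static int  foreground_writes;  /* writes charged to the reading backend */

static void
write_page(int pageno)
{
    /* stand-in for the slow synchronous write the backend has to wait out */
    foreground_writes++;
    printf("foreground write of dirty page %d before the read can proceed\n",
           pageno);
}

static Slot *
read_page(int pageno)
{
    Slot       *victim = &cache[0];
    int         i;

    usage_clock++;
    for (i = 0; i < NSLOTS; i++)
    {
        if (cache[i].pageno == pageno)
        {
            cache[i].last_used = usage_clock;
            return &cache[i];   /* hit: no I/O at all */
        }
        if (cache[i].last_used < victim->last_used)
            victim = &cache[i];
    }

    /* miss: evict the least recently used slot */
    if (victim->dirty)
        write_page(victim->pageno);     /* the painful part */

    victim->pageno = pageno;            /* (pretend we read it from disk here) */
    victim->dirty = false;
    victim->last_used = usage_clock;
    return victim;
}

int
main(void)
{
    int         i;

    for (i = 0; i < NSLOTS; i++)
        cache[i].pageno = -1;

    /* dirty every cached page, then touch enough new pages to force evictions */
    for (i = 0; i < NSLOTS; i++)
        read_page(i)->dirty = true;
    for (i = NSLOTS; i < NSLOTS + 3; i++)
        read_page(i);

    printf("%d synchronous writes taken on the read path\n", foreground_writes);
    return 0;
}

None of which changes the fact that, in the benchmarks linked above, the
background-writer approach didn't show an obvious win.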

The obvious question here is: was that a problem with the patch, or a
problem with my testing methodology, or is it just that the
performance characteristics of the machine I used for that test (Nate
Boley's 32-core AMD box) were different from this one (IBM POWER7)? I
don't know, and I think I'm out of time to play with this for this
weekend, but I'll investigate further when time permits. Other
thoughts welcome.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment: lwhistogram-v2.patch (application/octet-stream, 11.1 KB)
