Re: measuring lwlock-related latency spikes

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: measuring lwlock-related latency spikes
Date: 2012-03-31 21:14:38
Message-ID: CA+TgmobRVFHKt2pdHAJ_WsAF2SfcJdVm3H1KK0scsQ8Z-WoqeA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Mar 31, 2012 at 8:58 AM, Greg Stark <stark(at)mit(dot)edu> wrote:
> On Sat, Mar 31, 2012 at 4:41 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> But I didn't think we were ever supposed to hold content
>> locks for that long.
>
> Isn't that lock held while doing visibility checks?

Nope. heap_update() and friends do a very complicated little dance to
avoid that. Heikki articulated that rule when he installed the
visibility map in 8.4, and I had to work pretty hard to preserve it in
9.2 when I did the work to make the visibility map crash-safe, but now
I'm glad I did.

> What about I/O
> waiting for a clog page to be read?

I'm pretty sure that can happen, because TransactionIdIsCommitted()
can get called from HeapTupleSatisfies*() which pretty much only gets
called while holding the page lock. I don't know whether it's the
cause of these particular stalls, but it's plausible if the CLOG cache
is getting thrashed hard enough.

I did discover one systematic error in my testing methodology: I only
instrumented LWLockAcquire(), not LWLockAcquireOrWait(). The latter
turns out to be an important case in this instance since we use that
when flushing WAL. Just running the regression tests on my laptop
suggests that with that oversight corrected, WALWriteLock is going to
pop out as a huge source of latency spikes. But I will know for sure
after I do a more formal test run. I also modified the code to print
out debugging output every time we have to wait for a long time (I
think I've got it set to 10ms right now, but I might raise that if
it's too verbose in a real test run) with the file and line number
attempting the lock acquisition that blocked, and the file and line
number that had most recently acquired the lock at the time we first
discovered we needed to wait.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2012-03-31 21:34:33 Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter
Previous Message Mike Roest 2012-03-31 20:54:23 Re: pg_dump incredibly slow dumping a single schema from a large db