From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: tracking context switches with perf record
Date: 2012-03-30 16:27:36
Message-ID: CA+TgmoZ3bzMKrhXknDdqJMpw_7690Uc_Q6e_pwEKJ8FEVLqM=A@mail.gmail.com
Lists: pgsql-hackers

I've been playing around with "perf record" on the IBM POWER7 machine,
and it's pretty cool. One of the things I don't like is that some of
the tools don't produce very nice text reports that you can cut and
paste into an email; instead they give you a kind of text-based GUI
where you can zoom around and look at things. Still, it's nice.

One of the neat things I've just discovered is that you can use a
call-graph report on context-switches to find out what user code is
resulting in context-switching. For example, I ran a 32-client
pgbench test at scale factor 300 for half an hour and captured data
for 10 minutes relatively near the beginning. The partially expanded
call tree for postgres looks like this:

- 87.79% .schedule_timeout
   + 99.99% .unix_stream_recvmsg
- 11.63% .SyS_semtimedop
     .SyS_ipc
     syscall_exit
     semop
   - PGSemaphoreLock
      - 98.48% LWLockAcquire
         - 53.11% XLogInsert
            + 48.39% heap_update
            + 18.15% heap_insert
            + 16.36% RecordTransactionCommit
            + 15.67% log_heap_clean
            + 0.98% _bt_insertonpg
         + 18.71% ProcArrayEndTransaction
         - 14.77% SimpleLruReadPage_ReadOnly
              TransactionIdGetStatus
              TransactionLogFetch
            - TransactionIdDidCommit
               + 53.52% HeapTupleSatisfiesVacuum
               + 45.00% HeapTupleSatisfiesMVCC
               + 0.82% HeapTupleHeaderAdvanceLatestRemovedXid
         + 6.08% TransactionIdSetPageStatus
         + 2.63% LockBuffer
         + 1.92% GetSnapshotData
         + 1.51% SimpleLruReadPage
      + 1.52% ProcSleep
+ 0.50% do_work
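
(If you want to try this yourself, the basic recipe - not necessarily
the exact flags I used - is to record the context-switch software event
with call graphs and then browse the result with perf report, roughly:

   # capture context switches system-wide, with call graphs, for ten minutes
   perf record -e context-switches -g -a -- sleep 600
   # then explore the resulting call tree interactively
   perf report

Recording just the relevant backends with -p <pid-list> instead of -a
should work, too.)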

So what you can see here is that all the context switches that
happened in postgres were the result of either a .schedule_timeout or
a .SyS_semtimedop. 87.79% of the context switches came from
.schedule_timeout, and 99.99% of those came from .unix_stream_recvmsg.
If you expand that branch of the call tree, you find that all of them
are coming eventually from secure_read; the server is waiting for a
new query from the client. This is, obviously, overhead we can't
eliminate from this test; waiting for the client is part of the job.
The other 11.63% of the context switches result from PGSemaphoreLock
calling semop. Of those, 98.48% come from LWLockAcquire and 1.52%
come from ProcSleep. The ProcSleep calls are all from blocking while
attempting to acquire a heavyweight lock, and nearly all of them are
from conflicting tuple locks. However, they account for such a small
portion of the overall context switches that they're probably not
worth worrying about.
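
(Getting a cut-and-pasteable version of a branch like this out of the
interactive browser is awkward, as I mentioned; the text-mode report,
something along the lines of

   perf report --stdio --comms=postgres

dumps the expanded call tree as plain text, though the output is a lot
noisier than the hand-trimmed view above.)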

The more interesting waits, in my view anyway, are the ones that come
from LWLockAcquire, which account for nearly all of the semaphore
sleeps. As you can see, XLogInsert accounts for over half of those,
and ProcArrayEndTransaction and SimpleLruReadPage_ReadOnly account for
most of the rest. Just out of curiosity, I expanded a couple of those
another level, and you can see what's going on there, too. I was
slightly surprised by the number of calls to log_heap_clean, since I
wouldn't have expected vacuuming to generate that much work; it turns
out that 99.9% of those are HOT pruning events happening during query
execution.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
