mosbench revisited

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: mosbench revisited
Date: 2011-08-03 18:21:25
Message-ID: CA+TgmoZWdo9XrH=TN59GX8rJM9FgiezpAA-B57ZEVOGof49FVA@mail.gmail.com
Lists: pgsql-hackers

About nine months ago, we had a discussion of some benchmarking that
was done by the mosbench folks at MIT:

http://archives.postgresql.org/pgsql-hackers/2010-10/msg00160.php

Although the authors used PostgreSQL as a test harness for driving
load, it's pretty clear from reading the paper that their primary goal
was to stress the Linux kernel, so the applicability of the paper to
real-world PostgreSQL performance improvement is less than it might
be. Still, having now actually investigated in some detail many of
the same performance issues that they were struggling with, I have a
much clearer understanding of what's really going on here. In
PostgreSQL terms, here are the bottlenecks they ran into:

1. "We configure PostgreSQL to use a 2 Gbyte application-level cache
because PostgreSQL protects its free-list with a single lock and thus
scales poorly with smaller caches." This is a complaint about
BufFreeList lock which, in fact, I've seen as a huge point of
contention on some workloads. In fact, on read-only workloads, with
my lazy vxid lock patch applied, this is, I believe, the only
remaining unpartitioned LWLock that is ever taken in exclusive mode;
or at least the only one that's taken anywhere near often enough to
matter. I think we're going to do something about this, although I
don't have a specific idea in mind at the moment.
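
To make the shape of the problem concrete, here is a toy sketch (not
PostgreSQL's actual buffer manager code) of a free list guarded by a
single lock; every backend that needs a buffer funnels through the same
critical section, however short it is:

/* Illustrative sketch only -- not PostgreSQL source.  Shows why a single
 * lock over a shared free list serializes every buffer allocation. */
#include <pthread.h>
#include <stddef.h>

typedef struct FreeBuf
{
    struct FreeBuf *next;
    int             buf_id;
} FreeBuf;

static pthread_mutex_t freelist_lock = PTHREAD_MUTEX_INITIALIZER;
static FreeBuf *freelist_head;          /* shared free list */

/* Every backend that wants a buffer takes the same lock. */
int
get_free_buffer(void)
{
    int id = -1;

    pthread_mutex_lock(&freelist_lock);  /* the global choke point */
    if (freelist_head != NULL)
    {
        id = freelist_head->buf_id;
        freelist_head = freelist_head->next;
    }
    pthread_mutex_unlock(&freelist_lock);

    return id;                           /* -1 means: run the clock sweep */
}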

2. "PostgreSQL implements row- and table-level locks atop user-level
mutexes; as a result, even a non-conflicting row- or table-level lock
acquisition requires exclusively locking one of only 16 global
mutexes." I think that the reference to row-level locks here is a red
herring; or at least, I haven't seen any evidence that row-level
locking is a meaningful source of contention on any workload I've
tested. Table-level locks clearly are, and this is the problem that
the now-committed fastlock patch addressed. So, fixed!
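
For anyone who hasn't looked at the lock manager, the gist of the
16-partition scheme is roughly as follows; the names and hash function
here are made up for illustration, but the point stands: any heavyweight
lock acquisition must exclusively lock whichever of the 16 partition
mutexes its lock tag hashes to, so unrelated backends touching unrelated
tables can still collide.

#include <stdint.h>
#include <pthread.h>

#define NUM_LOCK_PARTITIONS 16

static pthread_mutex_t lock_partition[NUM_LOCK_PARTITIONS];

void
init_lock_partitions(void)
{
    for (int i = 0; i < NUM_LOCK_PARTITIONS; i++)
        pthread_mutex_init(&lock_partition[i], NULL);
}

/* hypothetical stand-in for hashing a lock tag (database OID plus
 * relation OID) down to a partition number */
static inline uint32_t
lock_tag_partition(uint32_t db_oid, uint32_t rel_oid)
{
    uint32_t h = (db_oid * 2654435761u) ^ (rel_oid * 40503u);

    return h % NUM_LOCK_PARTITIONS;
}

void
acquire_table_lock(uint32_t db_oid, uint32_t rel_oid)
{
    uint32_t part = lock_tag_partition(db_oid, rel_oid);

    pthread_mutex_lock(&lock_partition[part]);
    /* ... insert or update an entry in the shared lock hash table ... */
    pthread_mutex_unlock(&lock_partition[part]);
}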

3. "Our workload creates one PostgreSQL connection per server core and
sends queries (selects or updates) in batches of 256, aggregating
successive read-only transactions into single transactions. This
workload is intended to minimize application-level contention within
PostgreSQL in order to maximize the stress PostgreSQL places on the
kernel." I had no idea what this was talking about at the time, but
it's now obvious in retrospect that they were working around the
overhead imposed by acquiring and releasing relation and virtualxid
locks. My pending "lazy vxids" patch will address the remaining issue
here.
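
In libpq terms, their workaround amounts to something like the sketch
below (connection string and query are placeholders): wrapping a batch of
read-only queries in a single transaction means the relation and
virtualxid locks get taken once per batch rather than once per statement.

#include <stdio.h>
#include <libpq-fe.h>

#define BATCH_SIZE 256

int
main(void)
{
    PGconn *conn = PQconnectdb("dbname=postgres");   /* placeholder DSN */

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        return 1;
    }

    PQclear(PQexec(conn, "BEGIN"));
    for (int i = 0; i < BATCH_SIZE; i++)
    {
        /* placeholder query; the paper's workload used selects/updates */
        PGresult *res = PQexec(conn, "SELECT 1");

        PQclear(res);
    }
    PQclear(PQexec(conn, "COMMIT"));

    PQfinish(conn);
    return 0;
}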

4. "With modified PostgreSQL on stock Linux, throughput for both
workloads collapses at 36 cores ... The main reason is the kernel's
lseek implementation." With the fastlock, sinval-hasmessages, and
lazy-vxid patches applied (the first two are committed now), it's now
much easier to run headlong into this bottleneck. Prior to those
patches, for this to be an issue, you would need to batch your queries
together in big groups to avoid getting whacked by the lock manager
and/or sinval overhead first. With those problems and the recently
discovered bottleneck in glibc's random() implementation fixed, good
old pgbench -S is enough to hit this problem if you have enough
clients and enough cores. And it turns out that the word "collapse"
is not an exaggeration. On a 64-core Intel box running RHEL 6.1,
performance ramped up from 24k TPS at 4 clients to 175k TPS at 32
clients and then to 207k TPS at 44 clients. After that it fell off a
cliff, dropping to 93k TPS at 52 clients and 26k TPS at 64 clients,
consuming truly horrifying amounts of system time in the process. A
somewhat tedious investigation revealed that the problem is, in fact,
contention on the inode mutex caused by lseek(). Results are much
better with -M prepared (310k TPS at 48 clients, 294k TPS at 64
clients). All one-minute tests with scale factor 100, fitting inside
8GB of shared_buffers (clearly not enough for serious benchmarking,
but enough to demonstrate this issue).
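
For context, the reason lseek() shows up at all is that the server asks
the filesystem how long each relation file is in order to know how many
blocks it contains. Stripped of the smgr/md machinery, the operation
boils down to roughly this (a simplified stand-alone version, with a
placeholder path):

#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192

long
relation_nblocks(const char *path)
{
    int     fd = open(path, O_RDONLY);
    off_t   len;

    if (fd < 0)
        return -1;

    len = lseek(fd, 0, SEEK_END);        /* the contended syscall */
    close(fd);

    return (len < 0) ? -1 : (long) (len / BLCKSZ);
}

Every query that needs the table size repeats that syscall, and in the
kernel generic_file_llseek() takes a lock to read i_size, which is what
melts down at high client counts.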

It would be nice if the Linux guys would fix this problem for us, but
I'm not sure whether they will. For those who may be curious, the
problem is in generic_file_llseek() in fs/read_write.c. On a platform
with 8-byte atomic reads, it seems like it ought to be very possible
to read inode->i_size without taking a spinlock. A little Googling
around suggests that some patches along these lines have been proposed
and - for reasons that I don't fully understand - rejected. That now
seems unfortunate. Barring a kernel-level fix, we could try to
implement our own cache to work around this problem. However, any
such cache would need to be darn cheap to check and update (since we
can't assume that relation extension is an infrequent event) and must
somehow avoid having the same sort of mutex contention that's killing the
kernel in this workload.
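
Just to sketch what I mean, and with the hard part (cross-process
invalidation when another backend extends or truncates the relation)
waved away, a cheap cache might look something like the following. This
is an assumption about the general shape, not a worked-out design:

#include <stdatomic.h>
#include <stdint.h>

typedef struct RelSizeCacheEntry
{
    uint32_t        rel_oid;
    _Atomic int64_t nblocks;     /* -1 means "unknown, fall back to lseek" */
} RelSizeCacheEntry;

static int64_t
cached_nblocks(RelSizeCacheEntry *entry)
{
    /* a single atomic load -- no spinlock, no syscall */
    return atomic_load_explicit(&entry->nblocks, memory_order_acquire);
}

static void
note_relation_extended(RelSizeCacheEntry *entry, int64_t new_nblocks)
{
    /* relation extension must keep the cache current, so this path has
     * to stay cheap as well */
    atomic_store_explicit(&entry->nblocks, new_nblocks,
                          memory_order_release);
}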

5. With all of the above problems fixed or worked around, the authors
write, "PostgreSQL's overall scalability is primarily limited by
contention for the spinlock protecting the buffer cache page for the
root of the table index". This is the only problem on their list that
I haven't yet encountered in testing. I'm kind of interested by the
result, actually, as I had feared that the spinlock protecting
ProcArrayLock was going to be a bigger problem sooner. But maybe not.
I'm also concerned about the spinlock protecting the buffer mapping
lock that covers the root index page. I'll investigate further if and
when I come up with a way to dodge the lseek() contention problem.
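
To illustrate the failure mode they describe (again, a toy sketch rather
than our buffer manager): every index scan pins the same root-page
buffer, and the pin count sits behind a per-buffer spinlock, so every
backend ends up banging on the same cache line.

#include <stdatomic.h>

typedef struct BufferDescSketch
{
    atomic_flag spinlock;        /* per-buffer header lock */
    int         refcount;        /* pin count */
} BufferDescSketch;

static BufferDescSketch root_page_buf = { .spinlock = ATOMIC_FLAG_INIT };

void
pin_buffer(BufferDescSketch *buf)
{
    while (atomic_flag_test_and_set_explicit(&buf->spinlock,
                                             memory_order_acquire))
        ;                        /* spin: this is the contended step */
    buf->refcount++;
    atomic_flag_clear_explicit(&buf->spinlock, memory_order_release);
}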

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
