Re: Proposal of tunable fix for scalability of 8.4

From: Scott Carey <scott(at)richrelevance(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 22:15:51
Message-ID: C5DED7A7.33B2%scott@richrelevance.com
Lists: pgsql-performance


On 3/12/09 1:35 PM, "Greg Smith" <gsmith(at)gregsmith(dot)com> wrote:

> On Thu, 12 Mar 2009, Jignesh K. Shah wrote:
>
>> As soon as I get more "cycles" I will try variations of it but it would
>> help if others can try it out in their own environments to see if it
>> helps their instances.
>
> What you should do next is see whether you can remove the bottleneck your
> test is running into via using a connection pooler.

I doubt it is running into a bottleneck due to that; the symptoms aren't right. He can change his test to have near-zero delay to simulate such a connection pool.

If it were an issue due to concurrency at that level, the results would not have scaled linearly with user count to a plateau the way they did. There would be a steep drop-off from thrashing as concurrency kept going up. Context switch data would help, since the thrashing shows up as a measurable there. I see no evidence of concurrency thrashing yet, but more tests and data would help.

The disconnect is that the Users column in his data does not represent back-ends. It represents concurrent users on the front-end. Whether these are pooled while idle is not clear. It would be useful to rule that possibility out, but that looks like an improbable diagnosis to me given the lack of performance decrease as concurrency goes up.
Furthermore, if the problem were due to too much concurrency in the database with active connections, it's hard to see how changing the lock code would change the result the way it did: increasing CPU and throughput accordingly. Again, context switch rate info would help rule out many possibilities.

> That's what I think
> most informed people would do were you to ask how to setup an optimal
> environment using PostgreSQL that aimed to serve thousands of clients.
> If that makes your bottleneck go away, that's what you should be
> recommending to customers who want to scale in this fashion too.

First, just run a test with a tiny delay (5ms? 0?) and fewer users to compare. If your theory that a connection pooler would help is right, that test would provide higher throughput with a low user count and not be lock limited. This may be easier to run than setting up a pooler, though he should investigate one regardless.
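As a rough aid for sizing that comparison run, here is a minimal sketch based on Little's Law for a closed-loop test harness (users = throughput x (response time + think time)). The function names and the numbers in the usage note are illustrative assumptions, not values from Jignesh's test, and the sketch assumes response time stays the same when the delay changes, which will not hold exactly under contention.

```python
def active_requests(users, response_time_s, think_time_s):
    """Average number of requests in flight in a closed-loop test.

    Little's Law: users = throughput * (response_time + think_time),
    and in-flight requests = throughput * response_time.
    """
    cycle_s = response_time_s + think_time_s
    if cycle_s <= 0:
        raise ValueError("response time plus think time must be positive")
    throughput = users / cycle_s          # requests per second
    return throughput * response_time_s   # requests being serviced on average


def users_for_same_load(users, response_time_s, old_think_s, new_think_s):
    """Users needed at a new think time to keep the same in-flight load.

    Assumes (hypothetically) that response time is unchanged between runs.
    """
    if response_time_s <= 0:
        raise ValueError("response time must be positive")
    in_flight = active_requests(users, response_time_s, old_think_s)
    return in_flight * (response_time_s + new_think_s) / response_time_s
```

For example, with made-up figures of 1000 harness users, a 10 ms response time and a 190 ms delay, `active_requests(1000, 0.010, 0.190)` gives about 50 requests in flight, and `users_for_same_load(1000, 0.010, 0.190, 0.005)` suggests roughly 75 users at a 5 ms delay.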

> If the
> bottleneck moves to somewhere else, that new hot spot might be one people
> care more about. Given that there are multiple good pooling solutions
> floating around already, it's hard to justify dumping coding and testing
> resources here if that makes the problem move somewhere else.

It's worth ruling out given that even if the likelihood is small, the fix is easy. However, I don't see the throughput drop from peak as more concurrency is added that is the hallmark of this problem - usually with a lot of context switching and a sudden increase in CPU use per transaction.
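The distinction drawn here, a plateau versus a drop from peak, can be checked mechanically against a results table. The following is only a sketch: the function name, the 10% tolerance, and the sample figures in the usage note are assumptions for illustration and are not taken from the posted results.

```python
def classify_scaling(samples, drop_tolerance=0.10):
    """Classify (users, throughput) samples by what happens after the peak.

    Returns "drop after peak" if any sample beyond the peak falls more than
    drop_tolerance below it, "plateau" if throughput holds near the peak,
    and "still scaling" if the peak is at the highest user count tested.
    """
    if not samples:
        raise ValueError("need at least one (users, throughput) sample")
    ordered = sorted(samples)  # sort by user count
    peak_idx = max(range(len(ordered)), key=lambda i: ordered[i][1])
    peak = ordered[peak_idx][1]
    after_peak = [tput for _, tput in ordered[peak_idx + 1:]]
    if not after_peak:
        return "still scaling"
    if min(after_peak) < peak * (1.0 - drop_tolerance):
        return "drop after peak"
    return "plateau"
```

With invented numbers, `classify_scaling([(100, 1000), (200, 2000), (400, 3000), (800, 2950), (1600, 2980)])` reports a plateau, while a series that falls from 3000 to 1200 past the peak is reported as a drop. Noisy data can put the peak at the last sample, so more data points past the knee make the result more trustworthy.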

The biggest disconnect in load testing almost always occurs over the definition of "concurrent users".
Think of an HTTP app, backed by a db - about as simple as it gets these days (this is fun with 5, 6 tier fanned out stuff).

"Users" could mean:
1. Number of application user logins used.
2. Number of test harness threads or processes that are active.
3. Number of open HTTP connections.
4. Number of HTTP requests being processed.
5. Number of connections from the app to the db.
6. Number of active connections from the app to the db.
Knowing which of these is the topic, and what that means in relation to all the others, is often messy. Without knowing which one it is in a result, you can still learn a lot. The data in the results here prove it's not the last one on the list above, nor the first one. It could still be any of the middle four, but is most likely #2 or the second to last one (which might be equivalent).
