parallelizing slow queries for multiple cores (PostgreSQL + Gearman)

From: henk de wit <henk53602(at)hotmail(dot)com>
To: <pgsql-performance(at)postgresql(dot)org>, <mjy(at)geizhals(dot)at>
Subject: parallelizing slow queries for multiple cores (PostgreSQL + Gearman)
Date: 2009-03-18 12:52:52
Message-ID: COL104-W6C6DC49AD7187E84F0E43F5990@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance


Hi,
>Has anyone done similar work in the light of upcoming many-core CPUs/systems? Any better results than 2x improvement?

Yes, in fact I've done a very similar thing on a quad CPU box a while back. In my case the table in question had about 26 million rows. I did nothing special to the table (no cluster, no partitioning, nothing, of course the table did had the appropriate indexes). Queries on this table are analytic/reporting kind of queries. Basically they are just aggregations over a large number of rows. E.g. "the sum of column1 and the sum of column2 where time is some time and columnA has some value and columnB has some other value", that kind of thing. From analysis the queries appeared to be nearly 100% CPU bound.
In my (Java) application I divided a reporting query for say the last 60 days in 2 equal portions: day 1 to 30 and day 31 to 60 and assigned these to two worker threads. The results of these worker threads was merged using a simple resultset merge (the end result is simply the total of all rows returned by thread1 and thread2). The speed up I measured on the quad box was a near perfect factor 2. I then divided the workload in 4 equal portions: day 1 to 15, 16 to 30, 31 to 45 and 46 till 60. The speed up I measured was only a little less then a factor 4. I my situation too, the time I measured included dispatching the jobs to a thread pool and merging their results.
Of course, such a scheme can only be easily used when all workers return individual rows that are directly part of the end result. If some further calculation has to be done on those rows, which happens to be the same calculation that is also done in the query you are parallelizing, then in effect you are duplicating logic. If you do that a lot in your code you can easily create a maintenance nightmare. Also, you have to be aware that without additional measures, every worker lives in its own transaction. Depending on the nature of the data this could potentially result in inconsistent data being returned. In your case, on tables generated once per day this wouldn't be the case, but as a general technique you have to be aware of this.
Anyway, it's very clear that computers are moving to many-core architectures. Simple entry level servers already come these days with 8 cores. I've asked a couple of times on this list whether PG is going to support using multiple cores for a single query anytime soon, but this appears to be very unlikely. Until then it seems the only way to utilize multiple cores for a single query is doing it at the application level or by using something like pgpool-II.

_________________________________________________________________
Express yourself instantly with MSN Messenger! Download today it's FREE!
http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/

Browse pgsql-performance by date

  From Date Subject
Next Message Jignesh K. Shah 2009-03-18 13:38:14 Re: Proposal of tunable fix for scalability of 8.4
Previous Message Matthew Wakeling 2009-03-18 12:33:42 Re: Proposal of tunable fix for scalability of 8.4