Re: Scaling up PostgreSQL in Multiple CPU / Dual Core

From: Chris Browne <cbbrowne(at)acm(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Scaling up PostgreSQL in Multiple CPU / Dual Core
Date: 2006-03-24 18:21:23
Message-ID: 60zmjfsnz0.fsf@dba2.int.libertyrms.com
Lists: pgsql-performance

jnasby(at)pervasive(dot)com ("Jim C. Nasby") writes:
> On Thu, Mar 23, 2006 at 09:22:34PM -0500, Christopher Browne wrote:
>> Martha Stewart called it a Good Thing when smarlowe(at)g2switchworks(dot)com (Scott Marlowe) wrote:
>> > On Thu, 2006-03-23 at 10:43, Joshua D. Drake wrote:
>> >> > Has someone been working on the problem of splitting a query into pieces
>> >> > and running it on multiple CPUs / multiple machines? Yes. Bizgress has
>> >> > done that.
>> >>
>> >> I believe that is limited to Bizgress MPP yes?
>> >
>> > Yep. I hope that someday it will be released to the postgresql global
>> > dev group for inclusion. Or at least parts of it.
>>
>> Question: Does the Bizgress/MPP use threading for this concurrency?
>> Or forking?
>>
>> If it does so via forking, that's more portable, and less dependent on
>> specific complexities of threading implementations (which amounts to
>> non-portability ;-)).
>>
>> Most times Jan comes to town, we spend a few minutes musing about the
>> "splitting queries across threads" problem, and dismiss it again; if
>> there's the beginning of a "split across processes," that's decidedly
>> neat :-).
>
> Correct me if I'm wrong, but there's no way to (reasonably) accomplish
> that without having some dedicated extra processes laying around that
> you can use to execute the queries, no? In other words, the cost of a
> fork() during query execution would be too prohibitive...

Counterexample...

The sort of scenario we keep musing about is where you split off a
(thread|process) for each partition of a big table. There is in fact
a natural partitioning to use, in that table files are split into 1GB
segments by default.
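
(For reference, an 8GB heap shows up on disk as eight 1GB segment
files under $PGDATA/base/<database oid>/, named along the lines of

    16384, 16384.1, 16384.2, ... 16384.7

where the number is the relation's relfilenode; 16384 is just an
illustrative value.)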

Consider doing a join between two tables that are each 8GB in size
(i.e., each consists of eight 1GB data files). Let's assume that the
query plan indicates doing seq scans on both.

You *know* you'll be reading through 16 files, each 1GB in size.
Spawning a process for each of those files doesn't strike me as
"prohibitively expensive."

A naive read on this is that you might start with one backend process,
which then spawns 16 more. Each of those backends is scanning through
one of those 16 files; they then throw relevant tuples into shared
memory to be aggregated/joined by the central one.

That particular scenario is one where the fork()s would hardly be
noticeable.
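
To make the shape of that concrete, here is a toy sketch in plain C.
It is emphatically not backend code: the segment.N file names, the
NSEGMENTS constant, and the "does the record contain an 'x'" qual are
all invented for illustration. A coordinator fork()s one worker per
segment file, each worker scans its file and drops its result into a
slot in anonymous shared memory, and the coordinator waits and then
aggregates.

    /* Toy sketch: one worker process per 1GB segment file. */
    #define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/wait.h>

    #define NSEGMENTS 16     /* two 8GB relations => 16 x 1GB segments */

    /* Stand-in for a seq scan over one segment: count newline-delimited
     * records containing the byte 'x'.  A real worker would apply the
     * quals and copy matching tuples into shared memory instead. */
    static long scan_segment(const char *path)
    {
        FILE *f = fopen(path, "r");
        if (f == NULL)
            return 0;
        long matches = 0;
        char line[8192];
        while (fgets(line, sizeof(line), f) != NULL)
            if (strchr(line, 'x') != NULL)
                matches++;
        fclose(f);
        return matches;
    }

    int main(void)
    {
        /* One result slot per worker, in anonymous shared memory. */
        long *results = mmap(NULL, NSEGMENTS * sizeof(long),
                             PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (results == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        for (int i = 0; i < NSEGMENTS; i++) {
            pid_t pid = fork();
            if (pid < 0) {
                perror("fork");
                return 1;
            }
            if (pid == 0) {          /* worker: scan exactly one segment */
                char path[64];
                snprintf(path, sizeof(path), "segment.%d", i);
                results[i] = scan_segment(path);
                _exit(0);
            }
        }

        /* Coordinator: wait for every worker, then aggregate. */
        while (wait(NULL) > 0)
            ;
        long total = 0;
        for (int i = 0; i < NSEGMENTS; i++)
            total += results[i];
        printf("matching records across %d segments: %ld\n",
               NSEGMENTS, total);

        munmap(results, NSEGMENTS * sizeof(long));
        return 0;
    }

In a real implementation the workers would push qualifying tuples (not
counts) through shared memory, and you would cap the number of workers
at something sensible; but the sixteen fork()s themselves are lost in
the noise next to 16GB of sequential I/O.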

> FWIW, DB2 executes all queries in a dedicated set of processes. The
> process handling the connection from the client will pass a query
> request off to one of the executor processes. I can't remember which
> process actually plans the query, but I know that the executor runs
> it.

It seems to me that the kinds of cases where extra processes/threads
would be warranted are quite likely to be cases where fork()ing may be
an immaterial cost.
--
let name="cbbrowne" and tld="ntlug.org" in String.concat "@" [name;tld];;
http://www.ntlug.org/~cbbrowne/languages.html
TECO Madness: a moment of convenience, a lifetime of regret.
-- Dave Moon
