Re: parallelizing subplan execution

From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: parallelizing subplan execution
Date: 2010-02-21 07:37:53
Message-ID: 4B80E2D1.2040307@catalyst.net.nz
Lists: pgsql-hackers

Robert Haas wrote:
>
>
> It seems to me that you need to start by thinking about what kinds of
> queries could be usefully parallelized. What I think you're proposing
> here, modulo large amounts of hand-waving, is that we should basically
> find a branch of the query tree, cut it off, and make that branch the
> responsibility of a subprocess. What kinds of things would be
> sensible to hand off in this way? Well, you'd want to find nodes that
> are not likely to be repeatedly re-executed with different parameters,
> like subplans or inner-indexscans, because otherwise you'll get
> pipeline stalls handing the new parameters back and forth. And you
> want to find nodes that are expensive for the same reason. So maybe
> this would work for something like a merge join on top of two sorts -
> one backend could perform each sort, and then whichever one was the
> child would stream the tuples to the parent for the final merge. Of
> course, this assumes the I/O subsystem can keep up, which is not a
> given - if both tables are fed by the same, single spindle, it might
> be worse than if you just did the sorts consecutively.
>
> This approach might also benefit queries that are very CPU-intensive,
> on a multi-core system with spare cycles. Suppose you have a big tall
> stack of hash joins, each with a small inner rel. The child process
> does about half the joins and then pipelines the results into the
> parent, which does the other half and returns the results.
>
> But there's at least one other totally different way of thinking about
> this problem, which is that you might want two processes to cooperate
> in executing the SAME query node - imagine, for example, a big
> sequential scan with an expensive but highly selective filter
> condition, or an enormous sort. You have all the same problems of
> figuring out when it's actually going to help, of course, but the
> details will likely be quite different.
>
> I'm not really sure which one of these would be more useful in
> practice - or maybe there are even other strategies. What does
> $COMPETITOR do?
>
> I'm also ignoring the difficulties of getting hold of a second backend
> in the right state - same database, same snapshot, etc. It seems to
> me unlikely that there are a substantial number of real-world
> applications for which this will work very well if we have to
> actually start a new backend every time we want to parallelize a
> query. IOW, we're going to need, well, a connection pool in core.
> *ducks, runs for cover*
>
>

One thing that might work quite well is slicing the work up by partition,
with each process handling its own partitions (properly implemented
partitioning would go along with this nicely too...).
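
To make the idea a bit more concrete, here is a toy sketch in Python. It has
nothing to do with the actual executor: the partition layout, the filter and
all the function names are made up for illustration, and threads merely stand
in for cooperating backends. Each worker scans one partition and applies an
expensive but selective filter locally, and the parent just concatenates the
per-partition results.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy "table", range-partitioned on an integer key.
PARTITIONS = {
    "p0": list(range(0, 100)),
    "p1": list(range(100, 200)),
    "p2": list(range(200, 300)),
}


def expensive_filter(row):
    # Stand-in for a costly but highly selective predicate.
    return row % 50 == 0


def scan_partition(name):
    # One worker scans exactly one partition and filters it locally,
    # so only the surviving rows travel back to the parent.
    return [row for row in PARTITIONS[name] if expensive_filter(row)]


def parallel_scan(partition_names, max_workers=3):
    # Assumption: threads play the role of extra backends here.
    # map() yields results in the order the partitions were submitted.
    rows = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for chunk in pool.map(scan_partition, partition_names):
            rows.extend(chunk)
    return rows


def serial_scan(partition_names):
    # Reference result: the same scan done by a single process.
    rows = []
    for name in partition_names:
        rows.extend(scan_partition(name))
    return rows


if __name__ == "__main__":
    print(parallel_scan(["p0", "p1", "p2"]))
```

The nice property is that the slices need no coordination while they run:
no parameters are handed back and forth, so the pipeline stalls Robert
mentions should not arise. Whether the I/O subsystem can keep up is a
separate question, as he notes.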

regards

Mark
