Re: Parallel Seq Scan

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: David Rowley <dgrowleyml(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Fabrízio Mello <fabriziomello(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Seq Scan
Date: 2015-04-08 03:58:49
Message-ID: CAA4eK1JmbQNDSHErHs3Ord6qZeU50zmhkj9tj83SGMh7F7pxMw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 8, 2015 at 7:54 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> I agree that this is an area that needs more thought. I don't
> (currently, anyway) agree that the planner shouldn't know anything
> about parallelism. The problem with that is that there's lots of
> relevant stuff that can only be known at plan time. For example,
> consider the query you mention above on a table with no index. If the
> WHERE clause is highly selective, a parallel plan may well be best.
> But if the selectivity is only, say, 50%, a parallel plan is stupid:
> the IPC costs of shipping many rows back to the master will overwhelm
> any benefit we could possibly have hoped to get, and the overall
> result will likely be that the parallel plan both runs slower and uses
> more resources. At plan time, we have the selectivity information
> conveniently at hand, and can use that as part of the cost model to
> make educated decisions. Execution time is way too late to be
> thinking about those kinds of questions.
>
> I think one of the philosophical questions that has to be answered
> here is "what does it mean to talk about the cost of a parallel
> plan?". For a non-parallel plan, the cost of the plan means both "the
> amount of effort we will spend executing the plan" and also "the
> amount of time we think the plan will take to complete", but those two
> things are different for parallel plans. I'm inclined to think it's
> right to view the cost of a parallel plan as a proxy for execution
> time, because the fundamental principle of the planner is that we pick
> the lowest-cost plan. But there also clearly needs to be some way to
> prevent the selection of a plan which runs slightly faster at the cost
> of using vastly more resources.
>
> Currently, the planner tracks the best unsorted path for each relation
> as well as the best path for each useful sort order. Suppose we treat
> parallelism as another axis for judging the quality of a plan: we keep
> the best unsorted, non-parallel path; the best non-parallel path for
> each useful sort order; the best unsorted, parallel path; and the best
> parallel path for each sort order. Each time we plan a node, we
> generate non-parallel paths first, and then parallel paths. But, if a
> parallel plan isn't markedly faster than the non-parallel plan for the
> same sort order, then we discard it.
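
The bookkeeping proposed in the quoted paragraph could be sketched roughly as follows. This is only an illustration in Python, not planner code: the names Path, PathSet and PARALLEL_ADVANTAGE are hypothetical, the 0.5 threshold is an arbitrary stand-in for "markedly faster", and the real path comparison rules are more involved than a single cost number per sort order.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Path:
    cost: float                 # treated here as a proxy for execution time
    sort_order: Optional[str]   # None means "unsorted"
    parallel: bool


# Hypothetical threshold: a parallel path is kept only if its cost is at most
# this fraction of the best non-parallel path with the same sort order.
PARALLEL_ADVANTAGE = 0.5


class PathSet:
    """Keeps the cheapest path per (sort order, parallel?) pair for one relation."""

    def __init__(self) -> None:
        self.best: Dict[Tuple[Optional[str], bool], Path] = {}

    def add(self, path: Path) -> bool:
        """Offer a path; return True if it was retained."""
        if path.parallel:
            # Non-parallel paths are assumed to have been generated first.
            serial = self.best.get((path.sort_order, False))
            if serial is not None and path.cost > serial.cost * PARALLEL_ADVANTAGE:
                return False  # not markedly faster, so discard it
        key = (path.sort_order, path.parallel)
        current = self.best.get(key)
        if current is not None and current.cost <= path.cost:
            return False  # an equal or cheaper path already fills this slot
        self.best[key] = path
        return True


paths = PathSet()
paths.add(Path(cost=100.0, sort_order=None, parallel=False))  # kept
paths.add(Path(cost=90.0, sort_order=None, parallel=True))    # discarded
paths.add(Path(cost=40.0, sort_order=None, parallel=True))    # kept
```

With these assumed numbers, the parallel path costing 90 is thrown away because it is barely better than the serial path costing 100, while the one costing 40 survives.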

One disadvantage of retaining parallel paths is that it can increase
the number of combinations the planner needs to evaluate during
planning (in particular during join path evaluation), unless we add
some special handling to avoid evaluating such combinations.
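
To give a rough sense of the growth, here is a small back-of-the-envelope sketch in Python. It assumes, purely for illustration, that each relation retains one unsorted path plus one path per useful sort order, that keeping parallel paths doubles that set, and that every retained outer path is paired with every retained inner path for a join. The function names are hypothetical, and the result is an upper bound rather than what the planner would actually enumerate.

```python
def retained_paths(sort_orders: int, keep_parallel: bool) -> int:
    """Upper bound on paths kept for one relation.

    One unsorted path plus one per useful sort order; the count doubles
    if a parallel variant of each is retained as well.
    """
    per_axis = 1 + sort_orders
    return per_axis * (2 if keep_parallel else 1)


def join_pairings(outer_orders: int, inner_orders: int, keep_parallel: bool) -> int:
    """Upper bound on (outer path, inner path) pairs for one join method."""
    return (retained_paths(outer_orders, keep_parallel)
            * retained_paths(inner_orders, keep_parallel))


# Two relations with two useful sort orders each:
print(join_pairings(2, 2, keep_parallel=False))  # 9
print(join_pairings(2, 2, keep_parallel=True))   # 36
```

Under these assumptions the number of pairings per join method grows by a factor of four, which is why discarding parallel paths early, or otherwise skipping some combinations, matters.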

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
