Re: Small improvement to parallel query docs

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Small improvement to parallel query docs
Date: 2017-02-13 20:21:22
Message-ID: CA+TgmoaAy9MQv2h4OwZ-rMcx9KNh_3eG24q61_ua061nSEDX+g@mail.gmail.com
Lists: pgsql-hackers

On Sun, Feb 12, 2017 at 7:16 PM, David Rowley
<david(dot)rowley(at)2ndquadrant(dot)com> wrote:
> Tomas Vondra pointed out to me that there's an error in parallel.sgml
> which confuses the inner and outer sides of the join.
>
> I've attached a patch which fixes this, although I think I'm still
> missing the point of the text's explanation of why Merge Join is not
> included due to it having to sort the inner side in each worker. Hash
> join must build a hash table for each worker, so why is that OK but
> sorting is not?
>
> Anyway, I've attached a patch which fixes the outer/inner confusion
> and cleans up a couple of other grammar mistakes and an omission
> regarding DISTINCT and ORDER BY not being supported in parallel
> aggregate. I ended up rewriting a section too which was explaining
> parallel aggregate, which I personally believe is a bit more clear to
> read, but someone else may think otherwise.

- loops or hash joins. The outer side of the join may be any kind of
+ loops or hash joins. The inner side of the join may be any kind of

Oops. That's clearly a mistake.

- table. Each worker will execute the outer side of the plan in full, which
- is why merge joins are not supported here. The outer side of a merge join
- will often involve sorting the entire inner table; even if it involves an
- index, it is unlikely to be productive to have multiple processes each
- conduct a full index scan of the inner table.
+ relation. Each worker will execute the inner side of the join in full,
+ which is why merge joins are not supported here. The inner side of a merge
+ join will often involve sorting the entire inner relation; even if it
+ involves an index, it is unlikely to be productive to have multiple
+ processes each conduct a full index scan of the inner side of the join.

Why s/table/relation/? I don't think that's useful, especially
because the first part of that very same paragraph would still say
"table".

Anyway, events have shown that the bit about merge joins was wrong
thinking on my part. See Dilip's emails about parallel merge join.
Maybe we should just change the text to say that it isn't supported,
without giving a reason. I hope to commit Dilip's patch to add support
for exactly that soon. Then we can remove the language altogether and
say that it is supported.

- <literal>COUNT(*)</>, each worker could compute a total, but those totals
- would need to combined in order to produce a final answer. If the query
- involved a <literal>GROUP BY</> clause, a separate total would need to
- be computed for each group. Even though aggregation can't be done entirely
- in parallel, queries involving aggregation are often excellent candidates
- for parallel query, because they typically read many rows but return only
- a few rows to the client. Queries that return many rows to the client
- are often limited by the speed at which the client can read the data,
- in which case parallel query cannot help very much.
+ <literal>COUNT(*)</>, each worker must compute subtotals which later must
+ be combined to produce an overall total in order to produce the final
+ answer. If the query involves a <literal>GROUP BY</> clause,
+ separate subtotals must be computed for each group seen by each parallel
+ worker. Each of these subtotals must then be combined into an overall
+ total for each group once the parallel aggregate portion of the plan is
+ complete. This means that queries which produce a low number of groups
+ relative to the number of input rows are often far more attractive to the
+ query planner, whereas queries which don't collect many rows into each
+ group are less attractive, due to the overhead of having to combine the
+ subtotals into totals, of which cannot run in parallel.

I don't think "of which cannot run in parallel" is good grammar. I'm
somewhat unsure whether the rest is an improvement or not. Other
opinions?
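
For anyone weighing the two wordings, the subtotal/total distinction
they both describe can be sketched roughly as follows. This is plain
Python, not PostgreSQL code: the function names only loosely mirror the
plan node names, and the two-way split of the rows is a made-up stand-in
for how a parallel scan might divide the work.

```python
from collections import Counter

def partial_aggregate(rows):
    """Per-worker step: COUNT(*) per group over this worker's rows only."""
    return Counter(rows)

def finalize_aggregate(partials):
    """Leader step: combine per-worker subtotals into one total per group.

    This step is serial, so its cost grows with the number of
    (worker, group) subtotals it has to merge.
    """
    totals = Counter()
    for subtotal in partials:
        totals.update(subtotal)
    return totals

# Hypothetical input: one GROUP BY key per row.
rows = ["a", "b", "a", "c", "a", "b"]

# Assumption: the rows are split evenly between two workers.
worker_slices = [rows[0::2], rows[1::2]]

partials = [partial_aggregate(s) for s in worker_slices]
totals = finalize_aggregate(partials)
```

With few groups relative to input rows, the serial merge handles only a
handful of subtotals; with nearly one group per row, it ends up redoing
most of the work, which is the trade-off the new text is trying to
describe.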

- a <literal>PartialAggregate</> node. Second, the partial results are
+ a <literal>Partial Aggregate</> node. Second, the partial results are

Oops, that's clearly a mistake.

- <literal>FinalizeAggregate</> node.
+ <literal>Finalize Aggregate</> node.

That one, too.

- Parallel aggregation is not supported for ordered set aggregates or when
- the query involves <literal>GROUPING SETS</>. It can only be used when
- all joins involved in the query are also part of the parallel portion
- of the plan.
+ Parallel aggregation is not supported if any aggregate function call
+ contains <literal>DISTINCT</> or <literal>ORDER BY</> clause and is also
+ not supported for ordered set aggregates or when the query involves
+ <literal>GROUPING SETS</>. It can only be used when all joins involved in
+ the query are also part of the parallel portion of the plan.

That chunk seems like an improvement to me.
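
As a toy illustration of why a DISTINCT aggregate does not fit the
subtotal-then-combine scheme, consider the sketch below. It is plain
Python with made-up data and a made-up two-way split, and it says
nothing about how the planner actually makes this decision; it only
shows that per-worker distinct counts cannot simply be added up.

```python
def count_distinct(values):
    """COUNT(DISTINCT x) over one set of input values."""
    return len(set(values))

# Hypothetical column values.
values = [1, 2, 2, 3, 3, 3]

# Assumption: the values are split evenly between two workers.
worker_slices = [values[0::2], values[1::2]]

# Naive combine: add up each worker's distinct count, the way
# COUNT(*) subtotals would be added.
naive = sum(count_distinct(s) for s in worker_slices)

# Correct answer, computed over all values at once.
exact = count_distinct(values)

# The two disagree because the same value can show up in more than
# one worker's slice and gets counted once per worker.
overcount = naive - exact
```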

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
