Re: using custom scan nodes to prototype parallel sequential scan

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: using custom scan nodes to prototype parallel sequential scan
Date:	2014-11-12 00:54:21
Message-ID:	CA+Tgmoa7Rr-Z7yrnDG=1ZN1ta3ES+foh9Bh3mDNukQj3bKrY=g@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Nov 11, 2014 at 3:29 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> * only functions marked as "CONTAINS NO SQL"
> We don't really know what proisparallel is, but we do know what
> CONTAINS NO SQL means and can easily check for it.
> Plus I already have a patch for this, slightly bitrotted.

Interestingly, I have a fairly solid idea of what proisparallel is,
but I have no clear idea what CONTAINS NO SQL is or why it's relevant.
I would imagine that srandom() contains no SQL under any reasonable
definition of what that means, but it ain't parallel-safe.
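A toy model of that distinction, as a rough sketch only: a function can contain no SQL and still be parallel-unsafe because it changes state that is private to one process. The `Backend` class and its methods are hypothetical stand-ins and not PostgreSQL code; `set_seed` plays the role of srandom(), and the fixed initial seeds are an assumption made so the example is repeatable.

```python
import random


class Backend:
    """Hypothetical stand-in for one server process with private PRNG state."""

    def __init__(self, initial_seed):
        # Each process owns its generator; nothing here is shared.
        self.rng = random.Random(initial_seed)

    def set_seed(self, seed):
        # Plays the role of srandom(): no SQL, but it mutates process state.
        self.rng.seed(seed)

    def next_random(self):
        return self.rng.random()


leader = Backend(initial_seed=1)
worker = Backend(initial_seed=7)

# The query reseeds in the leader only; the worker never sees the change.
leader.set_seed(42)

leader_value = leader.next_random()
worker_value = worker.next_random()

# The leader follows the seed-42 sequence, the worker still follows seed 7,
# so the result depends on which process happens to evaluate the call.
print(leader_value == random.Random(42).random())
print(worker_value == leader_value)
```

Under this model, whether the reseed "took effect" depends on where the next call runs, which is the kind of behavior a parallel-safety flag would have to rule out.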

> * parallel_workers = 2 (or at least not make it user settable)
> By fixing the number of workers at 2 we avoid any problems caused by
> having N variable, such as how to vary N fairly amongst users and
> other such considerations. We get the main benefit of parallelism,
> without causing other issues across the server.

I think this is a fairly pointless restriction. The code
simplification we'll get out of it appears to me to be quite minor,
and we'll just end up putting the stuff back in anyway.

> * Fixed Plan: aggregate-scan
> To make everything simpler, allow only plans of a single type.
> SELECT something, list of aggregates
> FROM foo
> WHERE filters
> GROUP BY something
> because we know that passing large amounts of data from worker to
> master process will be slow, so focusing only on seq scan is not
> sensible; we should focus on plans that significantly reduce the
> number of rows passed upwards. We could just do this for very
> selective WHERE clauses, but that is not an important class of query.
> As soon as we include aggregates, we reduce data passing significantly
> AND we hit a very important subset of queries:

This is moving the goalposts in a way that I'm not at all comfortable
with. Parallel sequential-scan is pretty simple and may well be a win
if there's a restrictive filter condition involved. Parallel
aggregation requires introducing new infrastructure into the aggregate
machinery to allow intermediate state values to be combined, and that
would be a great project for someone to do at some time, but it seems
like a distraction for me to do that right now.
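To make "intermediate state values to be combined" concrete, here is a minimal sketch for avg(), assuming the per-group state is a (sum, count) pair. The function names `partial_avg`, `combine`, and `finalize` are illustrative only and do not correspond to anything in the aggregate machinery.

```python
from collections import defaultdict


def partial_avg(rows):
    """Worker side: fold rows into a per-group (sum, count) state."""
    state = defaultdict(lambda: (0, 0))
    for key, value in rows:
        total, count = state[key]
        state[key] = (total + value, count + 1)
    return dict(state)


def combine(states):
    """Leader side: merge the per-group states produced by each worker."""
    merged = defaultdict(lambda: (0, 0))
    for worker_state in states:
        for key, (total, count) in worker_state.items():
            merged_total, merged_count = merged[key]
            merged[key] = (merged_total + total, merged_count + count)
    return dict(merged)


def finalize(merged):
    """Turn each (sum, count) state into the final average."""
    return {key: total / count for key, (total, count) in merged.items()}


rows = [("a", 1), ("a", 3), ("b", 10), ("a", 5), ("b", 20)]
chunks = [rows[:2], rows[2:]]  # pretend each chunk went to one worker

parallel_result = finalize(combine(partial_avg(chunk) for chunk in chunks))
serial_result = finalize(partial_avg(rows))
print(parallel_result == serial_result)
```

The combine step is the new piece: a serial aggregate only ever needs the fold and the final step, so the merge of two states has no existing counterpart to reuse.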

> This plan type is widely used in reporting queries, so will hit the
> mainline of BI applications and many Mat View creations.
> This will allow SELECT count(*) FROM foo to go faster also.
>
> The execution plan for that query type looks like this...
> Hash Aggregate
>   Gather From Workers
>     {Worker Nodes workers = 2
>       HashAggregate
>       PartialScan}

I'm going to aim for the simpler:

Hash Aggregate
  -> Parallel Seq Scan
       Workers: 4

Yeah, I know that won't perform as well as what you're proposing, but
I'm fairly sure it's simpler.
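The performance gap between the two plan shapes comes down to how many tuples cross from workers to the master. A back-of-the-envelope sketch, using made-up data (4000 rows, 5 groups, 4 workers, a filter that keeps half the rows) and hypothetical helper names:

```python
def tuples_sent_scan_only(chunks, keep):
    """Workers only scan and filter; every surviving row goes to the master."""
    return sum(1 for chunk in chunks for row in chunk if keep(row))


def tuples_sent_partial_agg(chunks, keep):
    """Workers also aggregate; each sends one state per group it saw."""
    return sum(
        len({key for key, value in chunk if keep((key, value))})
        for chunk in chunks
    )


def keep(row):
    # Stand-in for the WHERE clause: keep rows with an even value.
    return row[1] % 2 == 0


rows = [(v % 5, v) for v in range(4000)]  # (group key, value)
chunks = [rows[i:i + 1000] for i in range(0, 4000, 1000)]  # one per worker

scan_only = tuples_sent_scan_only(chunks, keep)
partial_agg = tuples_sent_partial_agg(chunks, keep)
print(scan_only, partial_agg)  # 2000 20
```

With the scan-only shape the volume passed upward tracks the filter's selectivity; with worker-side aggregation it tracks the number of groups times the number of workers.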

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: using custom scan nodes to prototype parallel sequential scan at 2014-11-11 08:29:48 from Simon Riggs

Responses

Re: using custom scan nodes to prototype parallel sequential scan at 2014-11-14 00:15:25 from Simon Riggs
Re: using custom scan nodes to prototype parallel sequential scan at 2014-11-14 00:27:07 from Simon Riggs
