Re: a funnel by any other name

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: a funnel by any other name
Date: 2015-09-22 14:34:11
Message-ID: CANP8+jK6SLnND6tGwNdpkw=h_SyoCt8Nd5521AOyA50M9NrNsg@mail.gmail.com
Lists: pgsql-hackers

On 17 September 2015 at 05:07, Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>
wrote:

> 2015-09-17 Robert Haas <robertmhaas(at)gmail(dot)com>:
>
> > 1. Exchange Bushy
> > 2. Exchange Inter-Operator (this is what's currently implemented)
> > 3. Exchange Replicate
> > 4. Exchange Merge
> > 5. Interchange
>
> > 1. ?
> > 2. Gather
> > 3. Broadcast (sorta)
> > 4. Gather Merge
> > 5. Redistribute
>
> > 1. Parallel Child
> > 2. Parallel Gather
> > 3. Parallel Replicate
> > 4. Parallel Merge
> > 5. Parallel Redistribute
>
> FYI, SQL Server has these in its execution plans:
>
> * Distribute Streams: read from one thread, write to multiple threads
> * Repartition Streams: both read and write from/to multiple threads
> * Gather Streams: read from multiple threads, write to one thread
>

Robert, thanks for asking. We'll be stuck with these words for some time,
and they are user-visible via EXPLAIN, so this is important.

In general we should stick to words already used in other similar
situations, which could include DBMS and parallel ETL tools, of which there
are many more than mentioned here.

I would be against using any of these words: Funnel, Motion, Bushy because
I don't find them very descriptive (I think of spiders, bowels and shrubs
respectively, sorry).

These words are liable to confusion with other concepts: Replicate,
Duplicate, Distribute, Partition, Repartition, MERGE.

I've seen this concept called Fan-In/Fan-Out and Scatter/Gather.

The main operations are the 3 mentioned by Nicolas:
1. Send data from many to one - which has subtypes for Unsorted, Sorted and
Evenly balanced (but unsorted)
2. Send data from one process to many
3. Send data from many to many
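
As a rough illustration of the three operations above, here is a minimal
sketch that uses in-memory Python lists to stand in for worker processes.
All function names are hypothetical and are not PostgreSQL executor nodes.
Round-robin for the one-to-many case and hash routing for the many-to-many
case are assumptions made for the example, not something proposed in this
thread.

```python
import heapq
from itertools import chain


def gather(streams):
    """Many to one, unsorted: concatenate the workers' outputs."""
    return list(chain.from_iterable(streams))


def gather_sorted(streams):
    """Many to one, sorted: merge streams that are each already sorted."""
    return list(heapq.merge(*streams))


def scatter(rows, n):
    """One to many: deal rows round-robin to n consumers (assumed policy)."""
    out = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


def redistribute(streams, n, key=lambda row: row):
    """Many to many: route each row to consumer hash(key) % n (assumed policy)."""
    out = [[] for _ in range(n)]
    for stream in streams:
        for row in stream:
            out[hash(key(row)) % n].append(row)
    return out
```

The sorted variant only differs from the plain one in how the inputs are
combined, which is why it reads naturally as a subtype of the same
many-to-one operation.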

My preferences for this would be
1. Gather (but not Gather Motion) e.g. Gather, Gather Sorted
2. Scatter (Broadcast only makes sense in the context of a
distributed query; it sounds weird for an intra-node query)
3. Redistribution - which implies the description of how we spread data
across nodes is "Distribution" (or DISTRIBUTED BY)

For 3 we should definitely use Redistribute, since this is what Teradata
has been calling it for 30 years, which is where Greenplum got it from.
For 1, Gather makes most sense.

For 2, it could be either Scatter or Distribute. The former works well with
Gather, the latter works well with Redistribute.

Sorry for my absence from further review of parallel ops.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
