Re: Parallel query execution

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel query execution
Date: 2013-01-15 23:08:47
Message-ID: 20130115230847.GB32658@momjian.us
Lists: pgsql-hackers

On Wed, Jan 16, 2013 at 12:03:50PM +1300, Gavin Flower wrote:
> On 16/01/13 11:14, Bruce Momjian wrote:
>
> I mentioned last year that I wanted to start working on parallelism:
>
> https://wiki.postgresql.org/wiki/Parallel_Query_Execution
>
> Years ago I added thread-safety to libpq. Recently I added two parallel
> execution paths to pg_upgrade. The first parallel path allows execution
> of external binaries pg_dump and psql (to restore). The second parallel
> path does copy/link by calling fork/thread-safe C functions. I was able
> to do each in 2-3 days.
>
> I believe it is time to start adding parallel execution to the backend.
> We already have some parallelism in the backend:
> effective_io_concurrency and helper processes. I think it is time we
> start to consider additional options.
>
> Parallelism isn't going to help all queries; in fact, it might help just a
> small subset, but those will be the larger queries. The pg_upgrade
> parallelism only helps clusters with multiple databases or tablespaces,
> but the improvements are significant.
>
> I have summarized my ideas by updating our Parallel Query Execution wiki
> page:
>
> https://wiki.postgresql.org/wiki/Parallel_Query_Execution
>
> Please consider updating the page yourself or posting your ideas to this
> thread. Thanks.
>
>
> Hmm...
>
> How about being aware of multiple spindles - so if the requested data covers
> multiple spindles, then data could be extracted in parallel. This may, or may
> not, involve multiple I/O channels?

Well, we usually label these as tablespaces. I don't know if
spindle-level awareness is a reasonable thing to add.

> On large multiple processor machines, there are different blocks of memory that
> might be accessed at different speeds depending on the processor. Possibly a
> mechanism could be used to split a transaction over multiple processors to
> ensure the fastest memory is used?

That seems too far-out for an initial approach.

> Once a selection of rows has been made, then if there is a lot of reformatting
> going on, then could this be done in parallel? I can think of 2 very
> simplistic strategies: (A) use a different processor core for each column, or
> (B) farm out sets of rows to different cores. I am sure in reality, there are
> more subtleties and aspects of both the strategies will be used in a hybrid
> fashion along with other approaches.

Probably (B), but that is going to require making some of the modules
thread/fork-safe, and that is going to be tricky.
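
As a rough illustration of strategy (B), here is a minimal Python sketch
that splits a result set into chunks of rows, formats each chunk in a
worker, and reassembles the output in the original order. It is not
backend code: the names format_row, format_chunk, split_into_chunks, and
format_rows_parallel are hypothetical, and Python threads merely stand in
for whatever worker processes the backend would actually use.

```python
from concurrent.futures import ThreadPoolExecutor


def format_row(row):
    # Hypothetical per-row "reformatting": render every column as text.
    return "|".join(str(col) for col in row)


def format_chunk(chunk):
    # Each worker handles a whole set of rows, not a single column.
    return [format_row(row) for row in chunk]


def split_into_chunks(rows, chunk_size):
    # Contiguous slices keep the original row order recoverable.
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def format_rows_parallel(rows, workers=4, chunk_size=1000):
    chunks = split_into_chunks(rows, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in submission order, so output order
        # matches input order even if chunks finish out of order.
        formatted_chunks = list(pool.map(format_chunk, chunks))
    return [line for chunk in formatted_chunks for line in chunk]
```

The sketch sidesteps the hard part mentioned above: format_row touches no
shared state, which is exactly the property the real output functions
would need before they could be run this way.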

> I expect that before any parallel algorithm is invoked, then some sort of
> threshold needs to be exceeded to make it worthwhile. Different aspects of
> the parallel algorithm may have their own thresholds. It may not be worth
> applying a parallel algorithm for 10 rows from a simple table, but selecting
> 10,000 records from multiple tables each over 10 million rows using joins may
> benefit from more extreme parallelism.

Right, I bet we will need some way to control when the overhead of
parallel execution is worth it.
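
One way to picture such a control is a simple cost comparison. The Python
sketch below is only an illustration under assumed numbers: the function
name parallel_is_worthwhile, the parameters worker_startup_cost and
transfer_cost_per_row, and their default values are all hypothetical and
do not correspond to any existing setting.

```python
def parallel_is_worthwhile(est_rows, cost_per_row, workers,
                           worker_startup_cost=1000.0,
                           transfer_cost_per_row=0.1):
    """Return True if the estimated parallel cost beats the serial cost.

    All costs are in arbitrary planner-style units; the defaults are
    made-up values for illustration only.
    """
    if workers < 2:
        return False  # nothing to parallelize with a single worker

    serial_cost = est_rows * cost_per_row

    # Parallel plan: pay startup per worker, divide the row work evenly,
    # and pay a per-row cost to ship results back to the leader.
    parallel_cost = (workers * worker_startup_cost
                     + serial_cost / workers
                     + est_rows * transfer_cost_per_row)

    return parallel_cost < serial_cost
```

With these assumed defaults, 10 rows at unit cost with 4 workers is
rejected because startup overhead dominates, while 10 million rows with
the same settings is accepted.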

> I expect that UNIONs, as well as the processing of partitioned tables, may be
> amenable to parallel processing.

Interesting idea on UNION.
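
To sketch why UNION looks amenable: each branch can be evaluated
independently, and only the final duplicate removal needs all the rows
together. The Python below is a toy model under that assumption; the
function parallel_union and the idea of passing branches as callables are
hypothetical, and threads stand in for backend workers.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_union(branches, distinct=True, workers=4):
    """Evaluate each UNION branch concurrently and combine the rows.

    branches: list of zero-argument callables, each returning a list of
    row tuples (a stand-in for executing one arm of the UNION).
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(branch) for branch in branches]
        # Collect in branch order so the result is deterministic.
        results = [future.result() for future in futures]

    rows = [row for result in results for row in result]
    if not distinct:
        return rows  # UNION ALL: no duplicate removal needed

    # UNION: duplicate removal is the one step that sees every row.
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique
```

UNION ALL is the easy case, since the branches never need to be compared
against each other.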

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
