Re: Parallel query execution

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel query execution
Date: 2013-01-15 23:15:57
Message-ID: 20130115231557.GB16126@tamriel.snowman.net
Lists: pgsql-hackers

* Gavin Flower (GavinFlower(at)archidevsys(dot)co(dot)nz) wrote:
> How about being aware of multiple spindles - so if the requested
> data covers multiple spindles, then data could be extracted in
> parallel. This may, or may not, involve multiple I/O channels?

Yes, this should dovetail with partitioning and tablespaces to pick up
on exactly that. We're implementing our own poor-man's parallelism
using exactly this to use as much of the CPU and I/O bandwidth as we
can. I have every confidence that it could be done better and be
simpler for us if it was handled in the backend.

> On large multiple processor machines, there are different blocks of
> memory that might be accessed at different speeds depending on the
> processor. Possibly a mechanism could be used to split a transaction
> over multiple processors to ensure the fastest memory is used?

Let's work on getting it working on the h/w that PG is most commonly
deployed on first.. I agree that we don't want to paint ourselves into
a corner with this, but I don't think massive NUMA systems are what we
should focus on first (are you familiar with any that run PG today..?).
I don't expect we're going to be trying to fight with the Linux (or
whatever) kernel over what threads run on what processors with access to
what memory on small-NUMA systems (x86-based).

> Once a selection of rows has been made, then if there is a lot of
> reformatting going on, then could this be done in parallel? I can
> think of 2 very simplistic strategies: (A) use a different
> processor core for each column, or (B) farm out sets of rows to
> different cores. I am sure in reality, there are more subtleties
> and aspects of both the strategies will be used in a hybrid fashion
> along with other approaches.

Given our row-based storage architecture, I can't imagine we'd do
anything other than take a row-based approach to this.. I would think
we'd do two things: parallelize based on partitioning, and parallelize
seqscans across the individual heap files, which are already split on
1GB boundaries. Perhaps we can generalize that and scale it based on
the number of available processors and the size of the relation but I
could see advantages in matching up with what the kernel thinks are
independent files.
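
As a rough illustration of the per-file idea above (not backend code), the sketch below splits a relation into its 1GB heap segments and hands them out to workers. It assumes the default 8kB block size and 1GB segment size; the function names and the round-robin policy are hypothetical and chosen only for the example.

```python
BLOCK_SIZE = 8192                 # default PostgreSQL block size in bytes (assumed)
SEGMENT_BYTES = 1 << 30           # heap files are split at 1GB by default (assumed)
BLOCKS_PER_SEGMENT = SEGMENT_BYTES // BLOCK_SIZE  # 131072 blocks per segment


def segment_ranges(total_blocks):
    """Return (segment_no, first_block, end_block_exclusive) for each heap segment."""
    ranges = []
    start = 0
    seg = 0
    while start < total_blocks:
        end = min(start + BLOCKS_PER_SEGMENT, total_blocks)
        ranges.append((seg, start, end))
        start = end
        seg += 1
    return ranges


def assign_segments(total_blocks, n_workers):
    """Round-robin whole segments over workers (hypothetical scheduling policy)."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    plan = [[] for _ in range(n_workers)]
    for seg, first, end in segment_ranges(total_blocks):
        plan[seg % n_workers].append((seg, first, end))
    return plan


if __name__ == "__main__":
    # A relation of 300000 blocks (about 2.3GB) scanned by two workers.
    for worker, work in enumerate(assign_segments(300000, 2)):
        print(worker, work)
```

Keeping each unit of work aligned to a whole segment means every worker reads from a file the kernel already treats as independent; a finer-grained split by block range would be the generalization mentioned above.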

> I expect that before any parallel algorithm is invoked, then some
> sort of threshold needs to be exceeded to make it worth while.

Certainly. That would need to be included in the optimization model to
support this.
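
A minimal sketch of what such a threshold could look like, assuming a deliberately simple cost model in which the serial cost divides evenly across workers and each worker adds a fixed startup cost. The function name, parameters, and formula are illustrative assumptions, not anything the planner actually implements.

```python
def parallel_worthwhile(serial_cost, n_workers, startup_cost_per_worker):
    """Return True if the estimated parallel cost beats the serial cost.

    Hypothetical model: the work divides evenly across workers, and each
    worker contributes a fixed startup cost.
    """
    if n_workers < 2:
        return False  # a single worker is just the serial plan
    parallel_cost = serial_cost / n_workers + startup_cost_per_worker * n_workers
    return parallel_cost < serial_cost


if __name__ == "__main__":
    print(parallel_worthwhile(100, 4, 50))     # small scan: startup cost dominates
    print(parallel_worthwhile(10000, 4, 50))   # large scan: splitting pays off
```

Under this model a small scan stays serial because worker startup outweighs the saving, while a large scan crosses the threshold.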

Thanks,

Stephen
