Re: Parallel Seq Scan

From: John Gorman <johngorman2(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Seq Scan
Date: 2015-01-13 12:08:41
Message-ID: CALkS6B8-A8uSG0J9a1fiGS_Q1BnL3aqovdZXYJKSeFLJZQb0Tw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 13, 2015 at 7:25 AM, John Gorman <johngorman2(at)gmail(dot)com> wrote:

>
>
> On Sun, Jan 11, 2015 at 6:00 PM, Robert Haas <robertmhaas(at)gmail(dot)com>
> wrote:
>
>> On Sun, Jan 11, 2015 at 6:01 AM, Stephen Frost <sfrost(at)snowman(dot)net>
>> wrote:
>> > So, for my 2c, I've long expected us to parallelize at the relation-file
>> > level for these kinds of operations. This goes back to my other
>> > thoughts on how we should be thinking about parallelizing inbound data
>> > for bulk data loads but it seems appropriate to consider it here also.
>> > One of the issues there is that 1G still feels like an awful lot for a
>> > minimum work size for each worker and it would mean we don't parallelize
>> > for relations less than that size.
>>
>> Yes, I think that's a killer objection.
>
>
> One approach that has worked well for me is to break big jobs into much
> smaller bite-size tasks. Each task is small enough to complete quickly.
>
> We add the tasks to a task queue and spawn a generic worker pool which
> eats through the task queue items.
>
> This solves a lot of problems.
>
> - Small to medium jobs can be parallelized efficiently.
> - No need to split big jobs perfectly.
> - We don't get into a situation where we are waiting around for a worker
> to finish chugging through a huge task while the other workers sit idle.
> - Worker memory footprint is tiny so we can afford many of them.
> - Worker pool management is a well known problem.
> - Worker spawn time disappears as a cost factor.
> - The worker pool becomes a shared resource that can be managed and
> reported on and becomes considerably more predictable.
>
>
I forgot to mention that a running task queue can provide metrics such as
current utilization, average throughput, queue length, and estimated queue
wait time. These can become dynamic cost factors when deciding whether to
parallelize.
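To make the idea concrete, here is a minimal sketch (illustrative only, not
PostgreSQL code) of the pattern described above: a big job is split into many
bite-size tasks, a fixed pool of generic workers drains a shared queue, and
the queue exposes the kind of metrics mentioned. All names (TaskPool, scan,
etc.) are hypothetical.

```python
import queue
import threading
import time

class TaskPool:
    """A generic worker pool eating through a shared task queue."""

    def __init__(self, num_workers=4):
        self.tasks = queue.Queue()
        self.completed = 0
        self.lock = threading.Lock()
        self.start = time.monotonic()
        self.workers = [threading.Thread(target=self._run, daemon=True)
                        for _ in range(num_workers)]
        for w in self.workers:
            w.start()

    def _run(self):
        while True:
            task = self.tasks.get()
            if task is None:          # poison pill: shut this worker down
                self.tasks.task_done()
                return
            task()                    # each task is small and finishes quickly
            with self.lock:
                self.completed += 1
            self.tasks.task_done()

    def submit(self, task):
        self.tasks.put(task)

    def metrics(self):
        # Dynamic cost factors: current queue length and average throughput.
        elapsed = time.monotonic() - self.start
        with self.lock:
            done = self.completed
        return {"queue_length": self.tasks.qsize(),
                "throughput": done / elapsed if elapsed > 0 else 0.0}

    def shutdown(self):
        # One poison pill per worker, then wait for the queue to drain.
        for _ in self.workers:
            self.tasks.put(None)
        self.tasks.join()

# Usage: a hypothetical "scan" of 1000 blocks, split into bite-size tasks
# so no single worker chugs through a huge chunk while the others sit idle.
results = []
results_lock = threading.Lock()
pool = TaskPool(num_workers=4)
for block in range(1000):
    def scan(block=block):
        with results_lock:
            results.append(block)
    pool.submit(scan)
pool.shutdown()
```

Because each task is tiny, work balances itself across workers automatically,
and there is no need to partition the job perfectly up front.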
