Re: Parallel Seq Scan

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Seq Scan
Date: 2014-12-06 12:07:17
Message-ID: 20141206120717.GZ25679@tamriel.snowman.net
Lists: pgsql-hackers

* Amit Kapila (amit(dot)kapila16(at)gmail(dot)com) wrote:
> 1. As the patch currently stands, it just shares the relevant
> data (like relid, target list, the block range each worker should
> operate on, etc.) with the worker; the worker then receives that
> data, forms the planned statement it will execute, and sends the
> results back to the master backend. So the question
> here is: do you think this is reasonable, or should we try to form
> the complete plan for each worker and then share that,
> and maybe other information as well, like the range table entries
> which are required? My personal gut feeling in this matter
> is that in the long term it might be better to form the complete
> plan for each worker in the master and share that; however,
> I think the current way as done in the patch (okay, that needs
> some improvement) is also not bad and quite a bit easier to implement.

For my 2c, I'd like to see it support exactly what the SeqScan node
supports and then also what the ForeignScan node supports. That would
mean we'd then be able to push filtering down to the workers, which
would be great. Even better would be figuring out how to parallelize an
Append node (perhaps only possible when the nodes underneath are all
SeqScan or ForeignScan nodes), since that would allow us to parallelize
the work across multiple tables and remote servers.
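
Coming back to the minimal-information approach in your point 1:
concretely, I'd expect it to boil down to something like the struct
below in dynamic shared memory, enough to rebuild the scan (and,
ideally, its quals) rather than a full plan tree. To be clear, all of
these names are made up for illustration and aren't from the patch:

#include "postgres.h"
#include "storage/block.h"

/*
 * Hypothetical sketch of the per-worker state the master could place
 * in dynamic shared memory: enough to rebuild a scan and its quals,
 * rather than a fully-formed plan tree.
 */
typedef struct ParallelScanWorkerInfo
{
    Oid         scanrelid;      /* relation to scan */
    BlockNumber startblock;     /* first block assigned to this worker */
    BlockNumber endblock;       /* one past the last assigned block */
    Size        tlist_len;      /* length of serialized target list */
    Size        qual_len;       /* length of serialized quals */
    /* serialized target list and quals would follow in the segment */
} ParallelScanWorkerInfo;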

One of the big reasons why I was asking about performance data is that,
today, we can't easily split a single relation across multiple I/O
channels. Sure, we can use RAID to make the I/O channel the table sits
on faster than a single disk, possibly fast enough that a single CPU
can't keep up, but that's not quite the same. The historical
recommendation for Hadoop nodes is around one CPU per drive (of course,
it'll depend on workload, etc, etc, but still), and while there's still
a lot of testing, etc, to be done before we can be sure about the
'right' answer for PG (and it'll also vary based on workload, etc),
that strikes me as a pretty reasonable rule of thumb to go on.

Of course, I'm aware that this won't be as easy to implement..

> 2. The next question, related to the above, is what the output
> of EXPLAIN should be. As each worker is currently responsible
> for forming its own plan, EXPLAIN is not able to show
> the detailed plan for each worker; is that okay?

I'm not entirely following this. How can the worker be responsible for
its own "plan" when the information passed to it (per the above
paragraph..) is pretty minimal? In general, I don't think we need to
have specifics like "this worker is going to do exactly X" because we
will eventually need some communication to happen between the worker and
the master process where the worker can ask for more work because it's
finished what it was tasked with and the master will need to give it
another chunk of work to do. I don't think we want exactly what each
worker process will do to be fully formed at the outset because, even
with the best information available, given concurrent load on the
system, it's not going to be perfect and we'll end up starving workers.
The plan, as formed by the master, should be more along the lines of
"this is what I'm gonna have my workers do" along w/ how many workers,
etc, and then it goes and does it. Perhaps for an 'explain analyze' we
return information about what workers actually *did* what, but that's a
whole different discussion.
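
To sketch what I mean by "another chunk of work" (again, made-up names,
not anything from the patch): the shared state really only needs a
counter that a worker bumps when it runs dry, along the lines of:

#include "postgres.h"
#include "storage/block.h"
#include "storage/spin.h"

/* Shared scan state; the master fixes nblocks and chunk_size up front. */
typedef struct ParallelScanShared
{
    slock_t     mutex;          /* protects next_block */
    BlockNumber next_block;     /* next unclaimed block */
    BlockNumber nblocks;        /* total blocks in the relation */
    BlockNumber chunk_size;     /* blocks handed out per request */
} ParallelScanShared;

/* Claim the next chunk of blocks; returns false once the scan is done. */
static bool
claim_chunk(ParallelScanShared *shared, BlockNumber *first, BlockNumber *last)
{
    bool        found = false;

    SpinLockAcquire(&shared->mutex);
    if (shared->next_block < shared->nblocks)
    {
        *first = shared->next_block;
        *last = Min(*first + shared->chunk_size, shared->nblocks);
        shared->next_block = *last;
        found = true;
    }
    SpinLockRelease(&shared->mutex);
    return found;
}

With something like that, the plan never has to pre-assign exact block
ranges, and a worker stuck behind concurrent load simply ends up
claiming fewer chunks.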

> 3. Some places where optimizations are possible:
> - Currently, after getting the tuple from the heap, it is deformed by
> the worker and sent via the message queue to the master backend; the
> master backend then forms the tuple and sends it to the upper layer,
> which deforms it again via slot_getallattrs(slot) before sending it
> to the frontend.

If this is done as I was proposing above, we might be able to avoid
this, but I don't know that it's a huge issue either way.. The bigger
issue is getting the filtering pushed down.
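
For what it's worth, avoiding the deform/re-form round trip could be as
simple as shipping the raw tuple bytes over the queue. shm_mq_send() is
the real 9.4 queue API; the rest of this is only a sketch:

#include "postgres.h"
#include "access/htup.h"
#include "storage/shm_mq.h"

/*
 * Send a heap tuple to the master without deforming it: the raw t_data
 * bytes go over the queue, and the receiver gets the length back from
 * shm_mq_receive().
 */
static void
send_raw_tuple(shm_mq_handle *mqh, HeapTuple tuple)
{
    shm_mq_result result;

    result = shm_mq_send(mqh, tuple->t_len, tuple->t_data, false);
    if (result == SHM_MQ_DETACHED)
        ereport(ERROR,
                (errmsg("lost connection to master backend")));
}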

> - The master backend currently receives the data from multiple workers
> serially. We can optimize so that it checks the other queues
> if there is no data in the current queue.

Yes, this is pretty critical. In fact, it's one of the recommendations
I made previously about how to change the Append node to parallelize
Foreign Scan node work.
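
Concretely, I'd expect something along these lines, using the
non-blocking (nowait = true) mode of shm_mq_receive(); the structure
and names here are illustrative only:

#include "postgres.h"
#include "storage/shm_mq.h"

/*
 * Try each worker's queue in round-robin order rather than blocking on
 * one of them.  Returns false when no queue has data ready, in which
 * case the caller can wait on its latch.  A real version would also
 * handle SHM_MQ_DETACHED for workers that have exited.
 */
static bool
receive_from_any(shm_mq_handle **queues, int nqueues, int *current,
                 Size *nbytes, void **data)
{
    int         i;

    for (i = 0; i < nqueues; i++)
    {
        int         which = (*current + i) % nqueues;
        shm_mq_result result;

        result = shm_mq_receive(queues[which], nbytes, data, true);
        if (result == SHM_MQ_SUCCESS)
        {
            *current = which;   /* resume from here next time */
            return true;
        }
        /* SHM_MQ_WOULD_BLOCK: just try the next queue */
    }
    return false;
}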

> - The master backend is just responsible for coordination among the
> workers: it shares the required information with the workers and then
> fetches the data processed by each worker. With some more logic, we
> might be able to make the master backend also fetch data from the
> heap rather than doing just coordination among the workers.

I don't think this is really necessary...

> I think we can do some optimisation in all the above places; however,
> we can do that later as well, unless they hurt performance badly for
> the cases people care about most.

I agree that we can improve the performance through various
optimizations later, but it's important to get the general structure and
design right or we'll end up having to reimplement a lot of it.

> 4. Should the parallel_seqscan_degree value depend on other
> backend-count settings, as MaxConnections, max_worker_processes,
> and autovacuum_max_workers do, or should it be independent, like
> max_wal_senders?

Well, we're not going to be able to spin off more workers than we have
process slots, but I'm not sure we need anything more than that? In any
case, this is definitely an area we can work on improving later and I
don't think it really impacts the rest of the design.
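
In other words, the launch path can simply clamp it.
parallel_seqscan_degree here stands for the patch's proposed GUC, and
the function itself is just a sketch:

#include "postgres.h"

extern int  max_worker_processes;       /* existing core GUC */
extern int  parallel_seqscan_degree;    /* the patch's proposed GUC */

/* Never ask for more workers than there are bgworker slots. */
static int
effective_seqscan_degree(void)
{
    return Min(parallel_seqscan_degree, max_worker_processes);
}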

Thanks,

Stephen
