From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Seq Scan
Date: 2014-12-21 06:42:04
Message-ID: CAA4eK1Jf7ORdkDYFTNC6kYe7zcn5-49c_SztVdJ=1HC7KcdjMQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Dec 19, 2014 at 6:21 PM, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
>
> Amit,
>
> * Amit Kapila (amit(dot)kapila16(at)gmail(dot)com) wrote:
> > 1. Parallel workers help a lot when there is an expensive qualification
> > to evaluate; the more expensive the qualification, the better the
> > results.
>
> I'd certainly hope so. ;)
>
> > 2. It works well for low selectivity quals and as the selectivity
> > increases, the benefit tends to go down due to additional tuple
> > communication cost between workers and master backend.
>
> I'm a bit sad to hear that the communication between workers and the
> master backend is already being a bottleneck. Now, that said, the box
> you're playing with looks to be pretty beefy and therefore the i/o
> subsystem might be particularly good, but generally speaking, it's a lot
> faster to move data in memory than it is to pull it off disk, and so I
> wouldn't expect the tuple communication between processes to really be
> the bottleneck...
>

The main reason for the higher cost of tuple communication is that, at
this moment, I have used an approach to pass the tuples that is
comparatively less error prone and works within the existing FE/BE
protocol.

To explain briefly: currently the worker backend gets the tuple from the
page, deforms it, and sends the resulting values to the master backend
via the message queue; the master backend then forms the tuple again and
sends it to the upper layer, which deforms it once more via
slot_getallattrs(slot) before sending it to the frontend. The benefit of
this approach is that it works with the current protocol message ('D')
and with our current executor code.
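
To illustrate the round trip, here is a heavily simplified sketch of
that flow (the function names are the real backend ones, but the
flattening of pass-by-reference Datums into the message buffer and all
error handling are elided):

    /* Worker backend: deform the tuple and ship the column values. */
    Datum   values[MaxHeapAttributeNumber];
    bool    isnull[MaxHeapAttributeNumber];

    heap_deform_tuple(tuple, tupdesc, values, isnull);
    /* ... flatten values/isnull into a contiguous buffer (buf, len) ... */
    shm_mq_send(mqh, len, buf, false);

    /* Master backend: rebuild the tuple from the received values. */
    shm_mq_receive(mqh, &len, &data, false);
    /* ... unflatten data back into values/isnull ... */
    tuple = heap_form_tuple(tupdesc, values, isnull);
    ExecStoreTuple(tuple, slot, InvalidBuffer, true);

    /* printtup(), before emitting the 'D' message to the frontend,
     * deforms the very same tuple a second time: */
    slot_getallattrs(slot);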

Now, there are a couple of ways in which we could reduce the tuple
communication overhead.

a. Instead of passing the value array, just pass the tuple id, and
   retain the buffer pin till the master backend reads the tuple based
   on that tuple id. The side effect is that we have to retain the
   buffer pin for a longer period of time, but that might not be a
   problem in real-world usage of parallel query.

b. Instead of passing the value array, pass the tuple directly, so that
   the master backend can propagate it as-is to the upper layer (or
   change the master backend such that the tuple received via the
   shared memory queue can be propagated directly to the frontend).
   Basically, save the one extra cycle of forming/deforming the tuple
   (see the sketch after this list).

Both of these need a new message type and corresponding handling for it
in the executor code.
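
For what it's worth, a rough sketch of option (b) could look like the
following; the message framing and the executor-side handling are
precisely the parts that would need the new message type:

    /* Worker backend: send the raw tuple, no deforming required. */
    shm_mq_send(mqh, tuple->t_len, tuple->t_data, false);

    /* Master backend: wrap the received bytes back into a HeapTuple and
     * store it in the slot, with no form/deform cycle in between. */
    HeapTupleData htup;

    shm_mq_receive(mqh, &len, &data, false);
    htup.t_len = len;
    htup.t_data = (HeapTupleHeader) data;
    ExecStoreTuple(&htup, slot, InvalidBuffer, false);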

Having said that, I think we can try to optimize this in multiple ways;
however, each of them needs additional mechanism and changes in the
executor code, which is error prone and doesn't seem important at this
stage, where we want the basic feature to work.

> > 3. After a certain point, having more workers won't help and will
> > rather have a negative impact; refer to Test-4.
>
> Yes, I see that too and it's also interesting- have you been able to
> identify why? What is the overhead (specifically) which is causing
> that?
>

I think there are mainly two things that can lead to a benefit from
employing parallel workers:
a. better use of the available I/O bandwidth;
b. better use of the available CPUs by spreading expression evaluation
   across multiple workers.

The simple theory here is that there is a certain limit (in terms of
the number of parallel workers) up to which both of the above points
yield a benefit, and beyond which the overheads dominate: setting up
workers that are not actually needed, additional waiting by the master
backend for non-helping workers to finish their work, contention when
there are not enough CPUs available, and possibly other effects such as
oversubscribing the I/O channel, all of which can degrade performance
rather than improve it.

In the above tests, it seems to me that the maximum benefit due to (a)
is realized at up to 4~8 workers, while the maximum benefit due to (b)
depends upon the complexity (time to evaluate) of the expression. That
is why Test-1 ~ Test-3 still show a benefit above 8 parallel workers,
whereas Test-4 to Test-6 peak at 8 workers, after which there is either
no improvement or a degradation, for one or more of the reasons
explained in the previous paragraph.
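
Just to make that hand-waving slightly more concrete, a crude
back-of-the-envelope model (the notation below is mine, not anything in
the patch) would be:

    T(n) ~= t_setup(n) + (t_scan + t_qual) / n + s * N * t_comm

where n is the number of workers, s the selectivity, N the number of
tuples scanned, and t_comm the per-tuple communication cost paid
serially by the master backend. The middle term is what shrinks as
workers are added; the first and last terms are what eventually cap,
and then reverse, the benefit.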

I think an important point, which you mentioned as well, is that there
should be a reasonably good cost model that can account for some or all
of these factors, so that by using parallel query the user can get the
benefit it provides and won't have to pay a cost where there is little
or no benefit. I am not sure that in the first cut we can come up with
a highly robust cost model, but it should not be so weak that most of
the time the user has to find the right tuning for the parameters we
are going to add. Based on my understanding, and by referring to the
existing literature, I will try to come up with a cost model, and then
we can have a discussion, if required, on whether it is good enough for
the first cut.
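
As a strawman, and strictly subject to that discussion, the shape I
currently have in mind is roughly:

    parallel_cost = parallel_startup_cost * num_workers
                    + run_cost / (num_workers + 1)
                    + cpu_tuple_communication_cost * est_output_tuples

where run_cost is the cost of the equivalent non-parallel scan and
est_output_tuples is the estimated number of tuples shipped to the
master backend; the variable names match the ones proposed below, but
the formula itself is only illustrative.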

> > I think as discussed previously we need to introduce 2 additional cost
> > variables (parallel_startup_cost, cpu_tuple_communication_cost) to
> > estimate the parallel seq scan cost, so that when the table is small
> > or the selectivity is high, the cost of the parallel plan increases.
>
> I agree that we need to figure out a way to cost out parallel plans, but
> I have doubts about these being the right way to do that. There has
> been quite a bit of literature regarding parallel execution and
> planning- have you had a chance to review anything along those lines?

Not recently, but some time back I read quite a few papers on
parallelism. I will refer to some of them again before deciding on the
exact cost model, and we can discuss them as well.

> We certainly like to draw on previous experiences and analysis rather
> than trying to pave our own way.
>
> With these additional costs comes the consideration that we're looking
> for a wall-clock runtime proxy and therefore, while we need to add costs
> for parallel startup and tuple communication, we have to reduce the
> overall cost because of the parallelism or we'd never end up choosing a
> parallel plan. Is the thought to simply add up all the costs and then
> divide? Or perhaps to divide the cost of the actual plan but then add
> in the parallel startup cost and the tuple communication cost?
>
> Perhaps there has been prior discussion on these points but I'm thinking
> we need a README or similar which discusses all of this and includes any
> references out to academic papers or similar as appropriate.
>

Got the point; I think we need to cover this somewhere, either in a
README or in a file header.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
