Re: Gather performance analysis

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Gather performance analysis
Date: 2021-08-28 07:10:39
Message-ID: CAFiTN-u2McrrYTpR_X4cqU04X=sX-g6pn4VH0=oR-0aoqtpMEg@mail.gmail.com

On Tue, Aug 24, 2021 at 8:48 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Fri, Aug 6, 2021 at 4:31 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
> > 1) Non-parallel (default)
> > Execution Time: 31627.492 ms
> >
> > 2) Parallel with 4 workers (force by setting parallel_tuple_cost to 0)
> > Execution Time: 37498.672 ms
> >
> > 3) Same as above (2) but with the patch.
> > Execution Time: 23649.287 ms
>
> This strikes me as an amazingly good result. I guess before seeing
> these results, I would have said that you can't reasonably expect
> parallel query to win on a query like this because there isn't enough
> for the workers to do. It's not like they are spending time evaluating
> filter conditions or anything like that - they're just fetching tuples
> off of disk pages and sticking them into a queue. And it's unclear to
> me why it should be better to have a bunch of processes doing that
> instead of just one. I would have thought, looking at just (1) and
> (2), that parallelism gained nothing and communication overhead lost 6
> seconds.
>
> But what this suggests is that parallelism gained at least 8 seconds,
> and communication overhead lost at least 14 seconds. In fact...
>

Right, good observation.

> > - If I apply both Experiment#1 and Experiment#2 patches together then,
> > we can further reduce the execution time to 20963.539 ms (with 4
> > workers and 4MB tuple queue size)
>
> ...this suggests that parallelism actually gained at least 10-11
> seconds, and the communication overhead lost at least 15-16 seconds.
>

Yes

> If that's accurate, it's pretty crazy. We might need to drastically
> reduce the value of parallel_tuple_cost if these results hold up and
> this patch gets committed.
>

In one of my experiments [Test1] I have noticed that even on head the
forced parallel plan is significantly faster than the non-parallel plan,
and with the patch it is even better. The point is that there may already
be cases where forced parallel plans are faster, but that alone does not
tell us whether we can safely reduce parallel_tuple_cost. With the patch,
however, it is certain that the parallel tuple queue is faster than what
we have now, so I agree we should consider reducing parallel_tuple_cost
once this patch goes in.

Additionally, I've done some more experiments with artificial workloads, as
well as workloads where the parallel plan is selected by default, and in
all cases I've seen a significant improvement. The gain is directly
proportional to the load on the tuple queue, as expected.

Test1: (Workers return all tuples, but only a few rows reach the client)
----------------------------------------------------
-- assumed schema, not shown in the original post:
-- CREATE TABLE t(a int, b text);
INSERT INTO t SELECT i%10, repeat('a', 200) FROM
generate_series(1,200000000) AS i;
SET max_parallel_workers_per_gather=4;

Target Query: SELECT random() FROM t GROUP BY a;

Non-parallel (default plan): 77170.421 ms
Parallel (parallel_tuple_cost=0): 53794.324 ms
Parallel with patch (parallel_tuple_cost=0): 42567.850 ms

A ~20% gain compared to the forced parallel plan, and a ~45% gain
compared to the default plan.

Test2: (Parallel case with default parallel_tuple_cost)
----------------------------------------------
-- assumes t is truncated and reloaded for this test (not shown above):
INSERT INTO t SELECT i, repeat('a', 200) FROM
generate_series(1,200000000) AS i;

SET max_parallel_workers_per_gather=4;
SELECT * FROM t WHERE a < 17500000;
Parallel (default plan): 23730.054 ms
Parallel with patch (default plan): 21614.251 ms

An 8-10% gain compared to the default parallel plan.

I have cleaned up the patch, and I will add it to the September
commitfest.

I am planning to do further testing to identify the optimal batch size
under different workloads. With the above workloads I see similar results
for batch sizes from 4kB up to 16kB (1/4 of the ring size), so in the
attached patch I have kept it at 1/4 of the ring size. We might change
that based on further analysis and testing.
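
To make the batching idea concrete for anyone skimming the thread, below
is a minimal, self-contained sketch in C. It is not the actual patch
(which modifies shm_mq_send_bytes in shm_mq.c); all names are
hypothetical, and where the real code would sleep on and set latches,
this sketch just spins and leaves a comment:

/*
 * Hypothetical sketch of batched writes to a shared ring buffer.
 * The sender copies data locally and publishes the shared write
 * pointer only once per ring_size/4 bytes, instead of per tuple.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

typedef struct ring_queue
{
    _Atomic uint64_t bytes_written;  /* total bytes visible to reader */
    _Atomic uint64_t bytes_read;     /* total bytes consumed by reader */
    size_t           ring_size;      /* capacity of buffer, in bytes */
    char            *buffer;
} ring_queue;

typedef struct batched_sender
{
    ring_queue *mq;
    uint64_t    local_written;   /* bytes copied, published or not */
    uint64_t    unpublished;     /* bytes copied but not yet published */
} batched_sender;

/* Make locally accumulated bytes visible to the reader. */
static void
sender_flush(batched_sender *s)
{
    atomic_store_explicit(&s->mq->bytes_written, s->local_written,
                          memory_order_release);
    s->unpublished = 0;
    /* the real code would also SetLatch() the receiver here */
}

/* Copy nbytes into the ring, publishing only once per batch. */
static void
sender_put(batched_sender *s, const char *data, size_t nbytes)
{
    const size_t batch = s->mq->ring_size / 4;  /* 1/4, as in the mail */

    while (nbytes > 0)
    {
        uint64_t rd = atomic_load_explicit(&s->mq->bytes_read,
                                           memory_order_acquire);
        size_t   used = (size_t) (s->local_written - rd);
        size_t   avail = s->mq->ring_size - used;
        size_t   offset = (size_t) (s->local_written % s->mq->ring_size);
        size_t   chunk;

        if (avail == 0)
        {
            /* Ring full: publish so the reader can drain; the real
             * implementation would then wait on a latch, not spin. */
            sender_flush(s);
            continue;
        }

        chunk = nbytes < avail ? nbytes : avail;
        if (chunk > s->mq->ring_size - offset)
            chunk = s->mq->ring_size - offset;  /* don't wrap mid-memcpy */

        memcpy(s->mq->buffer + offset, data, chunk);
        s->local_written += chunk;
        s->unpublished += chunk;
        data += chunk;
        nbytes -= chunk;

        if (s->unpublished >= batch)
            sender_flush(s);
    }
}

The point of the sketch is simply that the shared write pointer (and the
receiver's latch) is touched once per ring_size/4 bytes rather than once
per tuple, which is where the reduction in communication overhead comes
from.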

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch text/x-patch 10.4 KB
