Re: Gather performance analysis

From: Zhihong Yu <zyu(at)yugabyte(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Gather performance analysis
Date: 2021-08-28 11:33:59
Message-ID: CALNJ-vRdkFLBPJYf-x6vW7RPrae2oviR+tGJVenTAUQk3R+jDg@mail.gmail.com
Lists: pgsql-hackers

On Sat, Aug 28, 2021 at 4:29 AM Zhihong Yu <zyu(at)yugabyte(dot)com> wrote:

>
>
> On Sat, Aug 28, 2021 at 12:11 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com>
> wrote:
>
>> On Tue, Aug 24, 2021 at 8:48 PM Robert Haas <robertmhaas(at)gmail(dot)com>
>> wrote:
>>
>>> On Fri, Aug 6, 2021 at 4:31 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com>
>>> wrote:
>>> > Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
>>> > 1) Non-parallel (default)
>>> > Execution Time: 31627.492 ms
>>> >
>>> > 2) Parallel with 4 workers (force by setting parallel_tuple_cost to 0)
>>> > Execution Time: 37498.672 ms
>>> >
>>> > 3) Same as above (2) but with the patch.
>>> > Execution Time: 23649.287 ms
>>>
>>> This strikes me as an amazingly good result. I guess before seeing
>>> these results, I would have said that you can't reasonably expect
>>> parallel query to win on a query like this because there isn't enough
>>> for the workers to do. It's not like they are spending time evaluating
>>> filter conditions or anything like that - they're just fetching tuples
>>> off of disk pages and sticking them into a queue. And it's unclear to
>>> me why it should be better to have a bunch of processes doing that
>>> instead of just one. I would have thought, looking at just (1) and
>>> (2), that parallelism gained nothing and communication overhead lost 6
>>> seconds.
>>>
>>> But what this suggests is that parallelism gained at least 8 seconds,
>>> and communication overhead lost at least 14 seconds. In fact...
>>>
>>
>> Right, good observation.
>>
>>
>>> > - If I apply both Experiment#1 and Experiment#2 patches together then,
>>> > we can further reduce the execution time to 20963.539 ms (with 4
>>> > workers and 4MB tuple queue size)
>>>
>>> ...this suggests that parallelism actually gained at least 10-11
>>> seconds, and the communication overhead lost at least 15-16 seconds.
>>>
>>
>> Yes
>>
>>
>>> If that's accurate, it's pretty crazy. We might need to drastically
>>> reduce the value of parallel_tuple_cost if these results hold up and
>>> this patch gets committed.
>>>
>>
>> In one of my experiments [Test1] I noticed that even on head the forced
>> parallel plan is significantly faster than the non-parallel plan, and with
>> the patch it is even better. The point is that while there may already be
>> cases where forced parallel plans are faster, that alone does not tell us
>> whether we can reduce parallel_tuple_cost. With the patch, however, the
>> parallel tuple queue is definitely faster than what we have now, so I
>> agree we should consider reducing parallel_tuple_cost after this patch.
>>
>> Additionally, I've done some more experiments with artificial workloads,
>> as well as workloads where the parallel plan is selected by default, and in
>> all cases I've seen a significant improvement. The gain is directly
>> proportional to the load on the tuple queue, as expected.
>>
>> Test1: (Workers return all tuples, but only a few tuples are returned to
>> the client)
>> ----------------------------------------------------
>> INSERT INTO t SELECT i%10, repeat('a', 200) from
>> generate_series(1,200000000) as i;
>> set max_parallel_workers_per_gather=4;
>>
>> Target Query: SELECT random() FROM t GROUP BY a;
>>
>> Non-parallel (default plan): 77170.421 ms
>> Parallel (parallel_tuple_cost=0): 53794.324 ms
>> Parallel with patch (parallel_tuple_cost=0): 42567.850 ms
>>
>> 20% gain compared to forced parallel, 45% gain compared to the default plan.
>>
>> Test2: (Parallel case with default parallel_tuple_cost)
>> ----------------------------------------------
>> INSERT INTO t SELECT i, repeat('a', 200) from
>> generate_series(1,200000000) as i;
>>
>> set max_parallel_workers_per_gather=4;
>> SELECT * from t WHERE a < 17500000;
>> Parallel(default plan): 23730.054 ms
>> Parallel with patch (default plan): 21614.251 ms
>>
>> 8 to 10% gain compared to the default parallel plan.
>>
>> I have done cleanup in the patch and I will add this to the September
>> commitfest.
>>
>> I am planning to do further testing to identify the optimal batch size for
>> different workloads. With the above workloads I am seeing similar results
>> with batch sizes from 4k to 16k (1/4 of the ring size), so in the attached
>> patch I have kept it at 1/4 of the ring size. We might change that based on
>> more analysis and testing.
>>
>> --
>> Regards,
>> Dilip Kumar
>> EnterpriseDB: http://www.enterprisedb.com
>>
> Hi,
> Some minor comments.
> For shm_mq.c, existing comment says:
>
> * mqh_partial_bytes, mqh_expected_bytes, and mqh_length_word_complete
>
> + Size mqh_send_pending;
> bool mqh_length_word_complete;
> bool mqh_counterparty_attached;
>
> I wonder if mqh_send_pending should be declared
> after mqh_length_word_complete - this way, the order of fields matches the
> order of explanation for the fields.
>
> + if (mqh->mqh_send_pending > mq->mq_ring_size / 4 || force_flush)
>
> The above can be written as:
>
> + if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 1))
>
> so that when force_flush is true, the other condition is not evaluated.
>
> Cheers
>

There was a typo in the suggested code above. It should be:

+ if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 2))
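
For reference, here is a minimal, self-contained sketch (not the actual patch)
of how this check might sit in the batching logic being discussed. Only the
names mqh_send_pending, mq_ring_size, and force_flush come from the thread;
the struct layout and the flush helper below are simplified placeholders:

#include <stdbool.h>
#include <stddef.h>

typedef struct sketch_mq
{
    size_t  mq_ring_size;          /* total size of the shared ring buffer */
    size_t  mq_bytes_written;      /* receiver-visible write position */
} sketch_mq;

typedef struct sketch_mq_handle
{
    sketch_mq *mqh_queue;
    size_t     mqh_send_pending;   /* bytes written but not yet published */
} sketch_mq_handle;

static void
sketch_maybe_flush(sketch_mq_handle *mqh, bool force_flush)
{
    sketch_mq *mq = mqh->mqh_queue;

    /*
     * Test force_flush first so that, thanks to short-circuit evaluation,
     * the threshold comparison is skipped when the flush is unconditional;
     * ">> 2" is the corrected quarter-of-ring-size cutoff.
     */
    if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 2))
    {
        mq->mq_bytes_written += mqh->mqh_send_pending;  /* publish to reader */
        mqh->mqh_send_pending = 0;
    }
}

In the real shm_mq the publish step also involves a memory barrier and setting
the receiver's latch, which is exactly the per-send overhead that batching the
pending bytes is meant to amortize.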

Cheers
