Re: [POC] Faster processing at Gather node

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [POC] Faster processing at Gather node
Date: 2017-10-23 12:46:59
Message-ID: CAA4eK1+7ETUDAhdfNF7OdEmAhP4_+yG24Yzg2yGz169DM3PwPw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 18, 2017 at 3:09 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> Hi Everyone,
>
> On 2017-05-19 17:25:38 +0530, Rafia Sabih wrote:
>> While analysing the performance of TPC-H queries for the newly developed
>> parallel-operators, viz, parallel index, bitmap heap scan, etc. we noticed
>> that the time taken by gather node is significant. On investigation, as per
>> the current method it copies each tuple to the shared queue and notifies
>> the receiver. Since, this copying is done in shared queue, a lot of locking
>> and latching overhead is there.
>>
>> So, in this POC patch I tried to copy all the tuples in a local queue thus
>> avoiding all the locks and latches. Once, the local queue is filled as per
>> it's capacity, tuples are transferred to the shared queue. Once, all the
>> tuples are transferred the receiver is sent the notification about the same.
>>
>> With this patch I could see significant improvement in performance for
>> simple queries,
>
> I've spent some time looking into this, and I'm not quite convinced this
> is the right approach.
>

As per my understanding, the patch in this thread is dead (not
required) after the patch posted by Rafia in the thread "Effect of
changing the value for PARALLEL_TUPLE_QUEUE_SIZE" [1]. There seem to
be two related problems in this area: the first is the shm queue
communication overhead, and the second is that workers start to stall
when the shm queue gets full. We can observe the first problem in
simple queries that use Gather, and the second in Gather Merge kind of
scenarios. We are trying to resolve both problems with the patch
posted in the other thread. I think there is some similarity with the
patch posted on this thread, but it is different. I have mentioned
something similar upthread as well.

> Let me start by describing where I see the
> current performance problems around gather stemming from.
>
> The observations here are made using
> select * from t where i < 30000000 offset 29999999 - 1;
> with Rafia's data. That avoids slowdowns on the leader due to too many
> rows printed out. Sometimes I'll also use
> SELECT * FROM lineitem WHERE l_suppkey > '5012' OFFSET 1000000000 LIMIT 1;
> on tpch data to show the effects on wider tables.
>
> The precise query doesn't really matter, the observations here are more
> general, I hope.
>
> 1) nodeGather.c re-projects every row from workers. As far as I can tell
> that's now always exactly the same targetlist as it's coming from the
> worker. Projection capability was added in 8538a6307049590 (without
> checking whether it's needed afaict), but I think it in turn often
> obsoleted by 992b5ba30dcafdc222341505b072a6b009b248a7. My
> measurement shows that removing the projection yields quite massive
> speedups in queries like yours, which is not too surprising.
>
> I suspect this just needs a tlist_matches_tupdesc check + an if
> around ExecProject(). And a test, right now tests pass unless
> force_parallel_mode is used even if just commenting out the
> projection unconditionally.
>

+1. I think we should do something to avoid this.
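
To make that concrete, a minimal sketch of the kind of check being
suggested could look like the below (untested and purely illustrative;
gather_tlist_matches_tupdesc() is a made-up helper along the lines of
execScan.c's static tlist_matches_tupdesc(), and the two fragments
would go in ExecInitGather() and ExecGather() respectively):

/*
 * Illustrative sketch only.  In ExecInitGather(), skip building a
 * ProjectionInfo when the Gather node's target list is just a
 * one-to-one copy of the worker's output.
 */
if (gather_tlist_matches_tupdesc(&gatherstate->ps))
    gatherstate->ps.ps_ProjInfo = NULL;
else
    ExecAssignProjectionInfo(&gatherstate->ps, NULL);

/* ... and where ExecGather() currently projects every row: */
slot = gather_getnext(node);
if (TupIsNull(slot))
    return NULL;
if (node->ps.ps_ProjInfo == NULL)
    return slot;            /* worker tuple already has the right shape */
econtext->ecxt_outertuple = slot;
return ExecProject(node->ps.ps_ProjInfo);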

>
> Commenting out the memory barrier - which is NOT CORRECT - improves
> timing:
> before: 4675.626
> after: 4125.587
>
> The correct fix obviously is not to break latch correctness. I think
> the big problem here is that we perform a SetLatch() for every read
> from the latch.
>
> I think we should
> a) introduce a batch variant for receiving like:
>
> extern shm_mq_result shm_mq_receivev(shm_mq_handle *mqh,
> shm_mq_iovec *iov, int *iovcnt,
> bool nowait)
>
> which then only does the SetLatch() at the end. And maybe if it
> wrapped around.
>
> b) Use shm_mq_sendv in tqueue.c by batching up insertions into the
> queue whenever it's not empty when a tuple is ready.
>

I think the patch posted in the other thread has tried to achieve such
batching by using local queues; maybe we should try some other way.
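
To illustrate the kind of sender-side batching being talked about, a
rough sketch could look like the below (this is not from any patch;
the batch size, the static buffer, and the assumption that the reader
can split a combined message back into tuples are all illustrative):

/*
 * Illustrative sketch only: accumulate serialized tuples and push them
 * into the shared queue with a single shm_mq_sendv() call, so the
 * reader's latch is set once per batch rather than once per tuple.  A
 * matching batch-aware read path (e.g. the proposed shm_mq_receivev())
 * would be needed on the leader side; tuple memory lifetime and the
 * final flush at end of scan are ignored here.
 */
#define TQUEUE_BATCH_SIZE 16        /* arbitrary */

static shm_mq_iovec tup_iov[TQUEUE_BATCH_SIZE];
static int  batched_tuples = 0;

static void
tqueue_send_batched(shm_mq_handle *mqh, HeapTuple tuple)
{
    tup_iov[batched_tuples].data = (char *) tuple->t_data;
    tup_iov[batched_tuples].len = tuple->t_len;
    batched_tuples++;

    if (batched_tuples == TQUEUE_BATCH_SIZE)
    {
        /* one shared-queue write and one SetLatch for the whole batch */
        shm_mq_sendv(mqh, tup_iov, batched_tuples, false);
        batched_tuples = 0;
    }
}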

>
> Unfortunately the patch's "local worker queue" concept seems, to me,
> like it's not quite addressing the structural issues, instead opting to
> address them by adding another layer of queuing.
>

That is done to batch the tuples while sending them. Sure, we can do
some of the other things as well, but I think the main advantage comes
from batching the tuples in a smart way while sending them, and once
that is done, we might not need many of the other optimizations.
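
For reference, the local-queue idea can be pictured roughly as below
(my own simplification, not the actual patch; names and the flush
policy are illustrative):

/*
 * Rough illustration of the local-queue idea (not the actual patch):
 * copy tuples into backend-private memory with no locks or latches,
 * and only when that buffer fills up copy its contents into the shared
 * queue, so the leader's latch is set once for many tuples.  How the
 * leader splits the combined message back into tuples is not shown.
 */
typedef struct LocalTupleQueue
{
    char       *buf;            /* backend-private, so no locking needed */
    Size        used;
    Size        capacity;
} LocalTupleQueue;

static void
local_queue_put(LocalTupleQueue *lq, shm_mq_handle *mqh,
                const char *data, Size len)
{
    Assert(len <= lq->capacity);

    if (lq->used + len > lq->capacity)
    {
        /* one shared-queue write (and one SetLatch) for the whole batch */
        shm_mq_send(mqh, lq->used, lq->buf, false);
        lq->used = 0;
    }
    memcpy(lq->buf + lq->used, data, len);
    lq->used += len;
}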

[1] - https://www.postgresql.org/message-id/CAOGQiiNiMhq5Pg3LiYxjfi2B9eAQ_q5YjS%3DfHiBJmbSOF74aBQ%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
