Re: [POC] Faster processing at Gather node

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [POC] Faster processing at Gather node
Date: 2017-11-05 14:22:31
Message-ID: CA+TgmoZ+b6xeiwrsTOA6_9qFtbsx4Vyz-GsjuaPxEqYhZLK+4A@mail.gmail.com
Lists: pgsql-hackers

On Sun, Nov 5, 2017 at 2:24 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> shm-mq-reduce-receiver-latch-set-v1.patch causes the receiver to only
>> consume input from the shared queue when the amount of unconsumed
>> input exceeds 1/4 of the queue size. This caused a large performance
>> improvement in my testing because it causes the number of times the
>> latch gets set to drop dramatically. I experimented a bit with
>> thresholds of 1/8 and 1/2 before settling on 1/4; 1/4 seems to be
>> enough to capture most of the benefit.
>
> Hm. Is consuming the relevant part, or notifying the sender about it? I
> suspect most of the benefit can be captured by updating bytes read (and
> similarly on the other side w/ bytes written), but not setting the latch
> unless thresholds are reached. The advantage of updating the value,
> even without notifying the other side, is that in the common case that
> the other side gets around to checking the queue without having blocked,
> it'll see the updated value. If that works, that'd address the issue
> that we might wait unnecessarily in a number of common cases.
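
If I'm reading you right, on the receiver side that would amount to
something like the sketch below (made-up names, and ignoring the
spinlock/barrier handling the real queue has to worry about):

/* Sketch only: not the actual shm_mq fields or API. */
typedef struct
{
    Size        ring_size;           /* size of the ring buffer */
    Size        bytes_read;          /* read position, visible to the sender */
    Size        bytes_read_notified; /* read position when we last set the latch */
    Latch      *sender_latch;        /* the sender's process latch */
} toy_mq;

static void
toy_mq_advance_read(toy_mq *mq, Size nbytes)
{
    /* Always publish the new read position, so a sender that polls the
     * queue without sleeping sees the freed space right away. */
    mq->bytes_read += nbytes;

    /* But only wake the sender once we've freed up a decent chunk. */
    if (mq->bytes_read - mq->bytes_read_notified >= mq->ring_size / 4)
    {
        mq->bytes_read_notified = mq->bytes_read;
        SetLatch(mq->sender_latch);
    }
}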

I think it's mostly notifying the sender. Sending SIGUSR1 over and
over again isn't free, and it shows up in profiling. I thought about
what you're proposing here, but it seemed more complicated to
implement, and I'm not sure that there would be any benefit. The
reason is that, with these patches applied, even a radical
expansion of the queue size doesn't produce much incremental
performance benefit, at least in the test case I was using. I can
increase the size of the tuple queues 10x or 100x and it really
doesn't help very much. And consuming sooner (but sometimes without
notifying) seems very similar to making the queue slightly bigger.
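
To put some color on why the latch traffic matters: for a latch owned
by another process, SetLatch() on Unix boils down to roughly this
(heavily simplified sketch, not the actual latch.c code; the real thing
also has memory barriers and a self-pipe path for the own-latch case):

/* Sketch only: simplified view of setting another backend's latch. */
static void
set_latch_sketch(volatile Latch *latch)
{
    if (latch->is_set)
        return;                         /* a wakeup is already pending */

    latch->is_set = true;

    /* Waking another backend means a kill() syscall here, plus a signal
     * handler run on the other side; that's the part that isn't free
     * when it happens very often. */
    if (latch->owner_pid != MyProcPid)
        kill(latch->owner_pid, SIGUSR1);
}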

Also, what I see in general is that the CPU usage on the leader goes
to 100% but the workers are only maybe 20% saturated. Making the
leader work any harder than absolutely necessary therefore seems
like it's probably counterproductive. I may be wrong, but it looks to
me like most of the remaining overhead comes from (1) the
synchronization overhead associated with memory barriers and (2)
backend-private work that isn't as cheap as would be ideal - e.g.
palloc overhead.

> Interesting. Here it's
> + 8.79% postgres postgres [.] ExecAgg
> + 6.52% postgres postgres [.] slot_deform_tuple
> + 5.65% postgres postgres [.] slot_getattr
> + 4.59% postgres postgres [.] shm_mq_send_bytes
> + 3.66% postgres postgres [.] ExecInterpExpr
> + 3.44% postgres postgres [.] AllocSetAlloc
> + 3.08% postgres postgres [.] heap_fill_tuple
> + 2.34% postgres postgres [.] heap_getnext
> + 2.25% postgres postgres [.] finalize_aggregates
> + 2.08% postgres libc-2.24.so [.] __memmove_avx_unaligned_erms
> + 2.05% postgres postgres [.] heap_compare_slots
> + 1.99% postgres postgres [.] execTuplesMatch
> + 1.83% postgres postgres [.] ExecStoreTuple
> + 1.83% postgres postgres [.] shm_mq_receive
> + 1.81% postgres postgres [.] ExecScan

More or less the same functions, somewhat different order.

>> I'm probably not super-excited about spending too much more time
>> trying to make the _platform_memmove time (only 20% or so of which
>> seems to be due to the shm_mq stuff) or the shm_mq_receive_bytes time
>> go down until, say, somebody JIT's slot_getattr and slot_deform_tuple.
>> :-)
>
> Hm, let's say somebody were working on something like that. In that case
> the benefits for this precise plan wouldn't yet be that big because a
> good chunk of slot_getattr calls come from execTuplesMatch() which
> doesn't really provide enough context to do JITing (when used for
> hashaggs, there is more context, so it's JITed). Similarly gather merge's
> heap_compare_slots() doesn't provide such context.
>
> It's ~9% currently, largely due to the faster aggregate
> invocation. But the big benefit here would be all the deforming and the
> comparisons...

I'm not sure I follow you here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
