|From:||Robert Haas <robertmhaas(at)gmail(dot)com>|
|To:||Andres Freund <andres(at)anarazel(dot)de>|
|Cc:||Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>|
|Subject:||Re: [POC] Faster processing at Gather node|
On Sat, Nov 4, 2017 at 5:55 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> master: 21436.745, 20978.355, 19918.617
>> patch: 15896.573, 15880.652, 15967.176
>> Median-to-median, that's about a 24% improvement.
With the attached stack of 4 patches, I get: 10811.768 ms, 10743.424
ms, 10632.006 ms, about a 49% improvement median-to-median. Haven't
tried it on hydra or any other test cases yet.
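For anyone checking the arithmetic, here is a minimal sketch of how the
median-to-median figures can be reproduced. The helper names (cmp_double,
median, improvement_pct) are made up for illustration and are not part of
any patch. Feeding it the three master timings against each set of patched
timings should give roughly 24.2% and 48.8%.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator for doubles (hypothetical helper). */
static int
cmp_double(const void *a, const void *b)
{
    double x = *(const double *) a;
    double y = *(const double *) b;

    return (x > y) - (x < y);
}

/* Median of n timings; sorts the array in place. */
static double
median(double *v, size_t n)
{
    assert(n > 0);
    qsort(v, n, sizeof(double), cmp_double);
    return (n % 2) ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}

/* Percentage reduction in runtime, comparing median to median. */
static double
improvement_pct(double *base, double *test, size_t n)
{
    double mb = median(base, n);
    double mt = median(test, n);

    return 100.0 * (mb - mt) / mb;
}
```

Comparing medians rather than means keeps a single noisy run from
skewing the result, which matters with only three samples per
configuration.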
skip-gather-project-v1.patch does what it says on the tin. I still
don't have a test case for this, and I didn't find that it helped very
much, but it would probably help more in a test case with more
columns, and you said this looked like a big bottleneck in your
testing, so here you go.
shm-mq-less-spinlocks-v2.patch is updated from the version I posted
before based on your review comments. I don't think it's really
necessary to mention that the 8-byte atomics have fallbacks here;
whatever needs to be said about that should be said in some central
place that talks about atomics, not in each user individually. I
agree that there might be some further speedups possible by caching
some things in local memory, but I haven't experimented with that.
shm-mq-reduce-receiver-latch-set-v1.patch causes the receiver to only
consume input from the shared queue when the amount of unconsumed
input exceeds 1/4 of the queue size. This caused a large performance
improvement in my testing because it causes the number of times the
latch gets set to drop dramatically. I experimented a bit with
thresholds of 1/8 and 1/2 before settling on 1/4; 1/4 seems to be
enough to capture most of the benefit.
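To illustrate the idea, here is a rough sketch of threshold-based
batching on the receiver side. It is not the code from the patch:
DemoReceiver, demo_init and demo_consume are hypothetical names, and the
sketch assumes one reading of the description above, namely that bytes
read are accumulated locally and only reported to the sender (with a
latch set) once they exceed 1/4 of the queue size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical receiver-side state; not the real shm_mq structures. */
typedef struct DemoReceiver
{
    size_t      ring_size;      /* total queue capacity in bytes */
    size_t      pending;        /* bytes read locally, not yet reported */
    size_t      reported;       /* bytes the sender has been told about */
    unsigned    latch_sets;     /* how many times the sender was woken */
} DemoReceiver;

static void
demo_init(DemoReceiver *r, size_t ring_size)
{
    assert(ring_size > 0);
    r->ring_size = ring_size;
    r->pending = 0;
    r->reported = 0;
    r->latch_sets = 0;
}

/* Record nbytes as read.  Returns true if the sender was notified. */
static bool
demo_consume(DemoReceiver *r, size_t nbytes)
{
    r->pending += nbytes;

    /* Below the 1/4 threshold: defer the notification. */
    if (r->pending <= r->ring_size / 4)
        return false;

    /* Over the threshold: report everything at once. */
    r->reported += r->pending;
    r->pending = 0;
    r->latch_sets++;            /* stands in for SetLatch() on the sender */
    return true;
}
```

With a 64kB queue and 100-byte messages, this sketch notifies the sender
once every 164 messages instead of once per message, which is the kind
of drop in latch sets described above.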
remove-memory-leak-protection-v1.patch removes the memory leak
protection that Tom installed upon discovering that the original
version of tqueue.c leaked memory like crazy. I think that it
shouldn't do that any more, courtesy of
6b65a7fe62e129d5c2b85cd74d6a91d8f7564608. Assuming that's correct, we
can avoid a whole lot of tuple copying in Gather Merge and a much more
modest amount of overhead in Gather. Since my test case exercised
Gather Merge, this bought ~400 ms or so.
Even with all of these patches applied, there's clearly still room for
more optimization, but macOS's "sample" profiler seems to show that
the bottlenecks are largely shifting elsewhere:
Sort by top of stack, same collapsed (when >= 5):
slot_getattr (in postgres) 706
slot_deform_tuple (in postgres) 560
ExecAgg (in postgres) 378
ExecInterpExpr (in postgres) 372
AllocSetAlloc (in postgres) 319
read (in libsystem_kernel.dylib) 303
heap_compare_slots (in postgres) 296
combine_aggregates (in postgres) 273
shm_mq_receive_bytes (in postgres) 272
I'm probably not super-excited about spending too much more time
trying to make the _platform_memmove time (only 20% or so of which
seems to be due to the shm_mq stuff) or the shm_mq_receive_bytes time
go down until, say, somebody JITs slot_getattr and slot_deform_tuple.
One thing that might be worth doing is hammering on the AllocSetAlloc
time. I think that's largely caused by allocating space for heap
tuples and then freeing them and allocating space for new heap tuples.
Gather/Gather Merge are guilty of that, but I think there may be other
places in the executor with the same issue. Maybe we could have
fixed-size buffers for small tuples that just get reused and only
palloc for large tuples (cf. SLAB_SLOT_SIZE).
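A minimal sketch of what that could look like, using plain malloc/free
in place of palloc/pfree so that it stands alone. Everything here is
hypothetical: DemoTupleBuf, the demo_buf_* functions, and the
DEMO_SLOT_SIZE cutoff of 1024 bytes are illustration only, not existing
executor code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cutoff, in the spirit of SLAB_SLOT_SIZE. */
#define DEMO_SLOT_SIZE 1024

typedef struct DemoTupleBuf
{
    char        fixed[DEMO_SLOT_SIZE];  /* reused for every small tuple */
    char       *big;            /* allocation for an oversized tuple, or NULL */
} DemoTupleBuf;

static void
demo_buf_init(DemoTupleBuf *b)
{
    b->big = NULL;
}

/* Release any oversized allocation; the fixed buffer needs no cleanup. */
static void
demo_buf_reset(DemoTupleBuf *b)
{
    free(b->big);
    b->big = NULL;
}

/*
 * Copy a tuple into the buffer.  The returned pointer is valid until the
 * next store or reset.  Returns NULL if a large allocation fails.
 */
static char *
demo_buf_store(DemoTupleBuf *b, const void *data, size_t len)
{
    char       *dst;

    assert(b != NULL && data != NULL);
    demo_buf_reset(b);

    if (len <= DEMO_SLOT_SIZE)
        dst = b->fixed;         /* small tuple: no allocator call at all */
    else
    {
        b->big = malloc(len);   /* large tuple: fall back to the allocator */
        if (b->big == NULL)
            return NULL;
        dst = b->big;
    }
    memcpy(dst, data, len);
    return dst;
}
```

In this sketch the common case of a stream of small tuples never touches
the allocator, and only tuples over the cutoff pay for an allocate/free
cycle.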