Re: [POC] Faster processing at Gather node

From: Andres Freund <andres(at)anarazel(dot)de>
To: Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [POC] Faster processing at Gather node
Date: 2017-10-17 21:39:57
Message-ID: 20171017213957.hylzfplumptbfvro@alap3.anarazel.de

Hi Everyone,

On 2017-05-19 17:25:38 +0530, Rafia Sabih wrote:
> While analysing the performance of TPC-H queries for the newly developed
> parallel-operators, viz, parallel index, bitmap heap scan, etc. we noticed
> that the time taken by gather node is significant. On investigation, as per
> the current method it copies each tuple to the shared queue and notifies
> the receiver. Since, this copying is done in shared queue, a lot of locking
> and latching overhead is there.
>
> So, in this POC patch I tried to copy all the tuples in a local queue thus
> avoiding all the locks and latches. Once, the local queue is filled as per
> it's capacity, tuples are transferred to the shared queue. Once, all the
> tuples are transferred the receiver is sent the notification about the same.
>
> With this patch I could see significant improvement in performance for
> simple queries,

I've spent some time looking into this, and I'm not quite convinced this
is the right approach. Let me start by describing where I see the
current performance problems around gather stemming from.

The observations here are made using
select * from t where i < 30000000 offset 29999999 - 1;
with Rafia's data. That avoids slowdowns on the leader due to too many
rows printed out. Sometimes I'll also use
SELECT * FROM lineitem WHERE l_suppkey > '5012' OFFSET 1000000000 LIMIT 1;
on tpch data to show the effects on wider tables.

The precise query doesn't really matter; the observations here are more
general, I hope.

1) nodeGather.c re-projects every row from workers. As far as I can tell,
the targetlist is now always exactly the same as the one coming from the
worker. Projection capability was added in 8538a6307049590 (without
checking whether it's needed, afaict), but I think it was in turn often
obsoleted by 992b5ba30dcafdc222341505b072a6b009b248a7. My
measurements show that removing the projection yields quite massive
speedups in queries like yours, which is not too surprising.

I suspect this just needs a tlist_matches_tupdesc check + an if
around ExecProject(). And a test: right now the tests pass, unless
force_parallel_mode is used, even with the projection commented out
unconditionally.
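
To make that a bit more concrete, here's a rough sketch of the kind of
change I mean - not a patch; "proj_needed" is a made-up field name, and
tlist_matches_tupdesc() is currently static in execScan.c, so it'd have
to be exposed (or duplicated):

/*
 * ExecInitGather(), sketch: remember whether the Gather targetlist
 * differs from what the workers already produce.
 */
gatherstate->proj_needed =
    !tlist_matches_tupdesc(&gatherstate->ps,
                           node->plan.targetlist,
                           OUTER_VAR,
                           gatherstate->funnel_slot->tts_tupleDescriptor);

/*
 * ExecGather(), sketch: only project when that's actually needed.
 */
slot = gather_getnext(node);
if (TupIsNull(slot))
    return NULL;

if (!node->proj_needed)
    return slot;            /* worker tuple already has the right shape */

econtext->ecxt_outertuple = slot;
return ExecProject(node->ps.ps_ProjInfo);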

before commenting out nodeGather projection:

rafia time: 8283.583
rafia profile:
+ 30.62% postgres postgres [.] shm_mq_receive
+ 18.49% postgres postgres [.] s_lock
+ 10.08% postgres postgres [.] SetLatch
- 7.02% postgres postgres [.] slot_deform_tuple
- slot_deform_tuple
- 88.01% slot_getsomeattrs
ExecInterpExpr
ExecGather
ExecLimit

lineitem time: 8448.468
lineitem profile:
+ 24.63% postgres postgres [.] slot_deform_tuple
+ 24.43% postgres postgres [.] shm_mq_receive
+ 17.36% postgres postgres [.] ExecInterpExpr
+ 7.41% postgres postgres [.] s_lock
+ 5.73% postgres postgres [.] SetLatch

after:
rafia time: 6660.224
rafia profile:
+ 36.77% postgres postgres [.] shm_mq_receive
+ 19.33% postgres postgres [.] s_lock
+ 13.14% postgres postgres [.] SetLatch
+ 9.22% postgres postgres [.] AllocSetReset
+ 4.27% postgres postgres [.] ExecGather
+ 2.79% postgres postgres [.] AllocSetAlloc

lineitem time: 4507.416
lineitem profile:
+ 34.81% postgres postgres [.] shm_mq_receive
+ 15.45% postgres postgres [.] s_lock
+ 13.38% postgres postgres [.] SetLatch
+ 9.87% postgres postgres [.] AllocSetReset
+ 5.82% postgres postgres [.] ExecGather

As is quite clearly visible, avoiding the projection yields some major
speedups.

The following analysis has the projection removed.

2) The spinlocks on both the sending and the receiving side are quite hot:

rafia query leader:
+ 36.16% postgres postgres [.] shm_mq_receive
+ 19.49% postgres postgres [.] s_lock
+ 13.24% postgres postgres [.] SetLatch

The presence of s_lock shows us that we're clearly often contending
on spinlocks, given that it's the slow path for SpinLockAcquire(). In
shm_mq_receive the instruction profile shows:

│ SpinLockAcquire(&mq->mq_mutex);
│1 5ab: mov $0xa9b580,%ecx
│ mov $0x4a4,%edx
│ mov $0xa9b538,%esi
│ mov %r15,%rdi
│ → callq s_lock
│ ↑ jmpq 2a1
│ tas():
│1 5c7: mov $0x1,%eax
32.83 │ lock xchg %al,(%r15)
│ shm_mq_inc_bytes_read():
│ SpinLockAcquire(&mq->mq_mutex);
and
0.01 │ pop %r15
0.04 │ ← retq
│ nop
│ tas():
│1 338: mov $0x1,%eax
17.59 │ lock xchg %al,(%r15)
│ shm_mq_get_bytes_written():
│ SpinLockAcquire(&mq->mq_mutex);
0.05 │ test %al,%al
0.01 │ ↓ jne 448
│ v = mq->mq_bytes_written;

rafia query worker:
+ 33.00% postgres postgres [.] shm_mq_send_bytes
+ 9.90% postgres postgres [.] s_lock
+ 7.74% postgres postgres [.] shm_mq_send
+ 5.40% postgres postgres [.] ExecInterpExpr
+ 5.34% postgres postgres [.] SetLatch

Again, we have strong indicators for a lot of spinlock
contention. The instruction profiles show the same:

shm_mq_send_bytes:
│ shm_mq_inc_bytes_written(mq, MAXALIGN(sendnow));
│ and $0xfffffffffffffff8,%r15
│ tas():
0.08 │ mov %ebp,%eax
31.07 │ lock xchg %al,(%r14)
│ shm_mq_inc_bytes_written():
│ * Increment the number of bytes written.
│ */

and

│3 98: cmp %r13,%rbx
0.70 │ ↓ jae 430
│ tas():
0.12 │1 a1: mov %ebp,%eax
28.53 │ lock xchg %al,(%r14)
│ shm_mq_get_bytes_read():
│ SpinLockAcquire(&mq->mq_mutex);
│ test %al,%al
│ ↓ jne 298
│ v = mq->mq_bytes_read;

shm_mq_send:
│ tas():
50.73 │ lock xchg %al,0x0(%r13)
│ shm_mq_notify_receiver():
│ shm_mq_notify_receiver(volatile shm_mq *mq)
│ {
│ PGPROC *receiver;
│ bool detached;

My interpretation here is that it's not just the effect of the
locking causing the slowdown, but to a significant degree the effect
of the implied bus lock.

To test that theory, here are the timings for
1) spinlocks present
time: 6593.045
2) spinlock acquisition replaced by *full* memory barriers, which on x86 is a lock; addl $0,0(%%rsp)
time: 5159.306
3) spinlocks replaced by read/write barriers as appropriate.
time: 4687.610

By my understanding of shm_mq.c's logic, something like 3) ought to
be doable in a correct manner. There should, in normal
circumstances, only be one end modifying each of the protected
variables. Doing so instead of using full-blown atomics with locked
instructions is very likely to yield considerably better performance.
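
To sketch the direction for 3) - just an illustration, assuming
mq_bytes_read/mq_bytes_written get turned into pg_atomic_uint64 so that
unlocked 64bit reads/writes are available on every platform:

/* sender side: only the sender ever modifies mq_bytes_written */
static void
shm_mq_inc_bytes_written(shm_mq *mq, Size n)
{
    /*
     * Make sure the payload copied into the ring is visible before the
     * counter update that publishes it...
     */
    pg_write_barrier();

    /* ...then a plain, unlocked write is enough */
    pg_atomic_write_u64(&mq->mq_bytes_written,
                        pg_atomic_read_u64(&mq->mq_bytes_written) + n);
}

/* receiver side */
static uint64
shm_mq_get_bytes_written(shm_mq *mq)
{
    uint64      v = pg_atomic_read_u64(&mq->mq_bytes_written);

    /* pairs with the sender's write barrier, before we read ring contents */
    pg_read_barrier();

    return v;
}

That is, no lock xchg and no implied bus lock on either side in the
common path.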

The top-level profile after 3 is:

leader:
+ 25.89% postgres postgres [.] shm_mq_receive
+ 24.78% postgres postgres [.] SetLatch
+ 14.63% postgres postgres [.] AllocSetReset
+ 7.31% postgres postgres [.] ExecGather

worker:
+ 14.02% postgres postgres [.] ExecInterpExpr
+ 11.63% postgres postgres [.] shm_mq_send_bytes
+ 11.25% postgres postgres [.] heap_getnext
+ 6.78% postgres postgres [.] SetLatch
+ 6.38% postgres postgres [.] slot_deform_tuple

still a lot of cycles in the queue code, but proportionally less.

4) Modulo computations in shm_mq are expensive:

│ shm_mq_send_bytes():
│ Size offset = mq->mq_bytes_written % (uint64) ringsize;
0.12 │1 70: xor %edx,%edx
│ Size sendnow = Min(available, ringsize - offset);
│ mov %r12,%rsi
│ Size offset = mq->mq_bytes_written % (uint64) ringsize;
43.75 │ div %r12
│ memcpy(&mq->mq_ring[mq->mq_ring_offset + offset],
7.23 │ movzbl 0x31(%r15),%eax

│ shm_mq_receive_bytes():
│ used = written - mq->mq_bytes_read;
1.17 │ sub %rax,%rcx
│ offset = mq->mq_bytes_read % (uint64) ringsize;
18.49 │ div %rbp
│ mov %rdx,%rdi

That we end up with a full-blown div makes sense - the compiler can't
know anything about ringsize, therefore it can't optimize it to a mask.
I think we should force the size of the ringbuffer to be a power of
two, and use a mask instead of a size for the buffer.
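
Roughly like this (sketch; mq_ring_mask is a made-up field, and the
rounding would happen once at shm_mq_create() time):

/* shm_mq_create(): force the usable ring size to a power of two */
Size        ringsize = 1;

while (ringsize * 2 <= available)   /* "available": bytes left for the ring */
    ringsize *= 2;
mq->mq_ring_size = ringsize;
mq->mq_ring_mask = ringsize - 1;    /* hypothetical new field */

/* shm_mq_send_bytes() / shm_mq_receive_bytes(): the expensive */
offset = mq->mq_bytes_written % (uint64) ringsize;
/* then becomes a cheap */
offset = mq->mq_bytes_written & mq->mq_ring_mask;

Rounding down can waste close to half of the requested queue size, so the
default queue sizes would presumably want to be chosen as a power of two
plus header to begin with.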

5) There are a *lot* of latch interactions. The biggest issue actually is
the memory barrier implied by a SetLatch(); waiting for the latch
barely shows up.

from 3) above:

leader:
+ 25.89% postgres postgres [.] shm_mq_receive
+ 24.78% postgres postgres [.] SetLatch
+ 14.63% postgres postgres [.] AllocSetReset
+ 7.31% postgres postgres [.] ExecGather

│ 0000000000781b10 <SetLatch>:
│ SetLatch():
│ /*
│ * The memory barrier has to be placed here to ensure that any flag
│ * variables possibly changed by this process have been flushed to main
│ * memory, before we check/set is_set.
│ */
│ pg_memory_barrier();
77.43 │ lock addl $0x0,(%rsp)

│ /* Quick exit if already set */
│ if (latch->is_set)
0.12 │ mov (%rdi),%eax

Commenting out the memory barrier - which is NOT CORRECT - improves
timing:
before: 4675.626
after: 4125.587

The correct fix obviously is not to break latch correctness. I think
the big problem here is that we perform a SetLatch() for every read
from the queue.

I think we should
a) introduce a batch variant for receiving like:

extern shm_mq_result shm_mq_receivev(shm_mq_handle *mqh,
                                     shm_mq_iovec *iov, int *iovcnt,
                                     bool nowait)

which then only does the SetLatch() at the end. And maybe if it
wrapped around.

b) Use shm_mq_sendv in tqueue.c by batching up insertions into the
queue whenever it's not empty when a tuple is ready.

I've not benchmarked this, but I'm pretty certain that the benefit
isn't just going to be the reduced cost of SetLatch() calls, but also
increased performance due to fewer context switches.
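
To make b) slightly more concrete, here's a very rough sketch of the
sending side. Everything here is hypothetical: the names are invented,
the trigger is a fixed batch size rather than the "queue not empty"
heuristic, error handling and flushing at end-of-scan are omitted, and
since shm_mq_sendv() concatenates the iovec into a single message, each
tuple gets a length prefix so the receiver can split the batch up again:

#define TQUEUE_BATCH_TUPLES 16      /* arbitrary */

static shm_mq_iovec pending[TQUEUE_BATCH_TUPLES * 2];
static uint32       pending_len[TQUEUE_BATCH_TUPLES];
static int          npending = 0;

static void
tqueue_flush(shm_mq_handle *mqh)
{
    if (npending > 0)
    {
        /* one send, and thus one receiver wakeup, instead of one per tuple */
        shm_mq_sendv(mqh, pending, npending * 2, false);
        npending = 0;
    }
}

static void
tqueue_add_tuple(shm_mq_handle *mqh, HeapTuple tuple)
{
    /* tuples must stay allocated until the batch is flushed */
    pending_len[npending] = tuple->t_len;
    pending[npending * 2].data = (char *) &pending_len[npending];
    pending[npending * 2].len = sizeof(uint32);
    pending[npending * 2 + 1].data = (char *) tuple->t_data;
    pending[npending * 2 + 1].len = tuple->t_len;
    npending++;

    if (npending == TQUEUE_BATCH_TUPLES)
        tqueue_flush(mqh);
}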

6) I've observed, using strace, debug outputs with timings, and top with
a short interval, that quite often only one backend has sufficient
work, while other backends are largely idle.

I think the problem here is that the "anti round robin" provisions from
bc7fcab5e36b9597857, while much better than the previous state, have
swung a bit too far in the other direction. Especially if we were
to introduce batching as I suggest in 5), but even without it, this
leads to back-and-forth on half-empty queues between the
gatherstate->nextreader backend and the leader.

I'm not 100% certain what the right fix here is.

One fairly drastic solution would be to move away from a
single-producer-single-consumer style, per-worker queue to a global
queue. That'd avoid fairness issues between the individual workers,
at the price of potentially added contention. One disadvantage is that
such a combined queue approach is not easily applicable to gather
merge.

One less drastic approach would be to try to drain the queue fully
in one batch, and then move to the next queue. That'd avoid
triggering a small wakeup for each individual tuple, as one
currently would get without the 'stickiness'.
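
As a sketch of that control flow in gather_readnext() - purely
hypothetical, since TupleQueueReaderNextBatch() would be a new tqueue.c
wrapper around the shm_mq_receivev() proposed in 5a), and all the
detach/error handling is left out:

TupleQueueReader *reader = gatherstate->reader[gatherstate->nextreader];
HeapTuple   tuples[64];     /* arbitrary batch size */
int         ntuples;

/* nowait = true: take whatever this worker's queue holds right now */
ntuples = TupleQueueReaderNextBatch(reader, true, tuples, lengthof(tuples));

if (ntuples > 0)
    return tuples[0];       /* in a real implementation the batch would live
                             * in the GatherState, so the remaining tuples
                             * can be returned on later calls without
                             * touching the shared queue again */

/* the queue really is empty - only now rotate to the next worker */
gatherstate->nextreader =
    (gatherstate->nextreader + 1) % gatherstate->nreaders;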

It might also be a good idea to use a more directed form of wakeups,
e.g. using a per-worker latch + a wait event set, to avoid iterating
over all workers.

Unfortunately the patch's "local worker queue" concept seems, to me,
like it's not quite addressing the structural issues, instead opting to
address them by adding another layer of queuing. I suspect that if we
went for the above solutions, there'd be only a very small benefit in
implementing such per-worker local queues.

Greetings,

Andres Freund
