Re: Reduce ProcArrayLock contention

From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reduce ProcArrayLock contention
Date: 2015-07-24 10:56:11
Message-ID: CABOikdM6oyr25AkAyVhhpC1vO7amwbS3rjdZj3tjsGS7L-n6xQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jun 29, 2015 at 8:57 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:

>
>
> pgbench setup
> ------------------------
> scale factor - 300
> Data is on magnetic disk and WAL on ssd.
> pgbench -M prepared tpc-b
>
> Head : commit 51d0fe5d
> Patch -1 : group_xid_clearing_at_trans_end_rel_v1
>
>
> Client Count/TPS      1      8     16     32     64    128
> HEAD                814   6092  10899  19926  23636  17812
> Patch-1            1086   6483  11093  19908  31220  28237
>
> The graph for the data is attached.
>
>
Numbers look impressive and definitely show that the idea is worth
pursuing. I tried the patch on my laptop. Unfortunately, at least for 4 and 8
clients, I did not see any improvement. In fact, averages over 2 runs
showed a slight 2-4% decline in tps. Having said that, there is no
reason to disbelieve your numbers, and on much more powerful machines we
might well see the gains.

BTW I ran the tests with: pgbench -s 10 -c 4 -T 300

Points about performance data
> ---------------------------------------------
> 1. Gives good performance improvement at or greater than 64 clients
> and gives somewhat moderate improvement at lower client counts. The
> reason is that the contention around ProcArrayLock is mainly
> seen at higher client counts. I have checked that at higher client counts
> it started behaving lockless (which means performance with the patch is
> equivalent to what we get if we just comment out ProcArrayLock in
> ProcArrayEndTransaction()).
>

Well, I am not entirely sure that's the correct way of looking at it. Sure,
you would see less contention on ProcArrayLock, because far fewer backends
are trying to acquire it. But those that don't get the lock will sleep, and
hence the contention is moved somewhere else, at least partially.

> 2. There is some noise in this data (at 1 client count, I don't expect
> much difference).
> 3. I have done similar tests on power-8 m/c and found similar gains.
>

As I said, I'm not seeing benefits on my laptop (MacBook Pro, quad core,
SSD). But then I ran with a much lower scale factor and far fewer
clients.

> 4. The gains are visible when the data fits in shared_buffers as for other
> workloads I/O starts dominating.
>

That seems perfectly expected.

> 5. I have seen that the effect of the patch is much more visible if we keep
> autovacuum = off (do manual vacuum after each run) and keep
> wal_writer_delay at a lower value (say 20ms).
>

Do you know why that happens? Is it because the contention moves somewhere
else with autovacuum on?

Regarding the design itself, I have an idea: maybe we can create a
general-purpose infrastructure to use this technique. If it's useful here,
I'm sure there are other places where it can be applied with similar
effect.

For example, how about adding an API such as LWLockDispatchWork(lock, mode,
function_ptr, data_ptr)? Here data_ptr points to somewhere in shared
memory that function_ptr can work on once the lock is available. If the
lock is available in the requested mode, then function_ptr is executed
with the given data_ptr and the call returns. If the lock is not
available, then the work is dispatched to some queue (tracked on a per-lock
basis?) and the process goes to sleep. Whenever the lock becomes available
in the requested mode, the work is executed by some other backend and the
primary process is woken up. This will most likely happen in the
LWLockRelease() path, when the last holder is about to give up the lock so
that it becomes available in the requested mode.

There is a lot of handwaving here, and I'm not sure the LWLock
infrastructure permits us to add something like this easily, but I thought
I would put the idea forward anyway. In fact, I remember trying something
of this sort a long time back, but can't recollect why I gave up on it.
Maybe I did not see much benefit in the whole approach of clubbing
work-pieces and doing them in a single process. But then I probably did not
have access to machines powerful enough to correctly measure the benefits.
Hence I'm not willing to give up on the idea, especially given your test
results.

BTW, maybe LWLockDispatchWork() makes sense only for EXCLUSIVE locks,
because we tend to read from shared memory and populate local structures in
READ mode, and that can only happen in the primary backend itself.
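
To make it a bit more concrete, here is a minimal sketch of what such an
API might look like. This is purely illustrative: LWLockDispatchWork() does
not exist today, and the callback type, names and the
ProcArrayEndTransaction()-style usage below are my own assumptions, not
anything taken from your patch.

    /*
     * Hypothetical API sketch -- names and signatures are illustrative only.
     */
    typedef void (*LWLockWorkFunction) (void *data);

    /*
     * Run func(data) under 'lock' held in 'mode'.  If the lock is free, the
     * work runs immediately in this backend.  Otherwise the (func, data)
     * pair is queued on the lock and the caller sleeps; the backend that
     * releases the lock last executes the queued work before fully giving
     * the lock up, and then wakes the caller.
     */
    extern void LWLockDispatchWork(LWLock *lock, LWLockMode mode,
                                   LWLockWorkFunction func, void *data);

    /*
     * Sketch of how ProcArrayEndTransaction() might use it: the callback
     * only touches shared memory (our own PGPROC entry), so any backend can
     * run it on our behalf while ProcArrayLock is held exclusively.
     */
    static void
    clear_xid_callback(void *arg)
    {
        PGPROC     *proc = (PGPROC *) arg;

        /*
         * Reset the xid/xmin advertised by 'proc' here, just as the current
         * code does while holding ProcArrayLock exclusively.
         */
        (void) proc;
    }

    static void
    end_transaction_sketch(void)
    {
        LWLockDispatchWork(ProcArrayLock, LW_EXCLUSIVE,
                           clear_xid_callback, MyProc);
    }

The nice thing about keeping the work as a (function, data) pair is that the
backend which ends up running it never needs to know anything about
ProcArray internals; it just executes whatever was queued while it already
holds the lock.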

Regarding the patch, the compare-and-exchange function calls that you've
used would work only for 64-bit machines, right? You would need to use
equivalent 32-bit calls on a 32-bit machine.
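
Just to illustrate the point (the names below are made up, not from your
patch): the 32-bit compare-and-exchange from port/atomics.h is the portable
flavour, so state that is updated with a single atomic compare-and-exchange
could be packed into a 32-bit word, along the lines of:

    #include "port/atomics.h"

    /* hypothetical "empty list" marker */
    #define NO_PENDING_PROC     ((uint32) 0xFFFFFFFF)

    /*
     * Assumed to be initialised elsewhere with
     * pg_atomic_init_u32(&pendingHead, NO_PENDING_PROC).
     */
    static pg_atomic_uint32 pendingHead;

    /*
     * Atomically publish my pgprocno as the new head of a pending list
     * using the usual compare-and-exchange retry loop; returns true if the
     * list was empty beforehand.  Because the state fits in 32 bits, this
     * works unchanged on 32-bit machines, whereas a u64-based version would
     * need the equivalent pg_atomic_compare_exchange_u64() call where that
     * is available.
     */
    static bool
    publish_pending(uint32 my_pgprocno)
    {
        uint32      old = pg_atomic_read_u32(&pendingHead);

        while (!pg_atomic_compare_exchange_u32(&pendingHead, &old,
                                               my_pgprocno))
        {
            /* 'old' was refreshed with the current value; just retry */
        }
        return (old == NO_PENDING_PROC);
    }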

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
