Re: Speed up Clog Access by increasing CLOG buffers

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-27 11:44:56
Message-ID: CAA4eK1KTbNbZSDo=6k4YgaJh_FM20zJCKu2Yt0bxaFMv9QcSXQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Oct 27, 2016 at 4:15 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On 10/25/2016 06:10 AM, Amit Kapila wrote:
>>
>> On Mon, Oct 24, 2016 at 2:48 PM, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
>> wrote:
>>>
>>> On Fri, Oct 21, 2016 at 7:57 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
>>> wrote:
>>>>
>>>> On Thu, Oct 20, 2016 at 9:03 PM, Tomas Vondra
>>>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>>
>>>>> In the results you've posted on 10/12, you've mentioned a regression
>>>>> with 32
>>>>> clients, where you got 52k tps on master but only 48k tps with the
>>>>> patch (so
>>>>> ~10% difference). I have no idea what scale was used for those tests,
>>>>
>>>>
>>>> That test was with scale factor 300 on a 4-socket POWER machine. I
>>>> think I need to repeat this test with multiple readings to confirm
>>>> whether it was a regression or run-to-run variation. I will do that
>>>> soon and post the results.
>>>
>>>
>>> As promised, I have rerun my test (3 times), and I did not see any
>>> regression.
>>>
>>
>> Thanks Tomas and Dilip for doing detailed performance tests for this
>> patch. I would like to summarise the performance testing results.
>>
>> 1. With an update-intensive workload, we are seeing gains of 23% to
>> 192% at client counts >= 64 with the group_update patch [1].
>> 2. With the tpc-b pgbench workload (at 1000 scale factor), we are
>> seeing gains of 12% to ~70% at client counts >= 64 [2]. Tests were
>> done on an 8-socket Intel machine.
>> 3. With the pgbench workload (both simple-update and tpc-b at 300
>> scale factor), we are seeing gains of 10% to more than 50% at client
>> counts >= 64 [3]. Tests were done on an 8-socket Intel machine.
>> 4. To see why the patch only helps at higher client counts, we have
>> done wait-event testing for various workloads [4], [5], and the
>> results indicate that at lower client counts the waits are mostly due
>> to transactionid or ClientRead. At client counts where contention on
>> CLOGControlLock is significant, this patch helps a lot to reduce that
>> contention. These tests were done on an 8-socket Intel machine and a
>> 4-socket POWER machine.
>> 5. With the pgbench workload (unlogged tables), we are seeing gains
>> of 15% to more than 300% at client counts >= 72 [6].
>>
>
> It's not entirely clear which of the above tests were done on unlogged
> tables, and I don't see that in the referenced e-mails. That would be an
> interesting thing to mention in the summary, I think.
>

One thing is clear: all the results are with either
synchronous_commit=off or unlogged tables. I think Dilip can answer
better which of those runs used unlogged tables and which used
synchronous_commit=off.

>> There are many more tests done for the proposed patches where the
>> gains are either along similar lines as above or are neutral. We do
>> see regressions in some cases.
>>
>> 1. When the data doesn't fit in shared buffers, there is a regression
>> at some client counts [7], but on analysis it has been found that it
>> is mainly due to the shift in contention from CLOGControlLock to
>> WALWriteLock and/or other locks.
>
>
> The question is why shifting the lock contention to WALWriteLock should
> cause such a significant performance drop, particularly when the test was
> done on unlogged tables. Or, if that's the case, how it makes the
> performance drop less problematic / acceptable.
>

Whenever contention shifts to another lock, there is a chance that it
can show a performance dip in some cases, and I have seen that
previously as well. The theory behind it could be like this: say you
have two locks, L1 and L2, and there are 100 processes contending on
L1 and 50 on L2. Now say you reduce the contention on L1 such that it
leads to 120 processes contending on L2; the increased contention on
L2 can slow down the overall throughput of all processes.
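
To make that concrete, here is a toy program (not PostgreSQL code; the
thread counts and sleep times are made up purely for illustration)
showing how moving waiters onto an already-contended lock can increase
total elapsed time even though the total amount of work is unchanged:

/*
 * Toy sketch: scenario A has 100 workers on L1 and 50 on L2 (as in the
 * example above); scenario B assumes L1's bottleneck has been relieved,
 * so only 30 workers queue on L1 while 120 pile onto L2.  L2's critical
 * section is serialized across more waiters, so B finishes later.
 * Build with: cc -O2 -pthread toy_contention.c
 */
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ITERATIONS 200

static pthread_mutex_t L1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t L2 = PTHREAD_MUTEX_INITIALIZER;

static void *
worker(void *arg)
{
    pthread_mutex_t *lock = (pthread_mutex_t *) arg;

    for (int i = 0; i < ITERATIONS; i++)
    {
        pthread_mutex_lock(lock);
        usleep(50);             /* simulated critical section */
        pthread_mutex_unlock(lock);
    }
    return NULL;
}

/* Run n1 workers against L1 and n2 against L2; return elapsed seconds. */
static double
run_scenario(int n1, int n2)
{
    pthread_t   threads[200];
    int         n = 0;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < n1; i++)
        pthread_create(&threads[n++], NULL, worker, &L1);
    for (int i = 0; i < n2; i++)
        pthread_create(&threads[n++], NULL, worker, &L2);
    for (int i = 0; i < n; i++)
        pthread_join(threads[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (end.tv_sec - start.tv_sec) +
           (end.tv_nsec - start.tv_nsec) / 1e9;
}

int
main(void)
{
    printf("A: 100 on L1,  50 on L2: %.2f s\n", run_scenario(100, 50));
    printf("B:  30 on L1, 120 on L2: %.2f s\n", run_scenario(30, 120));
    return 0;
}

The same number of lock acquisitions happens in both scenarios, but in
B the work behind L2 is serialized across more waiters, so the overall
throughput drops.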

> FWIW I plan to run the same test with logged tables - if it shows similar
> regression, I'll be much more worried, because that's a fairly typical
> scenario (logged tables, data set > shared buffers), and we surely can't
> just go and break that.
>

Sure, please do those tests.

>> 2. We do see in some cases that the granular_locking and
>> no_content_lock patches have shown a significant increase in
>> contention on CLOGControlLock. I have already shared my analysis of
>> that upthread [8].
>
>
> I do agree that in some cases this significantly reduces contention on the
> CLogControlLock. I do however think that currently the performance gains
> are limited almost exclusively to cases on unlogged tables, and some
> logged+async cases.
>

Right, because the contention is mainly visible for those workloads.

> On logged tables it usually looks like this (i.e. modest increase for high
> client counts at the expense of significantly higher variability):
>
> http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-skip-64
>

What variability are you referring to in those results?

> or like this (i.e. only partial recovery for the drop above 36 clients):
>
> http://tvondra.bitbucket.org/#pgbench-3000-logged-async-skip-64
>
> And of course, there are cases like this:
>
> http://tvondra.bitbucket.org/#dilip-300-logged-async
>
> I'd really like to understand why the patched results behave that
> differently depending on client count.
>

I have already explained this upthread [1]; refer to the text after
the line "I have checked the wait event results where there is more
fluctuation:".

>>
>> Attached is the latest group update clog patch.
>>
>
> How is that different from the previous versions?
>

The previous patch was showing some hunks when you tried to apply it.
I thought it might be better to rebase so that it can be applied
cleanly; otherwise, there is no change in the code.

>>
>>
>> In the last commitfest, the patch was returned with feedback to
>> evaluate the cases where it can show a win, and I think the above
>> results indicate that the patch has significant benefit on various
>> workloads. What I think is pending at this stage is that either a
>> committer or one of the reviewers of this patch needs to provide
>> feedback on my analysis [8] for the cases where the patches are not
>> showing a win.
>>
>> Thoughts?
>>
>
> I do agree the patch(es) significantly reduce CLogControlLock contention,
> although with WAL logging enabled (which is what matters for most
> production deployments) it pretty much only shifts the contention to a
> different lock (so the immediate performance benefit is 0).
>

Yeah, but I think there are use cases where users can use
synchronous_commit=off.

> Which raises the question why to commit this patch now, before we have a
> patch addressing the WAL locks. I realize this is a chicken-egg problem, but
> my worry is that the increased WALWriteLock contention will cause
> regressions in current workloads.
>

I think if we follow that reasoning, we won't be able to make progress
in terms of reducing lock contention. I think we have previously
committed code in such situations. For example, while reducing
contention in the buffer management area
(d72731a70450b5e7084991b9caa15cb58a2820df), I noticed such behaviour
and reported my analysis [2] as well (in the mail [2], you can see
there is a performance improvement at scale factor 1000 and a dip at
scale factor 5000). Later on, when the contention on the dynahash
spinlocks got alleviated (44ca4022f3f9297bab5cbffdd97973dbba1879ed),
the results were much better. If we had not reduced the contention in
buffer management, the benefits from the dynahash improvements
wouldn't have been much in those workloads (if you want, I can find
and share the results of the dynahash improvements).

> BTW I've run some tests with the number of clog buffers increased to 512,
> and it seems like a fairly positive change. Compare for example these two
> results:
>
> http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip
> http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip-clog-512
>
> The first one is with the default 128 buffers, the other one is with 512
> buffers. The impact on master is pretty obvious - for 72 clients the tps
> jumps from 160k to 197k, and for higher client counts it gives us about +50k
> tps (typically an increase from ~80k to ~130k tps). And the tps variability is
> significantly reduced.
>

Interesting, because the last time I did such testing by increasing
the clog buffers, it didn't show any improvement; rather, if I
remember correctly, it showed some regression. I am not sure what the
best way to handle this is; maybe we can make the number of clog
buffers a GUC variable.
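
To illustrate what I mean, here is a rough, standalone sketch (the
clog_buffers name and its handling are purely hypothetical; the real
sizing logic is in CLOGShmemBuffers() in
src/backend/access/transam/clog.c, which, if I remember correctly,
currently derives the count from shared_buffers and caps it at 128):

/*
 * Hypothetical sketch only -- there is no clog_buffers GUC today.  The
 * idea is that the buffer count, currently derived from shared_buffers
 * and capped at 128, could be overridden by a user-settable GUC
 * (0 meaning "keep the automatic sizing").
 */
#include <stdio.h>

#define Min(x, y)   ((x) < (y) ? (x) : (y))
#define Max(x, y)   ((x) > (y) ? (x) : (y))

static int  NBuffers = 16384;   /* shared_buffers, in 8kB pages (128MB) */
static int  clog_buffers = 0;   /* hypothetical GUC; 0 = automatic */

static int
clog_shmem_buffers(void)
{
    if (clog_buffers > 0)
        return clog_buffers;                    /* explicit user setting */
    return Min(128, Max(4, NBuffers / 512));    /* automatic rule */
}

int
main(void)
{
    printf("automatic sizing  : %d buffers\n", clog_shmem_buffers());
    clog_buffers = 512;     /* e.g. clog_buffers = 512 in postgresql.conf */
    printf("with clog_buffers : %d buffers\n", clog_shmem_buffers());
    return 0;
}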

[1] - https://www.postgresql.org/message-id/CAA4eK1J9VxJUnpOiQDf0O%3DZ87QUMbw%3DuGcQr4EaGbHSCibx9yA%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1JUPn1rV0ep5DR74skcv%2BRRK7i2inM1X01ajG%2BgCX-hMw%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
