Re: Speed up Clog Access by increasing CLOG buffers

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-30 18:32:48
Message-ID: b3586234-6c80-5b64-1261-871e0e852bbb@2ndquadrant.com
Lists: pgsql-hackers

Hi,

On 10/27/2016 01:44 PM, Amit Kapila wrote:
> On Thu, Oct 27, 2016 at 4:15 AM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> FWIW I plan to run the same test with logged tables - if it shows similar
>> regression, I'll be much more worried, because that's a fairly typical
>> scenario (logged tables, data set > shared buffers), and we surely can't
>> just go and break that.
>>
>
> Sure, please do those tests.
>

OK, so I do have results for those tests - that is, scale 3000 with
shared_buffers=16GB (so continuously writing out dirty buffers). The
following reports show the results slightly differently - all three "tps
charts" next to each other, then the speedup charts and tables.

Overall, the results are surprisingly positive - look at these results
(all ending with "-retest"):

[1] http://tvondra.bitbucket.org/index2.html#dilip-3000-logged-sync-retest

[2]
http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-noskip-retest

[3]
http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-skip-retest

All three show significant improvement, even with fairly low client
counts. For example with 72 clients, the tps improves by ~20%, without
significantly affecting the variability of the results (measured as
stddev, more on this later).

It's interesting, however, that "no_content_lock" is almost exactly the
same as master, while the other two cases improve significantly.

The other interesting thing is that "pgbench -N" [3] shows no such
improvement, unlike regular pgbench and Dilip's workload. Not sure why,
though - I'd expect to see significant improvement in this case.

I have also repeated those tests with clog buffers increased to 512 (so
4x the current maximum of 128). I only have results for Dilip's workload
and "pgbench -N":

[4]
http://tvondra.bitbucket.org/index2.html#dilip-3000-logged-sync-retest-512

[5]
http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-skip-retest-512
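
(In case anyone wants to reproduce the 512-buffer runs - there's no GUC
for this, the cap is hard-coded in CLOGShmemBuffers() in
src/backend/access/transam/clog.c, so it takes a small source tweak,
something along these lines - sketch only, assuming the 9.6-era code:)

    /*
     * Sketch only: raise the hard cap on clog buffers from 128 to 512.
     * With shared_buffers=16GB, NBuffers / 512 is far above 512, so the
     * Min() cap is what limits the result, and this yields 512 clog pages.
     */
    Size
    CLOGShmemBuffers(void)
    {
        return Min(512, Max(4, NBuffers / 512));
    }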

The results are somewhat surprising, I guess, because the effect is
wildly different for each workload.

For Dilip's workload, increasing clog buffers to 512 pretty much
eliminates all benefits of the patches. For example with 288 clients,
the group_update patch gives ~60k tps on 128 buffers [1] but only ~42k
tps on 512 buffers [4].

With "pgbench -N", the effect is exactly the opposite - while with 128
buffers there was pretty much no benefit from any of the patches [3],
with 512 buffers we suddenly get almost 2x the throughput, but only for
group_update and master (while the other two patches show no improvement
at all).

I don't have results for the regular pgbench ("noskip") with 512 buffers
yet, but I'm curious what that will show.

In general, however, I think the patches don't show any regression in
any of those workloads (at least not with 128 buffers). Based solely on
the results, I prefer group_update, because it performs as well as
master or significantly better.

>>> 2. We do see in some cases that granular_locking and
>>> no_content_lock patches has shown significant increase in
>>> contention on CLOGControlLock. I have already shared my analysis
>>> for same upthread [8].
>>

I've read that analysis, but I'm not sure I see how it explains the "zig
zag" behavior. I do understand that shifting the contention to some
other (already busy) lock may negatively impact throughput, or that the
group_update may result in updating multiple clog pages, but I don't
understand two things:

(1) Why this should result in the fluctuations we observe in some of the
cases. For example, why should we see 150k tps with 72 clients, then drop
to 92k with 108 clients, then back to 130k with 144 clients, then 84k
with 180 clients, etc. That seems fairly strange.

(2) Why this should affect all three patches, when only group_update has
to modify multiple clog pages.

For example, consider this:

http://tvondra.bitbucket.org/index2.html#dilip-300-logged-async

Looking at the % of time spent on different locks with the group_update
patch, I see this (ignoring locks at ~1%; the numeric columns are client
counts):

event_type     wait_event          36   72  108  144  180  216  252  288
-------------------------------------------------------------------------
-              -                   60   63   45   53   38   50   33   48
Client         ClientRead          33   23    9   14    6   10    4    8
LWLockNamed    CLogControlLock      2    7   33   14   34   14   33   14
LWLockTranche  buffer_content       0    2    9   13   19   18   26   22

I don't see any sign of contention shifting to other locks, just
CLogControlLock fluctuating between 14% and 33% for some reason.

Now, maybe this has nothing to do with PostgreSQL itself, but maybe it's
some sort of CPU / OS scheduling artifact. For example, the system has
36 physical cores, 72 virtual ones (thanks to HT). I find it strange
that the "good" client counts are always multiples of 72, while the
"bad" ones fall in between.

72 = 72 * 1 (good)
108 = 72 * 1.5 (bad)
144 = 72 * 2 (good)
180 = 72 * 2.5 (bad)
216 = 72 * 3 (good)
252 = 72 * 3.5 (bad)
288 = 72 * 4 (good)

So maybe this has something to do with how the OS schedules the tasks,
or maybe with some internal heuristics in the CPU, or something like that.

>> On logged tables it usually looks like this (i.e. modest increase for high
>> client counts at the expense of significantly higher variability):
>>
>> http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-skip-64
>>
>
> What variability are you referring to in those results?
>

Good question. What I mean by "variability" is how stable the tps is
during the benchmark (when measured at per-second granularity). For
example, let's run a 10-second benchmark, measuring the number of
transactions committed each second.

Now consider three runs, each doing 1000 tps on average:

run 1: 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000
run 2: 500, 1500, 500, 1500, 500, 1500, 500, 1500, 500, 1500
run 3: 0, 2000, 0, 2000, 0, 2000, 0, 2000, 0, 2000

I guess we agree those runs behave very differently, despite having the
same average throughput. This is what STDDEV(tps), i.e. the third chart
on the reports, measures.
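
To make the metric concrete, here is a trivial standalone C sketch (not
part of any benchmark script, just an illustration using the made-up
samples above) that computes the average and a simple population stddev
for each of those three runs:

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* the three made-up runs from above: 10 per-second tps samples each */
        double runs[3][10] = {
            {1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000},
            {500, 1500, 500, 1500, 500, 1500, 500, 1500, 500, 1500},
            {0, 2000, 0, 2000, 0, 2000, 0, 2000, 0, 2000}
        };

        for (int r = 0; r < 3; r++)
        {
            double sum = 0.0, sumsq = 0.0;

            for (int i = 0; i < 10; i++)
            {
                sum += runs[r][i];
                sumsq += runs[r][i] * runs[r][i];
            }

            /* population stddev: sqrt(E[x^2] - E[x]^2) */
            double mean = sum / 10;
            double stddev = sqrt(sumsq / 10 - mean * mean);

            printf("run %d: avg = %.0f tps, stddev = %.0f\n", r + 1, mean, stddev);
        }

        return 0;
    }

All three runs report avg = 1000 tps, but the stddev goes 0, 500, 1000 -
which is the kind of difference the STDDEV(tps) chart is meant to capture
(an actual STDDEV aggregate computes sample stddev, but the idea is the
same).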

So for example this [6] shows that the patches give us higher throughput
with >= 180 clients, but we also pay for that with increased variability
of the results (i.e. the tps chart will have jitter):

[6]
http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-skip-64

Of course, trading off throughput, latency and variability is one of the
crucial compromises in transaction systems - at some point the resources
get saturated and higher throughput can only be achieved at the expense
of latency (e.g. by grouping requests). But still, we'd like to get
stable tps from the system, not something that gives us 2000 tps one
second and 0 tps the next.

Of course, this is not perfect - it does not show whether there are
transactions with significantly higher latency, and so on. It'd be good
to also measure latency, but I haven't collected that info during the
runs so far.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
