Re: what to revert

From: Kevin Grittner <kgrittn(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: what to revert
Date: 2016-05-10 08:29:15
Message-ID: CACjxUsPgmm+LLG1+3d56EhCD8yEKP_b14zHGFOUpJp0Qx-J2pw@mail.gmail.com
Lists: pgsql-hackers

On Mon, May 9, 2016 at 9:01 PM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:

> Over the past few days I've been running benchmarks on a fairly
> large NUMA box (4 sockets, 32 cores / 64 with HT, 256GB of RAM)
> to see the impact of the 'snapshot too old' feature - both when
> disabled and enabled with various values in the
> old_snapshot_threshold GUC.

Thanks!

> The benchmark is a simple read-only pgbench with prepared
> statements, i.e. doing something like this:
>
> pgbench -S -M prepared -j N -c N

Do you have any plans to benchmark cases where the patch can have a
benefit? (Clearly, nobody would be interested in using the feature
with a read-only load; so while that makes a good "worst case"
scenario and is very valuable for testing the "off" versus
"reverted" comparison, it's not an intended use or one that's
likely to happen in production.)

> master-10-new - 91fd1df4 + old_snapshot_threshold=10
> master-10-new-2 - 91fd1df4 + old_snapshot_threshold=10 (rerun)

So, these runs were with identical software on the same data? Any
differences are just noise?

> * The results are a bit noisy, but I think in general this shows
> that for certain cases there's a clearly measurable difference
> (up to 5%) between the "disabled" and "reverted" cases. This is
> particularly visible on the smallest data set.

In some cases, the differences are in favor of disabled over
reverted.

> * What's fairly strange is that on the largest dataset (scale
> 10000), the "disabled" case is actually consistently faster than
> "reverted" - that seems a bit suspicious, I think. It's possible
> that I did the revert wrong, though - the revert.patch is
> included in the tgz. This is why I also tested 689f9a05, but
> that's also slower than "disabled".

Since neither disabled nor reverted consistently wins over the
other, and the difference between them is often far smaller than
the difference between the two runs with identical software, is
there any reasonable interpretation of this except that the
difference is "in the noise"?
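
To make the "in the noise" test concrete, here is a minimal sketch
of the comparison: the gap between two configurations is only
meaningful if it exceeds the gap between two runs of identical
software. The function names and all tps numbers below are
hypothetical, for illustration only; they are not taken from the
benchmark results.

```python
from statistics import mean


def relative_gap(a, b):
    """Relative difference between the mean throughput of two sets of runs."""
    mean_a, mean_b = mean(a), mean(b)
    return abs(mean_a - mean_b) / max(mean_a, mean_b)


def within_noise(variant_a, variant_b, rerun_1, rerun_2):
    """True if the gap between two variants is no larger than the gap
    between two runs of the same software on the same data."""
    return relative_gap(variant_a, variant_b) <= relative_gap(rerun_1, rerun_2)


# Hypothetical tps values (assumptions, not measured data).
disabled = [100_000, 101_000, 99_500]
reverted = [98_500, 99_000, 98_000]
rerun_1 = [90_000, 91_000, 89_000]   # identical software, first run
rerun_2 = [95_000, 96_000, 94_000]   # identical software, rerun

print(within_noise(disabled, reverted, rerun_1, rerun_2))
```

With these made-up numbers the variant gap is smaller than the
rerun gap, so the difference would be treated as noise.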

> * The performance impact with the feature enabled seems rather
> significant, especially once you exceed the number of physical
> cores (32 in this case). Then the drop is pretty clear - often
> ~50% or more.
>
> * 7e3da1c4 claims to bring the performance within 5% of the
> disabled case, but that seems not to be the case.

The commit comment says "At least in the tested case this brings
performance within 5% of when the feature is off, compared to
several times slower without this patch." The tested case was a
read-write load, so your read-only tests do nothing to determine
whether this was the case in general for this type of load.
Partly, the patch decreases chasing through HOT chains and
increases the number of HOT updates, so there are compensating
benefits of performing early vacuum in a read-write load.

> What it does, however, is bring the 'non-immediate' cases close
> to the immediate ones (before, the performance drop came much
> sooner in these cases - at 16 clients).

Right. This is, of course, just the first optimization that we
were able to get in "under the wire" before beta, but the other
optimizations under consideration would only tend to bring the
"enabled" cases closer together in performance, not make an enabled
case perform the same as when the feature was off -- especially for
a read-only workload.

> * It also seems to me the feature greatly amplifies the
> variability of the results, somehow. It's not uncommon to see
> results like this:
>
> master-10-new-2 235516 331976 133316 155563 133396
>
> where after the first runs (already fairly variable) the
> performance tanks to ~50%. This happens particularly with higher
> client counts, otherwise the max-min is within ~5% of the max.
> There are a few cases where this happens without the feature
> (i.e. old master, reverted or disabled), but it's usually much
> smaller than with it enabled (immediate, 10 or 60). See the
> 'summary' sheet in the ODS spreadsheet.
>
> I don't know what the problem is here - at first I thought that
> maybe something else was running on the machine, or that
> anti-wraparound autovacuum kicked in, but that seems not to be
> the case. There's nothing like that in the postgres log (also
> included in the .tgz).
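
The variability measure used in the quoted text (max-min relative
to the max) can be computed as in the sketch below. The five
values are the master-10-new-2 results quoted above; the function
name is an assumption made for illustration.

```python
def spread_vs_max(tps):
    """Return (max - min) / max for a list of throughput results."""
    return (max(tps) - min(tps)) / max(tps)


# The master-10-new-2 line quoted above.
runs = [235516, 331976, 133316, 155563, 133396]

print(f"{spread_vs_max(runs):.1%}")  # prints 59.8%
```

That is far outside the ~5% spread described for the lower client
counts.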

I'm inclined to suspect NUMA effects. It would be interesting to
try with the NUMA patch and cpuset I submitted a while back or with
fixes in place for the Linux scheduler bugs which were reported
last month. Which kernel version was this?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
