Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Alik Khilazhev <a(dot)khilazhev(at)postgrespro(dot)ru>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] [WIP] Zipfian distribution in pgbench
Date: 2018-10-08 20:44:20
Message-ID: CAH2-Wznthi4=qqBF56mWNYi-MuOzMR1mSGC7gLyi6ay46f3=Jw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 7, 2017 at 12:45 AM Alik Khilazhev
<a(dot)khilazhev(at)postgrespro(dot)ru> wrote:
> PostgreSQL shows very poor results on YCSB Workload A (50% SELECT and 50% UPDATE of a random row by PK) when benchmarked with a large number of clients using a Zipfian distribution. MySQL also shows a decline, but not as significant a one as PostgreSQL's. MongoDB shows no decline at all. If pgbench had a Zipfian-distribution random number generator, everyone would be able to do research on this topic without using YCSB.
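
To make it easy to reproduce the distribution itself, here is a rough C
sketch of the rejection method from Gray et al., "Quickly Generating
Billion-Record Synthetic Databases" (SIGMOD '94) -- as I understand it,
this is also essentially what pgbench's random_zipfian() ended up doing
for skew parameters above 1. The function and helper names are mine,
not pgbench's:

#include <math.h>
#include <stdlib.h>

/*
 * Draw from a Zipfian distribution over [1, n] with skew s > 1, using
 * Gray et al.'s rejection method.
 */
static long
zipfian_rand(long n, double s)
{
    double      b = pow(2.0, s - 1.0);

    for (;;)
    {
        double      u = drand48();
        double      v = drand48();
        double      x,
                    t;

        if (u == 0.0)
            continue;           /* avoid pow(0, negative exponent) */

        x = floor(pow(u, -1.0 / (s - 1.0)));
        t = pow(1.0 + 1.0 / x, s - 1.0);

        /* accept x if it is in range and passes the rejection test */
        if (x <= (double) n &&
            v * x * (t - 1.0) / (b - 1.0) <= t / b)
            return (long) x;
    }
}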

I've been thinking about this again, in light of my recent work to
improve the B-Tree code by making all entries have fully unique keys,
adding suffix truncation, and making better choices about split points
[1].

Within the last year, somebody posted a patch that addressed the issue
by making the heavyweight lock manager lock a value rather than a heap
TID. I can't find that thread right now -- perhaps someone can point it
out. I suspect that my patch will have a similar effect on the Zipfian
workload to that other patch, though it will do so without touching
anything outside of the B-Tree code, and without changing the
user-visible semantics of locking.

Can we possibly find a way to rerun this Zipfian experiment/test case?
I bet Alexander's recent B-Tree patch will also help.

Here is why I suspect my patch will help with the Zipfian workload,
particularly on a multi-socket machine:

The logic for following downlinks in Postgres is that downlinks are a
lower bound, but we still cannot follow them when the key is equal to
our insertion scankey -- we must go left, even when we have all keys
filled out. This means that we sometimes end up holding an exclusive
buffer lock on "the first leaf page the value could be on", even
though that page doesn't have any such values (they're all in later
pages, to the right). So we accidentally lock out insertions of the
value 1 when we insert the value 2, of the value 2 when we insert the
value 3, of the value 3 when we insert the value 4, and so on down the
line. This is the worst possible thing for the Zipfian test case, I
think, since most of the contention is artificial.
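
To make that concrete, here is what the descent rule amounts to -- an
illustrative sketch with made-up types, not the actual _bt_binsrch()
code:

/*
 * Pick a downlink on an internal page, pre-patch rule: follow the last
 * downlink whose separator key is strictly *less than* the insertion
 * scankey.  Equal separators force us left, which is how every writer
 * of value V funnels onto the leftmost leaf page that could contain V.
 */
typedef struct PivotSketch
{
    int         key;            /* separator: lower bound for child */
    int         child;          /* downlink to child page */
} PivotSketch;

static int
choose_downlink(const PivotSketch *pivots, int npivots, int scankey)
{
    int         lo = 0,
                hi = npivots;

    /* binary search for the first separator >= scankey ... */
    while (lo < hi)
    {
        int         mid = lo + (hi - lo) / 2;

        if (pivots[mid].key < scankey)
            lo = mid + 1;       /* strictly less: may still go right */
        else
            hi = mid;           /* equal or greater: must go left */
    }

    /* ... and follow the downlink just before it */
    return pivots[lo - 1].child;    /* pivots[0] acts as minus infinity */
}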

I think that my patch may ameliorate the problem, because:

* We make the tie-breaker heap TID attribute have DESC sort order, so
generally speaking contention starts on the leftmost page among leaf
pages full of duplicates -- for reads, and for writes. (A comparator
sketch follows the last point below.)

* The patch works very hard to make sure that large groups of
duplicates are not spread across pages. There would be no mixing of
leaf pages containing the value 1 and the value 2 with the Zipfian
test case, for example. Old duplicates would be packed into full
pages.

* We can invent a fake lower-than-any-real-value heap TID in the
insertion scankey in certain cases that don't have one. This way, the
scankey as a whole is *not* equal to pivot tuples that are at least
equal on non-heap-TID attributes, so we *can* follow a downlink that
is "equal" to our scankey, provided it has a truncated-away heap TID
attribute value.
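
Putting the first and third points together, the comparison rule comes
out something like this -- a minimal sketch of my description above,
with made-up types and a flattened TID, not the patch's actual code:

#include <stdint.h>

typedef struct TupleSketch
{
    int         key;            /* user-visible key attribute */
    uint64_t    htid;           /* heap TID flattened to an integer */
} TupleSketch;

/*
 * A TID lower than any real heap TID (real flattened TIDs are assumed
 * nonzero here).  Under DESC TID order, a scankey carrying it can
 * never compare as equal to a pivot tuple that matches on the user
 * keys alone.
 */
#define SENTINEL_HTID   0

static int
tuple_cmp(const TupleSketch *a, const TupleSketch *b)
{
    /* user keys compare in the usual ASC order ... */
    if (a->key != b->key)
        return (a->key < b->key) ? -1 : 1;

    /* ... but the heap TID tie-breaker is DESC: larger TIDs sort first */
    if (a->htid != b->htid)
        return (a->htid > b->htid) ? -1 : 1;

    return 0;
}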

In short, the artificial "false sharing" that we see in the B-Tree for
the Zipfian test case may go away. There will be *no* contention
between an inserter (really, an UPDATE-er) that inserts the value 2 and
concurrent inserters of the value 1. Each of the ways in which that was
likely to happen is fixed by the patch. There is still
buffer lock contention on heap pages, and the contention of waiting
for a row lock. Maybe the busy waiting will automatically be fixed,
though, since you have one point of contention for each of the really
hot values at the far left of the leaf level.

[1] https://commitfest.postgresql.org/20/1787/
--
Peter Geoghegan
