Re: B-Tree support function number 3 (strxfrm() optimization)

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thom Brown <thom(at)linux(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: B-Tree support function number 3 (strxfrm() optimization)
Date: 2014-06-06 00:37:33
Message-ID: CAM3SWZQKwELa58h1R9sxwAOCJpgs=xJbiu7apDyq9pUuSfQX6g@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 8, 2014 at 2:57 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> Hmm. I would expect the worst case to be where the strxfrm is not helping
> because all the entries have the same prefix, but the actual key is as short
> and cheap-to-compare as possible. So the padding should be as short as
> possible. Also, we have a fast path for pre-sorted input, which reduces the
> number of comparisons performed; that will make the strxfrm overhead more
> significant.
>
> I'm getting about 2x slowdown on this test case:
>
> create table sorttest (t text);
> insert into sorttest select 'foobarfo' || (g) || repeat('a', 75) from
> generate_series(10000, 30000) g;

One thing that isn't all that obvious about this worst case is that
it's in general very qsort() friendly, and therefore the startup costs
(copying) totally dominate. Actually, you're not even sorting -
you're verifying that the tuples are already exactly in order (a
questionable optimization we apply at every level). Consider:

postgres=# select correlation from pg_stats where tablename =
'sorttest' and attname = 't';
 correlation
-------------
           1
(1 row)

So you only have to do n - 1 comparisons, never n log n. This is a
very skewed test. There is no reason to think that strcoll() is better
than strxfrm() + strcmp() if the first difference is earlier. That's
the main reason why what I called a worst case is much more
representative than this, even if we assume that it's very common for
people to sort strings that don't differ in the first 8 bytes and yet
are not identical.
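
To make that concrete, here's a minimal standalone sketch (not the
patch's or Postgres's actual code) of the kind of pre-sorted check
involved; over fully ordered input it performs exactly n - 1
comparisons and the "sort" ends there:

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* strcmp() stands in for the real comparator in this sketch */
static bool
input_is_presorted(char **strings, size_t n)
{
    size_t  i;

    for (i = 1; i < n; i++)
    {
        if (strcmp(strings[i - 1], strings[i]) > 0)
            return false;       /* out of order; a full sort is needed */
    }
    return true;                /* verified with n - 1 comparisons */
}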

I took a look at fixing your worst case all the same, despite becoming
discouraged during discussion at pgCon. I thought about it for a
while, and realized that the problems are probably solvable. Even so,
I would like to emphasize that this is a family of techniques for
B-Tree operator classes. Sorting heap tuples is not the only use case,
and it's probably not even the most interesting one, yet all the
criticisms are largely confined to sorting alone. That's because
wasted cycles on a poor man's comparison are not too problematic when
the only additional work per comparison is a few CPU instructions
(rather than the entire transformation, whose cost can be amortized
over a huge length of time). I'm thinking of inner pages in B-Trees
here. I've looked at heap tuple sorting first only because the
sortsupport infrastructure happens to be available, making it the best
proving ground.

Anyway, major steps against worst case performance in the attached
revision include:

* A configure AC_TRY_RUN test checks the suitability of the system
strxfrm() implementation for the purposes of this optimization. There
can be no "header bytes" (people at pgCon reported redundant leading
bytes in the Mac OS X implementation, to my great surprise), and with
ASCII code points you get the full benefit of being able to store the
leading 8 code points in the poor man's key (we also insist that
sizeof(Datum) == 8 to apply the poor man's optimization). It might be
that this restricts the optimization to 64-bit Linux entirely, where
glibc is almost universally used. If that's true, I consider it
acceptable, given that the majority of Postgres systems use this
family of platforms in practice. Even so, the fact that not every
implementation meets my standard came as a big surprise to me, and so
I hope that the problem is limited to Mac OS X. I'm slightly concerned
that all BSD systems are affected by this issue, but even if that were
true, not covering those cases would be acceptable in my view.
*Maybe* we could come up with a scheme for stripping those header
bytes if someone feels very strongly about it, and we judge it to be
worth getting our hands dirty in that way (we could even dynamically
verify that the header bytes always match, and bail at little cost on
those platforms if our assumption is somehow undermined).
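
For illustration, here's a minimal standalone sketch of the kind of
program such an AC_TRY_RUN test might compile and run (this is not the
patch's actual test program; the locale, the test strings, and the
"no shared leading byte" criterion are simplifying assumptions). Exit
status 0 indicates a usable strxfrm():

#include <locale.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
    char    buf1[64];
    char    buf2[64];

    /* illustrative locale only */
    if (setlocale(LC_COLLATE, "en_US.UTF-8") == NULL)
        exit(1);                /* cannot test; assume unsuitable */

    if (strxfrm(buf1, "a", sizeof(buf1)) >= sizeof(buf1) ||
        strxfrm(buf2, "b", sizeof(buf2)) >= sizeof(buf2))
        exit(1);                /* blobs implausibly large */

    /*
     * With constant "header bytes" (as reported for the Mac OS X
     * implementation), distinct single-character strings would produce
     * blobs sharing their first byte, wasting space in the 8-byte poor
     * man's key.  Insist that the very first blob bytes already differ.
     */
    if (buf1[0] == buf2[0])
        exit(1);

    exit(0);
}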

* The dominant additional cost in Heikki's worst case is wasted
strxfrm() transformations (or maybe just the memcpy() overhead
required to NUL-terminate text for strxfrm() processing). When
copying heap tuples into the memtuples array, we instrument the
costs and weigh them against the likely eventual benefits using
heuristic rules. We may choose to abort the transformation process
early if it doesn't look promising. This involves maintaining the
mean string length seen so far (which is taken as a proxy for
transformation costs), as well as maintaining the approximate
cardinality of the set of already-generated poor man's keys using the
HyperLogLog algorithm [1].
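
To give a flavor of the kind of rule involved, here's a hedged sketch
(illustrative names and thresholds only; the actual logic lives in
bttext_abort() and differs in detail):

#include <stdbool.h>

/*
 * Sketch only: decide whether to abandon poor man's key generation,
 * based on what has been observed while copying tuples so far.  All
 * thresholds here are made up for illustration.
 */
static bool
poorman_should_abort(double rows_copied,
                     double distinct_keys,      /* HyperLogLog estimate */
                     double estimated_rows)     /* hint from the planner */
{
    /* don't give up before seeing a meaningful sample */
    if (rows_copied < 1000 || rows_copied < estimated_rows * 0.05)
        return false;

    /*
     * If nearly every poor man's key seen so far is a duplicate, most
     * comparisons will be resolved by the full tie-breaker comparison
     * anyway, and the up-front strxfrm() work is probably wasted.
     */
    if (distinct_keys / rows_copied < 0.05)
        return true;

    return false;
}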

This almost completely fixes the worst case performance Heikki
complained of. With smaller strings you might get my implementation to
show a worse regression than your worst case does, by carefully laying
traps for the heuristics that now appear within bttext_abort(), but I
suspect not by terribly much. These are still fairly implausible
cases.

The cost of adding all of these ameliorating measures appears to be
very low. We're so bottlenecked on memory bandwidth that the
fixed-size overhead of maintaining the poor man's key cardinality
estimate, and the extra cycles spent hashing n keys for the benefit of
HyperLogLog, just don't seem to matter next to the big savings on the
n log n comparisons. The best case performance is seemingly just as
good as before (about a 200% improvement in transaction throughput for
one client, but closer to a 250% improvement with many clients and/or
expensive collations), while the worst case is much, much better. As
the HyperLogLog paper [2] says:

"In practice, the HYPERLOGLOG program is quite efficient: like the
standard LOGLOG of [10] or MINCOUNT of [16], its running time is
dominated by the computation of the hash function, so that it is only
three to four times slower than a plain scan of the data (e.g., by the
Unix command “wc -l”, which merely counts end-of-lines)."

(Since we're doing all that copying and everything else anyway, and
only hashing the first 8 bytes, the ratio is more compelling still for
us, before we even begin the sort.)

I'm not entirely sure that passing down the outer plan's estimated row
count in the ExecSort() case, to give the bttext_abort() heuristics a
"sense of proportion" about costs, is the right thing. It's a
difficult problem to judge, so I've left that in despite some
misgivings. In any case, I think that this revision isn't too far from
the best of both worlds. I do have some concerns that I don't have
things well balanced within bttext_abort(), but it's a start. It's
hard to exactly characterize the break-even point here, particularly
given that we only know about cardinality and string length, and not
distribution. Let's not lose sight of the fact that in the real world
the large majority of sorts will be on the right side of that
break-even point, though. This is all about ameliorating infrequent
worst cases.

Because of the new strxfrm()-quality AC_TRY_RUN configure test, you'll
need to run "autoreconf" and so on to get the patch working. I didn't
include the regenerated "configure" file in the patch itself.

I use "en_US.UTF8" for the purposes of all benchmark figures here. I
attach some interesting numbers for a variety of cases
(poorman-performance.txt). Worse case performance is considered.

I've been using a table of 317,102 cities of the world to look at
average and best case performance. Overview:

postgres=# select count(*), country from cities group by country order
by 1 desc limit 15;
 count |         country
-------+--------------------------
 66408 | India
 35845 | France
 29926 | United States of America
 13154 | Turkey
 11210 | Mexico
 11028 | Germany
  9352 | Tanzania
  8118 | Spain
  8109 | Italy
  7680 | Burkina Faso
  6066 | Czech Republic
  6043 | Iran
  5777 | Slovenia
  5584 | Brazil
  5115 | Philippines
(15 rows)

These city names are all Latin characters with accents and other
diacritics. For example:

postgres=# select * from cities where country = 'Sweden' limit 5;
 country | province |     city      |  one   |  two   | three  |  four
---------+----------+---------------+--------+--------+--------+--------
 Sweden  | Blekinge | Backaryd      | [null] | [null] | [null] | [null]
 Sweden  | Blekinge | Bräkne-Hoby   | [null] | [null] | [null] | [null]
 Sweden  | Blekinge | Brömsebro     | [null] | [null] | [null] | [null]
 Sweden  | Blekinge | Drottningskär | [null] | [null] | [null] | [null]
 Sweden  | Blekinge | Eringsboda    | [null] | [null] | [null] | [null]
(5 rows)

You can download a plain format dump of this database here:
http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/data/cities.dump

I also attach some DEBUG1 output that shows the accuracy of
HyperLogLog estimation (of poor man's normalized key cardinality) for
the cases tested (poorman-debug-hll-cardinality.txt). It's very
accurate given the low, fixed memory overhead chosen.

[1] https://en.wikipedia.org/wiki/HyperLogLog

[2] http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
--
Peter Geoghegan

Attachment Content-Type Size
poorman-performance.txt text/plain 6.2 KB
poorman-debug-hll-cardinality.txt text/plain 6.3 KB
poorman-hll.2014_06_05.patch text/x-patch 51.9 KB
