Re: Macro customizable hashtable / bitmapscan & aggregation perf

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Macro customizable hashtable / bitmapscan & aggregation perf
Date: 2016-10-01 23:37:35
Message-ID: 0b71d15b-89ef-bd4f-d653-0a118b550601@2ndquadrant.com
Lists: pgsql-hackers

On 10/01/2016 09:59 PM, Andres Freund wrote:
> Hi,
>
> On 2016-10-01 20:19:21 +0200, Tomas Vondra wrote:
>> On 10/01/2016 02:44 AM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2016-07-26 17:43:33 -0700, Andres Freund wrote:
>>>> In the attached patch I've attached simplehash.h, which can be
>>>> customized by a bunch of macros, before being inlined. There's also a
>>>> patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
>>>> execGrouping.c.
>>>
>>> Attached is a significantly improved version of this series. The main
>>> changes are:
>>>
>>> - The hash table now uses robin-hood style hash collision handling. See the
>>> commit message of 0002 (or simplehash.h) for an introduction. That
>>> significantly reduces the impact of "clustering" due to linear
>>> addressing.
>>
>> Interesting. Have you looked at cuckoo hashing?
>
> Yes. I don't think it's a good fit for modern CPUs. Because it
> doesn't probe linearly you constantly have random uncached memory
> accesses.
>
> I've played with a few schemes, and they all seemed to be slower
> than linear-probing variants, unless you intentionally used a
> wrong hash-distribution, or intentionally "unbalanced" the hash-space
> by iterating over either end of the keyspace and moving the entries
> around.

OK, understood.
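
For the archives, my understanding of the cache argument, as a
hypothetical sketch (not the actual code from the patch): with linear
probing every collision is resolved by looking at the adjacent bucket,
so a probe sequence stays within a few cache lines, while a cuckoo
lookup recomputes a second hash and jumps to an unrelated, likely
uncached bucket on every miss.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct Entry
    {
        uint32_t    key;
        uint32_t    value;
        bool        used;
    } Entry;

    /*
     * Simplified linear-probing lookup; "mask" is size - 1, with size a
     * power of 2. Assumes at least one empty bucket (load factor < 1),
     * so the loop always terminates.
     */
    static Entry *
    lookup(Entry *buckets, uint32_t mask, uint32_t key, uint32_t hash)
    {
        uint32_t    idx = hash & mask;

        for (;;)
        {
            Entry      *e = &buckets[idx];

            if (!e->used)
                return NULL;            /* empty slot: key not present */
            if (e->key == key)
                return e;               /* found */
            idx = (idx + 1) & mask;     /* linear probe: adjacent bucket */
        }
    }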

>
>>> - Significant comment and correctness fixes, both in simplehash.h
>>>   itself, and 0003/0004.
>>> - a lot of other random performance improvements for the hashing code.
>>>
>>>
>>> Unfortunately I'm running out of battery right now, so I don't want to
>>> re-run the benchmarks posted upthread (loading takes a while). But the
>>> last time I ran them all the results after the patches were better than
>>> before.
>>>
>>
>> Well, I have rather bad experience with running benchmarks on laptops anyway
>> - a lot of noise due to power management, etc.
>
> Well, with ~2x improvements that's not really a big detractor.
> Using a new CPU makes some forms of analysis easier (better PMUs).
>

True.

>
> For long-running tests I agree.
>
>
>> What about running a bigger benchmark - say, TPC-DS - and evaluating
>> the impact?
>
> Worthwhile, although obviously the impact will be a lot smaller,
> since they're not entirely bottlenecked on hash-aggs and bitmap
> scans.
>

Sure, the improvement won't be as significant as on the simple queries,
but it's interesting IMHO.

>
>> I think a crucial part of the benchmarking will be identifying and
>> measuring corner cases, e.g. a lot of rows with the same key, etc.
>> Although that probably is not a major issue for the two places
>> switched to the new implementation (e.g. execGrouping merges the
>> duplicates to a single group, by definition).
>
> Hm. I don't really see a case where that's going to cause issues - all
> the execGrouping.c users only store one key, and either ignore later
> ones, or add the data to the initial tuple in the group. I don't really
> see any case where repeated keys should cause an issue for a hash table?
>

Yep, that's pretty much what I suggested. I was wondering more about the
other places where this hash table might be used - I've been thinking
about hashjoins, but now that I think of it, that's fairly specialized
and tuned code. In any case, let's not complicate the
discussion for now.

>
>>> This patch series currently consists out of four patches:
>>> - 0001 - add unlikely/likely() macros. The use here is not entirely
>>> mandatory, but they do provide a quite consistent improvement. There
>>> are other threads where they proved to be beneficial, so I see little
>>> reason not to introduce them.
>>> - 0002 - the customizable hashtable itself. It now contains documentation.
>>> - 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix
>>> for the issue mentioned in [1], to improve performance in heavily lossy
>>> scenarios (otherwise we could regress in some extreme cases)
>>> - 0004 - convert execGrouping.c, e.g. used by hash aggregates
>>>
>>>
>>> While not quite perfect yet, I do think this is at a state where input
>>> is needed. I don't want to go a lot deeper into this rabbit hole
>>> before we have some agreement that this is a sensible course of action.
>>>
>>
>> So, is it the right time to do some benchmarking?
>
> That's one thing that's required, yes. The other is whether we can live
> with the ugliness of implementing templates via macros. I do think we
> can, but...
>

Hmmm ... not sure. If one of the points is to get rid of function calls
determined at runtime (which make it impossible for the compiler to
optimize the code), then I can think of three options:

(a) copy-paste the code for each place
(b) use some templating
(c) use JIT

I think (b) is way better than (a), and I don't think we're using JIT
anywhere at this point. So while the macro-based templates look a bit
awkward, I'm not aware of a better way to do this in C.
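
FWIW, for anyone skimming the thread, option (b) boils down to something
like the following sketch (made-up macro names, not the actual
simplehash.h parameters): the including file #defines the "template
arguments", and the header token-pastes them into concrete types and
functions.

    #include <stdint.h>

    /* "template arguments", defined before including the template header */
    #define HT_PREFIX       inthash
    #define HT_KEY_TYPE     uint32_t

    /* name-pasting machinery, normally inside the template header itself */
    #define HT_PASTE2(a, b) a##_##b
    #define HT_PASTE(a, b)  HT_PASTE2(a, b)
    #define HT_FUNC(name)   HT_PASTE(HT_PREFIX, name)

    /* expands to: static inline uint32_t inthash_hash(uint32_t key) */
    static inline uint32_t
    HT_FUNC(hash)(HT_KEY_TYPE key)
    {
        /* concrete, fully inlinable code - no function pointer calls */
        return key * 2654435761u;   /* Knuth's multiplicative hash */
    }

Since every call is resolved at compile time, the optimizer can inline
and specialize the whole table - exactly what runtime callbacks prevent.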

A few comments, after quickly skimming through the first two patches:

1) SH_CREATE does this:

    /* round up size to the next power of 2, eases lookups */
    if (size < 2)
        size = 2;
    else
        size = sh_pow2_int(size);

I very much doubt a hash table with 2 buckets is of much use. What about
using some reasonable lower boundary - IIRC hashjoins use 1024 buckets,
which might be a bit too much, but perhaps 256 would work?
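
For reference, the usual bit-twiddling version of that rounding looks
roughly like this (illustrative only, not necessarily how sh_pow2_int is
implemented):

    #include <stdint.h>

    /*
     * Round v up to the next power of 2: smear the topmost set bit into
     * all lower positions, then add 1. The initial decrement keeps exact
     * powers of 2 unchanged; the caller must handle v < 2 separately.
     */
    static uint64_t
    next_pow2(uint64_t v)
    {
        v--;
        v |= v >> 1;
        v |= v >> 2;
        v |= v >> 4;
        v |= v >> 8;
        v |= v >> 16;
        v |= v >> 32;
        return v + 1;
    }

A higher minimum would then just mean clamping before the rounding, e.g.
size = Max(size, 256).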

2) tb->resize_above

I'd say 'resize_threshold' is a better name. Also, 0.8 should be defined
as a constant somewhere (not sure if it makes sense to make it part of
SH_TYPE, so that hash tables may use different load factors).
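
Something like this, I mean (a sketch with made-up names; whether the
constant should live in simplehash.h or become another parameter of the
template is the open question):

    #include <stdint.h>

    /* give the load factor a name, instead of a bare 0.8 literal */
    #define SH_FILLFACTOR 0.8

    typedef struct HashTable
    {
        uint64_t    size;               /* number of buckets */
        uint64_t    resize_threshold;   /* grow once members exceed this */
    } HashTable;

    static void
    set_size(HashTable *tb, uint64_t newsize)
    {
        tb->size = newsize;
        tb->resize_threshold = (uint64_t) (newsize * SH_FILLFACTOR);
    }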

3) 'size' is a rather poor name for a parameter (e.g. in SH_CREATE or
SH_RESIZE), as it's unclear whether it's 'element size in bytes', 'total
size in bytes' or 'number of entries'.

4) SH_CONTAINS sounds more like a function checking whether a hash table
contains a key than like a 'type of hash table entry'. What about
SH_ENTRY_TYPE instead? Also, SH_KEYTYPE should be SH_KEY_TYPE I guess.

5) SH_RESIZE

Do we expect to shrink hash tables? If not, SH_RESIZE should probably
check that newsize > oldsize. If yes, it should check that there's
enough space for all entries (with the load factor).

It's not entirely clear why it's guaranteed that there's always an
element at its optimal position (when the load factor is < 1). Adding an
explanation or a link to a paper would be helpful, I guess.

6) SH_STAT

This bit is a bit suspicious:

    uint32      max_collisions = 0;

    ...

    /* single contained element is not a collision */
    curcoll--;
    total_collisions += curcoll;
    if (curcoll > max_collisions)
        max_collisions = curcoll - 1;

Shouldn't the last line be just "max_collisions = curcoll"?

7) Should hash agg size estimation for the planner consider the fillfactor?

I think we should account for the fillfactor - i.e. for the memory
actually allocated, even if part of the hash table is still empty. After
all, that's what hashjoins do.
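
That is, the estimate needs to be scaled by the load factor, roughly
like this (a sketch with a made-up function, not actual planner code,
and assuming entries are stored inline in the bucket array, so holding
N entries means allocating N / fillfactor buckets):

    #include <stddef.h>

    #define SH_FILLFACTOR 0.8   /* assumed load factor, as discussed above */

    /* sketch of a fillfactor-aware memory estimate */
    static size_t
    estimate_hashtable_mem(double nentries, size_t entry_size)
    {
        double      nbuckets = nentries / SH_FILLFACTOR;

        return (size_t) (nbuckets * entry_size);
    }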

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
