Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: James Coleman <jtc331(at)gmail(dot)com>
Cc: David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays
Date: 2020-04-26 20:49:27
Message-ID: 20200426204927.bznc6vqefocsav53@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote:
>On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra
><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
>> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
>> >>
>> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> >> > This reminds me our attempts to add bloom filters to hash joins, which I
>> >> > think ran into mostly the same challenge of deciding when the bloom
>> >> > filter can be useful and is worth the extra work.
>> >>
>> >> Speaking of that, it would be interesting to see how a test where you
>> >> write the query as IN(VALUES(...)) instead of IN() compares. It would
>> >> be interesting to know if the planner is able to make a more suitable
>> >> choice and also to see how all the work over the years to improve Hash
>> >> Joins compares to the bsearch with and without the bloom filter.
>> >
>> >It would be interesting.
>> >
>> >It also makes one wonder about optimizing these into to hash
>> >joins...which I'd thought about over at [1]. I think it'd be a very
>> >significant effort though.
>> >
>>
>> I modified the script to also do the join version of the query. I can
>> only run it on my laptop at the moment, so the results may be a bit
>> different from those I shared before, but it's interesting I think.
>>
>> In most cases it's comparable to the binsearch/bloom approach, and in
>> some cases it actually beats them quite significantly. It seems to
>> depend on how expensive the comparison is - for "int" the comparison is
>> very cheap and there's almost no difference. For "text" the comparison
>> is much more expensive, and there are significant speedups.
>>
>> For example the test with 100k lookups in array of 10k elements and 10%
>> match probability, the timings are these
>>
>> master: 62362 ms
>> binsearch: 201 ms
>> bloom: 65 ms
>> hashjoin: 36 ms
>>
>> I do think the explanation is fairly simple - the bloom filter
>> eliminates about 90% of the expensive comparisons, so it's 20ms plus
>> some overhead to build and check the bits. The hash join probably
>> eliminates a lot of the remaining comparisons, because the hash table
>> is sized to have one tuple per bucket.
>>
>> Note: I also don't claim the PoC has the most efficient bloom filter
>> implementation possible. I'm sure it could be made faster.
>>
>> Anyway, I'm not sure transforming this to a hash join is worth the
>> effort - I agree that seems quite complex. But perhaps this suggest we
>> should not be doing binary search and instead just build a simple hash
>> table - that seems much simpler, and it'll probably give us about the
>> same benefits.
>
>That's actually what I originally thought about doing, but I chose
>binary search since it seemed a lot easier to get off the ground.
>

OK, that makes perfect sense.

>If we instead build a hash is there anything else we need to be
>concerned about? For example, work mem? I suppose for the binary
>search we already have to expand the array, so perhaps it's not all
>that meaningful relative to that...
>

I don't think we need to be particularly concerned about work_mem. We
don't care about it now, and it's not clear to me what we could do about
it - we already have the array in memory anyway, so it's a bit futile.
Furthermore, if we need to care about it, it probably applies to the
binary search too.

>I was looking earlier at what our standard hash implementation was,
>and it seemed less obvious what was needed to set that up (so binary
>search seemed a faster proof of concept). If you happen to have any
>pointers to similar usages I should look at, please let me know.
>

I think the hash join implementation is far too complicated. It has to
care about work_mem, so it implements batching, etc. That's a lot of
complexity we don't need here. IMO we could use either the usual
dynahash, or maybe even the simpler simplehash.

FWIW it'd be good to verify the numbers I shared, i.e. checking that the
benchmarks makes sense and running it independently. I'm not aware of
any issues but it was done late at night and only ran on my laptop.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2020-04-26 21:20:01 Re: Setting min/max TLS protocol in clientside libpq
Previous Message Jonathan S. Katz 2020-04-26 19:21:38 Re: Poll: are people okay with function/operator table redesign?