Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: James Coleman <jtc331(at)gmail(dot)com>
Cc: David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays
Date: 2020-04-26 00:31:37
Message-ID: 20200426003137.a2vqlktwaey5gvft@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
>On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
>> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> > This reminds me our attempts to add bloom filters to hash joins, which I
>> > think ran into mostly the same challenge of deciding when the bloom
>> > filter can be useful and is worth the extra work.
>> Speaking of that, it would be interesting to see how a test where you
>> write the query as IN(VALUES(...)) instead of IN() compares. It would
>> be interesting to know if the planner is able to make a more suitable
>> choice and also to see how all the work over the years to improve Hash
>> Joins compares to the bsearch with and without the bloom filter.
>It would be interesting.
>It also makes one wonder about optimizing these into to hash
>joins...which I'd thought about over at [1]. I think it'd be a very
>significant effort though.

I modified the script to also do the join version of the query. I can
only run it on my laptop at the moment, so the results may be a bit
different from those I shared before, but it's interesting I think.

In most cases it's comparable to the binsearch/bloom approach, and in
some cases it actually beats them quite significantly. It seems to
depend on how expensive the comparison is - for "int" the comparison is
very cheap and there's almost no difference. For "text" the comparison
is much more expensive, and there are significant speedups.

For example the test with 100k lookups in array of 10k elements and 10%
match probability, the timings are these

master: 62362 ms
binsearch: 201 ms
bloom: 65 ms
hashjoin: 36 ms

I do think the explanation is fairly simple - the bloom filter
eliminates about 90% of the expensive comparisons, so it's 20ms plus
some overhead to build and check the bits. The hash join probably
eliminates a lot of the remaining comparisons, because the hash table
is sized to have one tuple per bucket.

Note: I also don't claim the PoC has the most efficient bloom filter
implementation possible. I'm sure it could be made faster.

Anyway, I'm not sure transforming this to a hash join is worth the
effort - I agree that seems quite complex. But perhaps this suggest we
should not be doing binary search and instead just build a simple hash
table - that seems much simpler, and it'll probably give us about the
same benefits.


Tomas Vondra
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size application/x-sh 15.7 KB
saop-hashjoin.ods application/vnd.oasis.opendocument.spreadsheet 36.5 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2020-04-26 01:27:48 timeout on jacana/bowerbird
Previous Message Peter Geoghegan 2020-04-26 00:17:17 Re: Using Valgrind to detect faulty buffer accesses (no pin or buffer content lock held)