Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: James Coleman <jtc331(at)gmail(dot)com>
Cc: David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays
Date: 2020-04-28 12:25:32
Message-ID: 20200428122532.r2cgg2uesyxpecer@development
Lists: pgsql-hackers

On Mon, Apr 27, 2020 at 09:04:09PM -0400, James Coleman wrote:
>On Sun, Apr 26, 2020 at 7:41 PM James Coleman <jtc331(at)gmail(dot)com> wrote:
>>
>> On Sun, Apr 26, 2020 at 4:49 PM Tomas Vondra
>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> >
>> > On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote:
>> > >On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra
>> > ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> > >>
>> > >> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
>> > >> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
>> > >> >>
>> > >> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> > >> >> > This reminds me our attempts to add bloom filters to hash joins, which I
>> > >> >> > think ran into mostly the same challenge of deciding when the bloom
>> > >> >> > filter can be useful and is worth the extra work.
>> > >> >>
>> > >> >> Speaking of that, it would be interesting to see how a test where you
>> > >> >> write the query as IN(VALUES(...)) instead of IN() compares. It would
>> > >> >> be interesting to know if the planner is able to make a more suitable
>> > >> >> choice and also to see how all the work over the years to improve Hash
>> > >> >> Joins compares to the bsearch with and without the bloom filter.
>> > >> >
>> > >> >It would be interesting.
>> > >> >
>> > >> >It also makes one wonder about optimizing these into hash
>> > >> >joins...which I'd thought about over at [1]. I think it'd be a very
>> > >> >significant effort though.
>> > >> >
>> > >>
>> > >> I modified the script to also do the join version of the query. I can
>> > >> only run it on my laptop at the moment, so the results may be a bit
>> > >> different from those I shared before, but it's interesting I think.
>> > >>
>> > >> In most cases it's comparable to the binsearch/bloom approach, and in
>> > >> some cases it actually beats them quite significantly. It seems to
>> > >> depend on how expensive the comparison is - for "int" the comparison is
>> > >> very cheap and there's almost no difference. For "text" the comparison
>> > >> is much more expensive, and there are significant speedups.
>> > >>
>> > >> For example, for the test with 100k lookups in an array of 10k
>> > >> elements and a 10% match probability, the timings are these:
>> > >>
>> > >> master: 62362 ms
>> > >> binsearch: 201 ms
>> > >> bloom: 65 ms
>> > >> hashjoin: 36 ms
>> > >>
>> > >> I do think the explanation is fairly simple - the bloom filter
>> > >> eliminates about 90% of the expensive comparisons, so it's 20ms plus
>> > >> some overhead to build and check the bits. The hash join probably
>> > >> eliminates a lot of the remaining comparisons, because the hash table
>> > >> is sized to have one tuple per bucket.
>> > >>
>> > >> Note: I also don't claim the PoC has the most efficient bloom filter
>> > >> implementation possible. I'm sure it could be made faster.
>> > >>
>> > >> Anyway, I'm not sure transforming this to a hash join is worth the
>> > >> effort - I agree that seems quite complex. But perhaps this suggests we
>> > >> should not be doing binary search and instead just build a simple hash
>> > >> table - that seems much simpler, and it'll probably give us about the
>> > >> same benefits.
>> > >
>> > >That's actually what I originally thought about doing, but I chose
>> > >binary search since it seemed a lot easier to get off the ground.
>> > >
>> >
>> > OK, that makes perfect sense.
>> >
>> > >If we instead build a hash is there anything else we need to be
>> > >concerned about? For example, work mem? I suppose for the binary
>> > >search we already have to expand the array, so perhaps it's not all
>> > >that meaningful relative to that...
>> > >
>> >
>> > I don't think we need to be particularly concerned about work_mem. We
>> > don't care about it now, and it's not clear to me what we could do about
>> > it - we already have the array in memory anyway, so it's a bit futile.
>> > Furthermore, if we need to care about it, it probably applies to the
>> > binary search too.
>> >
>> > >I was looking earlier at what our standard hash implementation was,
>> > >and it seemed less obvious what was needed to set that up (so binary
>> > >search seemed a faster proof of concept). If you happen to have any
>> > >pointers to similar usages I should look at, please let me know.
>> > >
>> >
>> > I think the hash join implementation is far too complicated. It has to
>> > care about work_mem, so it implements batching, etc. That's a lot of
>> > complexity we don't need here. IMO we could use either the usual
>> > dynahash, or maybe even the simpler simplehash.
>> >
>> > FWIW it'd be good to verify the numbers I shared, i.e. checking that the
>> > benchmarks make sense and running it independently. I'm not aware of
>> > any issues but it was done late at night and only ran on my laptop.
>>
>> Some quick calculations (I don't have the scripting in a form I can
>> attach yet; I'm using this as an opportunity to hack on a genericized
>> performance testing framework of sorts) suggest your results are
>> correct. I was also testing on my laptop, and I saw roughly
>> equivalent results for IN (VALUES ...) and IN (<list>) for integers,
>> but when I switched to (short; average 3 characters long) text values
>> the hash join on VALUES was twice as fast as the binary search.
>>
>> Given that, I'm planning to implement this as a hash lookup and share
>> a revised patch.
>
>I've attached a patch series as before, but with an additional patch
>that switches to using dynahash instead of binary search.
>

OK. I can't take a closer look at the moment; I'll check later.

Any particular reason to pick dynahash over simplehash? ISTM we're
using simplehash elsewhere in the executor (grouping, tidbitmap, ...),
while there are not many places using dynahash for simple short-lived
hash tables. Of course, that alone is a weak reason to insist on
simplehash here, but I suppose there were reasons for not using dynahash
in those places, and we may well end up facing the same issues here.
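
To be concrete, the kind of thing I had in mind is roughly this - an
untested sketch with invented names (SaopHashEntry, saophash,
build_saop_hash, ...) and a placeholder hash function, restricted to
int4 keys to keep it short, not code from your patch:

#include "postgres.h"

/* element type; simplehash requires the "status" member */
typedef struct SaopHashEntry
{
    int32       key;            /* the array element */
    char        status;         /* used internally by simplehash */
} SaopHashEntry;

static inline uint32
saop_hash_int32(int32 key)
{
    /* placeholder; the real thing would call the type's hash proc */
    return (uint32) key * 2654435761u;
}

#define SH_PREFIX           saophash
#define SH_ELEMENT_TYPE     SaopHashEntry
#define SH_KEY_TYPE         int32
#define SH_KEY              key
#define SH_HASH_KEY(tb, k)  saop_hash_int32(k)
#define SH_EQUAL(tb, a, b)  ((a) == (b))
#define SH_SCOPE            static inline
#define SH_DECLARE
#define SH_DEFINE
#include "lib/simplehash.h"

/* build the table once from the deconstructed array ... */
static saophash_hash *
build_saop_hash(int32 *elem_values, int nitems)
{
    saophash_hash *tab = saophash_create(CurrentMemoryContext,
                                         nitems, NULL);

    for (int i = 0; i < nitems; i++)
    {
        bool        found;

        saophash_insert(tab, elem_values[i], &found);
    }

    return tab;
}

/* ... then probe it per row instead of searching the array */
static bool
saop_hash_match(saophash_hash *tab, int32 scalar)
{
    return saophash_lookup(tab, scalar) != NULL;
}

The real thing obviously needs Datum keys, NULL handling and the type's
hash/equality support functions - this is only meant to show the basic
shape.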

>Whereas before the benchmarking ended up with a trimodal distribution
>(i.e., master with IN <list>, patch with IN <list>, and either with IN
>VALUES), the hash implementation brings us back to an effectively
>bimodal distribution -- though the hash scalar array op expression
>implementation for text is about 5% faster than the hash join.
>

Nice. I'm not surprised this is a bit faster than hash join, which has
to worry about additional stuff.

>Current outstanding thoughts (besides comment/naming cleanup):
>
>- The saop costing needs to be updated to match, as Tomas pointed out.
>
>- Should we be concerned about single execution cases? For example, is
>the speed regression on a simple SELECT x IN something we should
>try to avoid by only kicking in the optimization if we execute in a
>loop at least twice? That might be of particular interest to pl/pgsql.
>

I don't follow. How is this related to the number of executions and/or
plpgsql? I suspect you might be talking about prepared statements, but
surely the hash table is built for each execution anyway, even if the
plan is reused, right?

I think the issue we've been talking about is considering the number of
lookups we expect to do in the array/hash table. But that has nothing to
do with plpgsql and/or multiple executions ...
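
If we wanted to be smarter about when to build the table at all, that
would be a costing decision along these purely illustrative lines (the
function name and cost constants are invented, nothing from the patch):

/* illustrative heuristic only: is building the hash table worth it? */
static bool
saop_hash_worthwhile(double expected_lookups, int nitems)
{
    /* a linear search scans about half the array per lookup on average */
    double      linear_cost = expected_lookups * nitems * 0.5;

    /* one insert per element to build, then one probe per lookup */
    double      hash_cost = (double) nitems + expected_lookups;

    return hash_cost < linear_cost;
}

But again, the input to that decision is the expected number of lookups,
not the number of times the plan gets executed.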

>- Should we have a test for an operator with a non-strict function?
>I'm not aware of any built-in ops that have that characteristic; would
>you suggest just creating a fake one for the test?
>

Dunno, I haven't thought about this very much. In general I think the
new code should simply behave the same as the current code, i.e. if it
does not check for strictness, we don't need to either.
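
FWIW the kind of strictness check in question is roughly this shape - a
schematic helper with an invented name (saop_call_element), not the
actual executor code:

#include "postgres.h"
#include "fmgr.h"

/*
 * Call the comparison function for one array element, honoring
 * strictness: a strict function is never called with a NULL argument.
 */
static Datum
saop_call_element(FunctionCallInfo fcinfo, Datum elt, bool elt_isnull,
                  bool strictfunc, bool *result_isnull)
{
    Datum       result;

    if (elt_isnull && strictfunc)
    {
        /* strict function + NULL input => NULL result, skip the call */
        *result_isnull = true;
        return (Datum) 0;
    }

    fcinfo->args[1].value = elt;
    fcinfo->args[1].isnull = elt_isnull;
    fcinfo->isnull = false;

    result = FunctionCallInvoke(fcinfo);
    *result_isnull = fcinfo->isnull;

    return result;
}

Whatever the current per-element loop does there, the hashed version
should simply mirror it.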

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
