Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: James Coleman <jtc331(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays
Date: 2020-04-25 12:40:24
Message-ID: 20200425124024.hsv7z6bia752uymz@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Apr 25, 2020 at 12:21:06AM +0200, Tomas Vondra wrote:
>On Thu, Apr 23, 2020 at 04:55:51PM +0200, Tomas Vondra wrote:
>>On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote:
>>>On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra
>>><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>>On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote:
>>>>>Over in "execExprInterp() questions / How to improve scalar array op
>>>>>expr eval?" [1] I'd mused about how we might be able to optimized
>>>>>scalar array ops with OR'd semantics.
>>>>>This patch implements a binary search for such expressions when the
>>>>>array argument is a constant so that we can avoid needing to teach
>>>>>expression execution to cache stable values or know when a param has
>>>>>The speed-up for the target case can pretty impressive: in my
>>>>>admittedly contrived and relatively unscientific test with a query in
>>>>>the form:
>>>>>select count(*) from generate_series(1,100000) n(i) where i in (<1000
>>>>>random integers in the series>)
>>>>>shows ~30ms for the patch versus ~640ms on master.
>>>>Nice improvement, although 1000 items is probably a bit unusual. The
>>>>threshold used in the patch (9 elements) seems a bit too low - what
>>>>results have you seen with smaller arrays?
>>>At least in our systems we regularly work with 1000 batches of items,
>>>which means you get IN clauses of identifiers of that size. Admittedly
>>>the most common case sees those IN clauses as simple index scans
>>>(e.g., WHERE <primary key> IN (...)), but it's also common to have a
>>>broader query that merely filters additionally on something like "...
>>>AND <some foreign key> IN (...)" where it makes sense for the rest of
>>>the quals to take precedence in generating a reasonable plan. In that
>>>case, the IN becomes a regular filter, hence the idea behind the
>>>Side note: I'd love for us to be able to treat "IN (VALUES)" the same
>>>way...but as noted in the other thread that's an extremely large
>>>amount of work, I think. But similarly you could use a hash here
>>>instead of a binary search...but this seems quite good.
>>>As to the choice of 9 elements: I just picked that as a starting
>>>point; Andres had previously commented off hand that at 8 elements
>>>serial scanning was faster, so I figured this was a reasonable
>>>starting point for discussion.
>>>Perhaps it would make sense to determine that minimum not as a pure
>>>constant but scaled based on how many rows the planner expects us to
>>>see? Of course that'd be a more invasive may or may not be
>>>as feasible as a reasonable default.
>>Not sure. That seems a bit overcomplicated and I don't think it depends
>>on the number of rows the planner expects to see very much. I think we
>>usually assume the linear search is cheaper for small arrays and then at
>>some point the binary search starts winning The question is where this
>>"break even" point is.
>>I think we usually use something like 64 or so in other places, but
>>maybe I'm wrong. The current limit 9 seems a bit too low, but I may be
>>wrong. Let's not obsess about this too much, let's do some experiments
>>and pick a value based on that.
>>>>Another idea - would a bloom filter be useful here, as a second
>>>>optimization? That is, for large arrays build s small bloom filter,
>>>>allowing us to skip even the binary search.
>>>That's an interesting idea. I actually haven't personally worked with
>>>bloom filters, so didn't think about that.
>>>Are you thinking that you'd also build the filter *and* presort the
>>>array? Or try to get away with using only the bloom filter and not
>>>expanding and sorting the array at all?
>>Yeah, something like that. My intuition is the bloom filter is useful
>>only above some number of items, and the number is higher than for the
>>binary search. So we'd end up with two thresholds, first one enabling
>>binary search, the second one enabling bloom filter.
>>Of course, the "unknown" variable here is how often we actually find the
>>value in the array. If 100% of the queries has a match, then the bloom
>>filter is a waste of time. If there are no matches, it can make a
>>significant difference.
>I did experiment with this is a bit, both to get a bit more familiar
>with this code and to see if the bloom filter might help. The short
>answer is the bloom filter does not seem to help at all, so I wouldn't
>bother about it too much.
>Attacched is an updated patch series and, script I used to collect some
>performance measurements, and a spreadsheet with results. The patch
>series is broken into four parts:
> 0001 - the original patch with binary search
> 0002 - adds GUCs to enable bin search / tweak threshold
> 0003 - allows to use bloom filter + binary search
> 0004 - try using murmurhash
>The test script runs a wide range of queries with different number
>of lookups, keys in the array, match probability (i.e. fraction of
>lookups that find a match) ranging from 1% to 100%. And of course, it
>runs this with the binsearch/bloom either enabled or disabled (the
>threshold was lowered to 1, so it's the on/off GUCs that determine
>whether the binsearch/bloom is used).
>The results are summarized in the spreadsheet, demonstrating how useless
>the bloom filter is. There's not a single case where it would beat the
>binary search. I believe this is because theaccess to bloom filter is
>random (determined by the hash function) and we don't save much compared
>to the log(K) lookups in the sorted array.
>That makes sense, I think the bloom filters are meant to be used in
>cases when the main data don't fit into memory - which is not the case
>here. But I wonder how would this change for cases with more expensive
>comparisons - this was using just integers, so maybe strings would
>result in different behavior.

OK, I tried the same test with text columns (with md5 strings), and the
results are about as I predicted - the bloom filter actually makes a
difference in this case. Depending on the number of lookups and
selectivity (i.e. how many lookups have a match in the array) it can
mean additional speedup up to ~5x compared to binary search alone.

For the case with 100% selectivity (i.e. all rows have a match) this
can't really save any time - it's usually still much faster than master,
but it's a bit slower than binary search.

So I think this might be worth investigating further, once the simple
binary search gets committed. We'll probably need to factor in the cost
of the comparison (higher cost -> BF more useful), selectivity of the
filter (fewer matches -> BF more useful) and number of lookups.

This reminds me our attempts to add bloom filters to hash joins, which I
think ran into mostly the same challenge of deciding when the bloom
filter can be useful and is worth the extra work.


Tomas Vondra
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size application/x-sh 9.5 KB
saop-expensive-cmp.ods application/vnd.oasis.opendocument.spreadsheet 104.9 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2020-04-25 12:59:07 Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays
Previous Message David Rowley 2020-04-25 11:24:48 Re: Parallel Append can break run-time partition pruning