Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From: James Coleman <jtc331(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays
Date: 2020-04-24 13:38:54
Message-ID: CAAaqYe8bAfECVwRVG8YdKbQ3yuQhxCytByM4Ln_vbJcA8BWQrw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Apr 23, 2020 at 10:55 AM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote:
> >On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra
> ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >>
> >> On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote:
> >> >Over in "execExprInterp() questions / How to improve scalar array op
> >> >expr eval?" [1] I'd mused about how we might be able to optimized
> >> >scalar array ops with OR'd semantics.
> >> >
> >> >This patch implements a binary search for such expressions when the
> >> >array argument is a constant so that we can avoid needing to teach
> >> >expression execution to cache stable values or know when a param has
> >> >changed.
> >> >
> >> >The speed-up for the target case can pretty impressive: in my
> >> >admittedly contrived and relatively unscientific test with a query in
> >> >the form:
> >> >
> >> >select count(*) from generate_series(1,100000) n(i) where i in (<1000
> >> >random integers in the series>)
> >> >
> >> >shows ~30ms for the patch versus ~640ms on master.
> >> >
> >>
> >> Nice improvement, although 1000 items is probably a bit unusual. The
> >> threshold used in the patch (9 elements) seems a bit too low - what
> >> results have you seen with smaller arrays?
> >
> >At least in our systems we regularly work with 1000 batches of items,
> >which means you get IN clauses of identifiers of that size. Admittedly
> >the most common case sees those IN clauses as simple index scans
> >(e.g., WHERE <primary key> IN (...)), but it's also common to have a
> >broader query that merely filters additionally on something like "...
> >AND <some foreign key> IN (...)" where it makes sense for the rest of
> >the quals to take precedence in generating a reasonable plan. In that
> >case, the IN becomes a regular filter, hence the idea behind the
> >patch.
> >
> >Side note: I'd love for us to be able to treat "IN (VALUES)" the same
> >way...but as noted in the other thread that's an extremely large
> >amount of work, I think. But similarly you could use a hash here
> >instead of a binary search...but this seems quite good.
> >
> >As to the choice of 9 elements: I just picked that as a starting
> >point; Andres had previously commented off hand that at 8 elements
> >serial scanning was faster, so I figured this was a reasonable
> >starting point for discussion.
> >
> >Perhaps it would make sense to determine that minimum not as a pure
> >constant but scaled based on how many rows the planner expects us to
> >see? Of course that'd be a more invasive patch...so may or may not be
> >as feasible as a reasonable default.
> >
>
> Not sure. That seems a bit overcomplicated and I don't think it depends
> on the number of rows the planner expects to see very much. I think we
> usually assume the linear search is cheaper for small arrays and then at
> some point the binary search starts winning The question is where this
> "break even" point is.

Well since it has to do preprocessing work (expanding the array and
then sorting it), then the number of rows processed matters, right?
For example, doing a linear search on 1000 items only once is going to
be cheaper than preprocessing the array and then doing a binary
search, but only a very large row count the binary search +
preprocessing might very well win out for only a 10 element array.

I'm not trying to argue for more work for myself here: I think the
optimization is worth it on its own, and something like this could be
a further improvement on its own. But it is interesting to think
about.

James

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-04-24 14:07:17 Re: proposal - plpgsql - all plpgsql auto variables should be constant
Previous Message tushar 2020-04-24 13:03:03 Re: [Proposal] Global temporary tables