Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From: James Coleman <jtc331(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays
Date: 2020-09-11 21:11:34
Message-ID: CAAaqYe-7JEece1P8WOxN6Vjz-SPY_8p_nKhVxtN+9LqNdmMdrw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Sep 8, 2020 at 4:37 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> On 08/09/2020 22:25, James Coleman wrote:
> > On Wed, Aug 19, 2020 at 3:16 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> >>
> >> You could also apply the optimization for non-Const expressions, as long
> >> as they're stable (i.e. !contain_volatile_functions(expr)).
> >
> > Is that true? Don't we also have to worry about whether or not the
> > value is stable (i.e., know when a param has changed)? There have been
> > discussions about being able to cache stable subexpressions, and my
> > understanding was that we'd need to have that infrastructure (along
> > with the necessarily invalidation when the param changes) to be able
> > to use this for non-Const expressions.
>
> Yeah, you're right, you'd have to also check for PARAM_EXEC Params. And
> Vars. I think the conditions are the same as those checked in
> match_clause_to_partition_key() in partprune.c (it's a long function,
> search for "if (!IsA(rightop, Const))"). Not sure it's worth the
> trouble, then. But it would be nice to handle queries like "WHERE column
> = ANY ($1)"

If I'm understanding properly you're imagining something in the form of:

with x as (select '{1,2,3,4,5,6,7,8,9,10}'::int[])
select * from t where i = any ((select * from x)::int[]);

I've been playing around with this and currently have these checks:

contain_var_clause((Node *) arrayarg)
contain_volatile_functions((Node *) arrayarg)
contain_exec_param((Node *) arrayarg, NIL)

(the last one I had to modify the function to handle empty lists)

If any are true, then have to disable the optimization. But for
queries in the form above the test contain_exec_param((Node *)
arrayarg, NIL) evaluates to true, even though we know from looking at
the query that the array subexpression is stable for the length of the
query.

Am I misunderstanding what you're going for? Or is there a way to
confirm the expr, although an exec param, won't change?

Another interesting thing that this would imply is that we'd either have to:

1. Remove the array length check altogether,
2. Always use the hash when have a non-Const, but when a Const only if
the array length check passes, or
3. Make this new expr op more fully featured by teaching it how to use
either a straight loop through a deconstructed array or use the hash.

That last option feeds into further discussion in the below:

> >> Deconstructing the array Datum into a simple C array on first call would
> >> be a win even for very small arrays and for AND semantics, even if you
> >> don't use a hash table.
> >
> > Because you wouldn't have to repeatedly detoast it? Or some other
> > reason I'm not thinking of? My intuition would have been that (aside
> > from detoasting if necessary) there wouldn't be much real overhead in,
> > for example, an array storing integers.
>
> Dealing with NULLs and different element sizes in the array is pretty
> complicated. Looping through the array currently looks like this:
>
> /* Loop over the array elements */
> s = (char *) ARR_DATA_PTR(arr);
> bitmap = ARR_NULLBITMAP(arr);
> bitmask = 1;
>
> for (int i = 0; i < nitems; i++)
> {
> Datum elt;
> Datum thisresult;
>
> /* Get array element, checking for NULL */
> if (bitmap && (*bitmap & bitmask) == 0)
> {
> fcinfo->args[1].value = (Datum) 0;
> fcinfo->args[1].isnull = true;
> }
> else
> {
> elt = fetch_att(s, typbyval, typlen);
> s = att_addlength_pointer(s, typlen, s);
> s = (char *) att_align_nominal(s, typalign);
> fcinfo->args[1].value = elt;
> fcinfo->args[1].isnull = false;
> }
>
> [do stuff with Datum/isnull]
>
> /* advance bitmap pointer if any */
> if (bitmap)
> {
> bitmask <<= 1;
> if (bitmask == 0x100)
> {
> bitmap++;
> bitmask = 1;
> }
> }
> }
>
> Compared with just:
>
> for (int i = 0; i < nitems; i++)
> {
> Datum elt = datums[i];
>
> [do stuff with the Datum]
> }
>
> I'm not sure how much difference that makes, but I presume it's not
> zero, and it seems like an easy win when you have the code to deal with
> the Datum array representation anyway.

Doing this would necessitate option 3 above: we'd have to have this
new expr op be able both to use a hash or alternatively do a normal
loop.

Being able to use this in more cases than just a Const array expr is
certainly interesting, but I'm not sure yet about the feasibility or
desirability of that at this point given the above restrictions.

One other point in favor of the additional complexity here is that
it's likely that the above described runtime switching between hash
and loop would be necessary (for this optimization to come into play)
if caching of stable subexpressions ever lands. I have some interest
in working on that...but it's also a large project.

James

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ranier Vilela 2020-09-11 21:43:48 Re: Simplified version of read_binary_file (src/backend/utils/adt/genfile.c)
Previous Message Tom Lane 2020-09-11 20:51:50 Re: factorial function/phase out postfix operators?