Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From: James Coleman <jtc331(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays
Date: 2021-03-19 20:40:59
Message-ID: CAAaqYe8zdhuwWdhdKcg9n2_SfLe2dHZfpyfymXgGGP9mHgt5uQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Sep 11, 2020 at 5:11 PM James Coleman <jtc331(at)gmail(dot)com> wrote:
>
> On Tue, Sep 8, 2020 at 4:37 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> >
> > On 08/09/2020 22:25, James Coleman wrote:
> > > On Wed, Aug 19, 2020 at 3:16 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> > >>
> > >> You could also apply the optimization for non-Const expressions, as long
> > >> as they're stable (i.e. !contain_volatile_functions(expr)).
> > >
> > > Is that true? Don't we also have to worry about whether or not the
> > > value is stable (i.e., know when a param has changed)? There have been
> > > discussions about being able to cache stable subexpressions, and my
> > > understanding was that we'd need to have that infrastructure (along
> > > with the necessarily invalidation when the param changes) to be able
> > > to use this for non-Const expressions.
> >
> > Yeah, you're right, you'd have to also check for PARAM_EXEC Params. And
> > Vars. I think the conditions are the same as those checked in
> > match_clause_to_partition_key() in partprune.c (it's a long function,
> > search for "if (!IsA(rightop, Const))"). Not sure it's worth the
> > trouble, then. But it would be nice to handle queries like "WHERE column
> > = ANY ($1)"
>
> If I'm understanding properly you're imagining something in the form of:
>
> with x as (select '{1,2,3,4,5,6,7,8,9,10}'::int[])
> select * from t where i = any ((select * from x)::int[]);
>
> I've been playing around with this and currently have these checks:
>
> contain_var_clause((Node *) arrayarg)
> contain_volatile_functions((Node *) arrayarg)
> contain_exec_param((Node *) arrayarg, NIL)
>
> (the last one I had to modify the function to handle empty lists)
>
> If any are true, then have to disable the optimization. But for
> queries in the form above the test contain_exec_param((Node *)
> arrayarg, NIL) evaluates to true, even though we know from looking at
> the query that the array subexpression is stable for the length of the
> query.
>
> Am I misunderstanding what you're going for? Or is there a way to
> confirm the expr, although an exec param, won't change?
>
> Another interesting thing that this would imply is that we'd either have to:
>
> 1. Remove the array length check altogether,
> 2. Always use the hash when have a non-Const, but when a Const only if
> the array length check passes, or
> 3. Make this new expr op more fully featured by teaching it how to use
> either a straight loop through a deconstructed array or use the hash.
>
> That last option feeds into further discussion in the below:
>
> > >> Deconstructing the array Datum into a simple C array on first call would
> > >> be a win even for very small arrays and for AND semantics, even if you
> > >> don't use a hash table.
> > >
> > > Because you wouldn't have to repeatedly detoast it? Or some other
> > > reason I'm not thinking of? My intuition would have been that (aside
> > > from detoasting if necessary) there wouldn't be much real overhead in,
> > > for example, an array storing integers.
> >
> > Dealing with NULLs and different element sizes in the array is pretty
> > complicated. Looping through the array currently looks like this:
> >
> > /* Loop over the array elements */
> > s = (char *) ARR_DATA_PTR(arr);
> > bitmap = ARR_NULLBITMAP(arr);
> > bitmask = 1;
> >
> > for (int i = 0; i < nitems; i++)
> > {
> > Datum elt;
> > Datum thisresult;
> >
> > /* Get array element, checking for NULL */
> > if (bitmap && (*bitmap & bitmask) == 0)
> > {
> > fcinfo->args[1].value = (Datum) 0;
> > fcinfo->args[1].isnull = true;
> > }
> > else
> > {
> > elt = fetch_att(s, typbyval, typlen);
> > s = att_addlength_pointer(s, typlen, s);
> > s = (char *) att_align_nominal(s, typalign);
> > fcinfo->args[1].value = elt;
> > fcinfo->args[1].isnull = false;
> > }
> >
> > [do stuff with Datum/isnull]
> >
> > /* advance bitmap pointer if any */
> > if (bitmap)
> > {
> > bitmask <<= 1;
> > if (bitmask == 0x100)
> > {
> > bitmap++;
> > bitmask = 1;
> > }
> > }
> > }
> >
> > Compared with just:
> >
> > for (int i = 0; i < nitems; i++)
> > {
> > Datum elt = datums[i];
> >
> > [do stuff with the Datum]
> > }
> >
> > I'm not sure how much difference that makes, but I presume it's not
> > zero, and it seems like an easy win when you have the code to deal with
> > the Datum array representation anyway.
>
> Doing this would necessitate option 3 above: we'd have to have this
> new expr op be able both to use a hash or alternatively do a normal
> loop.
>
> Being able to use this in more cases than just a Const array expr is
> certainly interesting, but I'm not sure yet about the feasibility or
> desirability of that at this point given the above restrictions.
>
> One other point in favor of the additional complexity here is that
> it's likely that the above described runtime switching between hash
> and loop would be necessary (for this optimization to come into play)
> if caching of stable subexpressions ever lands. I have some interest
> in working on that...but it's also a large project.

I've attached a cleaned up patch. Last CF it was listed in is
https://commitfest.postgresql.org/29/2542/ -- what's the appropriate
step to take here given it's an already existing patch, but not yet
moved into recent CFs?

Thanks,
James

Attachment Content-Type Size
v4-0001-Hash-lookup-const-arrays-in-OR-d-ScalarArrayOps.patch application/octet-stream 18.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message John Naylor 2021-03-19 20:49:43 Re: Speeding up GIST index creation for tsvectors
Previous Message Andres Freund 2021-03-19 20:40:46 Re: [HACKERS] Custom compression methods