Skip site navigation (1) Skip section navigation (2)

Re: Regression in IN( field, field, field ) performance

From: Decibel! <decibel(at)decibel(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Regression in IN( field, field, field ) performance
Date: 2008-10-25 05:32:30
Message-ID: D28FCC7D-D2EA-441D-8921-8214EC8B3BC7@decibel.org (view raw or flat)
Thread:
Lists: pgsql-hackers
On Oct 23, 2008, at 11:16 AM, Tom Lane wrote:
> Gregory Stark <stark(at)enterprisedb(dot)com> writes:
>> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>>> Works fine for me, eg
>
>> I think he's looking for something like:
>>  5 IN (col1,col2,col3)
>> resulting in a bitmap or of three index scans of three different  
>> indexes on
>> col1, col2, and col3.
>
> Ah, I see.  It would be easy to make transformAExprIn() generate an OR
> tree instead of = ANY(ARRAY[]), if we could figure out the conditions
> where an OR tree is superior.  I'm not sure it's easy to tell though.
> Is it sufficient to do this when there are Vars on the right side and
> none on the left?

There's 6 cases here, in a 2x3 array. In one dimension, the LHS can  
be either a Var or a fixed value. In the other dimension, the three  
possibilities are 1: everything on the RHS is a fixed value, 2: some  
fixed, some not, 3: everything on the RHS is a variable:

                   1          2         3
               ------ Right Hand Side -------
A: LHS fixed  All fixed   Mixture   All var.
B: LHS var.   All fixed   Mixture   All var.

For A2 and A3, an OR is probably best. There's no way I can think of  
to optimize A3 with an array, and with A2 you could get lucky and hit  
something like 1 = 1. Hopefully the planner would check all the fixed  
cases first.

For A1, an array might be best; it depends on if it's cheaper to  
build a huge OR clause and evaluate, or to iterate through the array,  
and that could depend on the number of terms.

B1 might actually be similar to A1... was testing done to see if ORs  
were faster for a small number of elements?

For B3, the only use-case I can think of is comparing fields within a  
record, and I can't see that resulting in a really large number of  
terms (which would presumabbly favor an array). But if you turned it  
into ORs, the planner could decide that it's better to use an index  
on some/all of the terms on the RHS. That could end up being far  
faster than using an array. An example would be field_in_small_table  
IN ( field_a_in_large_table, field_b_in_large_table,  
field_c_in_large_table ).

One final note: A2 and B2 could be treated as a combination. Treat  
all the RHS fixed values as you would A1/B1, treat all the RHS  
variables as you would A3/B3, and OR the results.

Ideally, the planner would understand the costs associated with how  
many terms are involved and would act accordingly. But I don't know  
that we can make it accurate enough to do that.

I think that the A3 and B3 cases should always be OR'd. Treating as  
an array just ties the planner's hands too much.

Presumably A1/B1 should be done with arrays, otherwise we wouldn't  
have moved away from ORs to begin with.

That leaves the mixed RHS case. If it's cheap to just split things  
into two piles (fixed RHS vs variable RHS) then that's probably the  
way to go. Ideally, each condition would then be estimated  
separately, and the executor would favor executing the cheaper one  
first.
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828


In response to

Responses

pgsql-hackers by date

Next:From: Decibel!Date: 2008-10-25 05:38:10
Subject: Re: minimal update
Previous:From: Tom LaneDate: 2008-10-25 05:18:34
Subject: Re: Handling NULL records in plpgsql

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group