Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?

From: Dmitry Astapov <dastapov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?
Date: 2021-05-12 17:56:00
Message-ID: CAFQUnFiQJJctiZd_vO5Tt8yN=H9Pwpg4AfzcyRCCmEz6jnmAvQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, May 12, 2021 at 4:54 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Dmitry Astapov <dastapov(at)gmail(dot)com> writes:
> > I am trying to understand the behaviour of the query planner regarding
> the
> > push-down of the conditions "through" the join.
>
> I think your mental model is wrong. What's actually happening here is
> that the planner uses equivalence classes to deduce implied conditions.
> That is, we have the join condition a.adate = b.bdate and then you've
> added the where condition a.adate = '2021-05-12'. Transitivity implies
> that b.bdate = '2021-05-12', so we deduce that condition and are able
> to apply it at the relation scan of b. Furthermore, having restricted
> both a.adate and b.bdate to the same constant value at the scan level,
> we no longer need to apply the join condition a.adate = b.bdate at all.
> This is important not only to avoid the (probably minor) inefficiency
> of rechecking the join condition, but because if we believed that all
> three conditions were independently applicable, we'd come out with a
> serious underestimate of the size of the join result.
>

Thank you very much, my mental model was indeed incorrect, and the above is
very helpful.
Am I right in thinking that elimination the join condition is actually
quite important part of the process?
Could it possibly be the main reason for =ANY/(x IN (..)) not to be
optimized the same way?

>
> > In my experiments, I was never able to get an execution plan that "pushes
> > down" any condition apart from (=) through to the right side of the join,
>
> None of the argument sketched above works for non-equality conditions.
> There are some situations where you could probably figure out how to
> use transitivity to deduce some implied condition, but cleaning things
> up so that you don't have redundant conditions fouling up the join
> size estimates seems like a hard problem.
>

I agree about inequality conditions, this problem seems to be rather hard
to tackle in the general case.

Is it still hard when one thinks about =ANY or (column in (val1, val2,
val3, ...)) as well?
I am thinking that =ANY would be a decent workaround for (x BETWEEN a AND
b) in quite a lot of cases, if it was propagated to all the columns in the
equivalence class.

> > Equally surprising is that I was unable to find documentation or past
> > mailing list discussions of this or similar topic, which leads me to
> > believe that I am just not familiar with the proper terminology and can't
> > come up with the right search terms.
>
> src/backend/optimizer/README has a discussion of equivalence classes.
>
Thank you, this gives me a plethora of keywords for further searches.

I realize that it is possibly off-topic here, but what about workarounds
for inequality constraints, joins and views? Maybe you could give me some
pointers here as well?

My tables are large to huge (think OLAP, not OLTP). I found out when I have
a view that joins several (2 to 10) tables on the column that is
semantically the same in all of them (let's say it is ID and we join on
ID), I do not have many avenues to efficiently select from such view for a
list of IDs at the same time.

I could:
1) Do lots of fast queries and union them:
select * from vw where id=ID1 union all select * from vw where id=ID2
....., which is only really feasible if the query is generated by the
program

2)expose all ID columns from all the tables used in the view body and do:
select * from vw where id=ANY() and id1=ANY() and id2=ANY() and id3=ANY()
.....
This only works well if the view hierarchy is flat (no views on views). If
there are other views that use this use, re-exports of extra columns
quickly snowballs, you might need column renaming if same view ends up
being used more than once through two different dependency paths. Plus
people not familiar with the problem tend to omit "clearly superfluous"
columns from the new views they build on top.

3)forbid views that join tables larger than a certain size/dismantle views
that become inefficient (this only works if the problem is detected fast
enough and the view did not become popular yet)

So all of the workarounds I see in front of me right now are somewhat sad,
but they are necessary, as not doing them means that queries would take
hours or days instead of minutes.

Is there anything better that I have not considered in terms of workarounds?

--
D. Astapov

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2021-05-12 18:24:43 Re: Some other CLOBBER_CACHE_ALWAYS culprits
Previous Message Bruce Momjian 2021-05-12 17:39:39 Re: PG 14 release notes, first draft