Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?

From: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dmitry Astapov <dastapov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?
Date: 2022-02-11 10:37:17
Message-ID: CAKU4AWpOqbhOwnxZS80DZWrc7daqAmMV7aKr7EDg3jkZcCUPOw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

>
>>
> So I think knowing what bad it is to have this feature is the key point to
>> discussion now.
>>
>>
I re-read the discussion at 2015 [1] and the below topic is added for the
above
question. Here is the summary for easy discussion.

====
From planner aspect:

> While I've only read your description of the patch not the patch itself,
> the search methodology you propose seems pretty brute-force and
> unlikely to solve that issue. It's particularly important to avoid
O(N^2)
> behaviors when there are N expressions ...

The patch has 3 steps in general. 1). Gather the filter_qual_list during
the deconstruct_jointree. only unmergeable qual is gathered here.
2). After the root->eq_classes is built, scan each of the above quals to
find out if there is a EC match, if yes, add it to the EC. There are
some fast paths here. like ec->relids, em->em_relids. 3). compose
the qual in ec_filter and members in ec_members, then distribute it to
the relations. This step take the most cycles of this feature, and it is
the most important part for this feature as well.

Fortunately, thousands of partitions of a table would not make it worse
since they are not generated at that stage. So I'd believe the number of
ECs or EMs in an EC would be pretty small in common cases.

> time would be spent on searches for matching subexpressions whether
> or not anything was learned (and often nothing would be learned).

This is about some cases like "SELECT * FROM t1, t2 WHERE t1.a = t2.a
and t1.b > 3". In this case, we still need to go through steps 1 & 2,
all the fast
paths don't work and the equal() is unavoidable. However step 3 can be
ignored.
If we want to improve this, could we maintain an attr_eq_indexes in
RelOptInfos
which indicates if the given attribute appears in any one of EC members?

=====
From executor aspects:

> The reason why the answer isn't automatically "all of them"
> is because, first of all, it's possible that enforcing the condition
> at a particular table costs more to filter out the rows that we save
> in execution time at higher levels of the plan tree. For example,
> consider A JOIN B ON A.X = B.X WHERE A.X > 1000000. It might be that
> the range of A.X is [0,1000001] but the range of B.X is
> [1000000,2000000]; so enforcing the inequality against A is very
> selective but enforcing it against B filters out basically nothing.

I think we can classify this as we push down / execute an qual, the
qual takes lots of cycles, but it doesn't filter many rows.

> A first cut might be to enforce the inequality against the relation
> where it's believed to be most selective, equivalence-class column
> mentioned in the inequality provided that the
> selectivity is thought to be above some threshold ... but I'm not sure
> this is very principled,

I can only input +1 after some deep thoughts.

>> Furthermore, there are some cases involving parameterized paths where
>> enforcing the inequality multiple times is definitely bad: for
>> example, if we've got a nested loop where the outer side is a seq scan
>> that enforces the condition and the inner side is an index probe, it
>> is just a waste to retest it on the inner side. We already know that
>> the outer row passes the inequality, so the inner row will necessarily
>> pass also. This doesn't apply to merge or hash joins, and it also
>> doesn't apply to all nested loops: scans that aren't paramaterized by
>> the equivalence-class column can still benefit from separate
>> enforcement of the inequality.
>>
> I guess that could be fixed by somehow marking these pushed quals as
> optional and having parameterised scans ignore optional quals.

This has been done by committing 4.

> Now, all that having been said, I think this is a pretty important
> optimization. Lots of people have asked for it, and I think it would
> be worth expending some brainpower to try to figure out a way to be
> smarter than we are now, which is, in a nutshell, as dumb as possible.

+1. I asked custom to add the derivable quals manually for 10+ of table
each query last year and gained great results.

Anyone still have interest in this? Or is a better solution really
possible?
Or is the current method too bad to rescue?

--
Best Regards
Andy Fan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2022-02-11 10:57:31 Re: [BUG]Update Toast data failure in logical replication
Previous Message Dilip Kumar 2022-02-11 10:02:45 Assertion failure in WaitForWALToBecomeAvailable state machine