Re: why doesn't optimizer can pull up where a > ( ... )

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, xuncheng(at)google(dot)com, Daniel Gustafsson <daniel(at)yesql(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: why doesn't optimizer can pull up where a > ( ... )
Date: 2019-11-21 17:15:27
Message-ID: 20191121171527.sahzqaom7yre3ii2@development
Lists: pgsql-hackers

On Thu, Nov 21, 2019 at 11:57:22PM +0800, Andy Fan wrote:
>On Thu, Nov 21, 2019 at 6:12 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
>wrote:
>
>> On Thu, Nov 21, 2019 at 08:30:51AM +0800, Andy Fan wrote:
>> >>
>> >>
>> >> Hm. That actually raises the stakes a great deal, because if that's
>> >> what you're expecting, it would require planning out both the
>> transformed
>> >> and untransformed versions of the query before you could make a cost
>> >> comparison.
>> >
>> >
>> >I don't know an official name; let's call it "bloom filter push down
>> >(BFPD)" for reference. This algorithm may be helpful in this case with
>> >some extra effort.
>> >
>> >First, take "select ... from t1, t2 where t1.a = t2.a and t1.b = 100"
>> >for example, and assume t1 is scanned before t2 (as in a hash join or
>> >merge join), with t1's result as the inner table.
>> >
>> >1. It first scans t1 with the filter t1.b = 100.
>> >2. During that scan, it builds a bloom filter *based on the join key
>> >(t1.a) for the "selected" rows.*
>> >3. During the scan of t2, it filters t2.a through the bloom filter.
>> >4. It probes the hash table with the rows that passed the filter.
>> >
>>
>> So essentially just a hash join with a bloom filter?
>
>
>Yes, the idea is exactly the same, but we value it differently (both
>views are valid, and yours is the more common one). In my opinion, in
>some environments like Oracle Exadata it is much more powerful, since it
>transfers much less data from the data nodes to the compute nodes.
>
>Of course, the benefit is not guaranteed, but it is a good starting
>point for making the optimizer smarter.
>

Yes, it certainly depends on the workload. As was discussed in the
other thread, to get the most benefit we'd have to push the bloom filter
down the other side of the join as far as possible, ideally all the way
to the scan nodes. But no one has implemented that yet.
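The four steps quoted above can be illustrated with a toy hash join that
builds a bloom filter on the inner side and uses it to discard outer rows
before probing. This is only an illustrative sketch, not PostgreSQL code;
the `BloomFilter` class, the double-hashing scheme, and the dict-based hash
table are all assumptions made for the example:

```python
import hashlib

class BloomFilter:
    """Tiny bloom filter: k hash positions in an m-bit array (toy sketch)."""
    def __init__(self, m_bits=1024, k=3):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, key):
        # Double hashing (h1 + i*h2) to derive k positions from one digest.
        d = hashlib.sha256(repr(key).encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for h in self._positions(key):
            self.bits[h // 8] |= 1 << (h % 8)

    def might_contain(self, key):
        # May return false positives, never false negatives.
        return all(self.bits[h // 8] & (1 << (h % 8))
                   for h in self._positions(key))

def hash_join_with_bloom(t1, t2, t1_filter):
    # Steps 1+2: scan t1 with its local filter, building both the hash
    # table and a bloom filter on the join key t1.a for the selected rows.
    bloom, hash_table = BloomFilter(), {}
    for row in t1:
        if t1_filter(row):
            hash_table.setdefault(row["a"], []).append(row)
            bloom.add(row["a"])
    # Steps 3+4: scan t2, reject rows early via the bloom filter, and
    # probe the hash table only with the rows that pass.
    out = []
    for row in t2:
        if bloom.might_contain(row["a"]):            # cheap pre-filter
            for match in hash_table.get(row["a"], []):  # exact probe
                out.append((match, row))
    return out

t1 = [{"a": 1, "b": 100}, {"a": 2, "b": 100}, {"a": 3, "b": 7}]
t2 = [{"a": 1}, {"a": 3}, {"a": 9}]
print(hash_join_with_bloom(t1, t2, lambda r: r["b"] == 100))
```

In a distributed setting the point of step 3 is that the bloom filter is
small enough to ship to the t2 side, so non-matching rows can be dropped
before they are transferred at all.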

>
>> That doesn't seem very relevant to this thread (at least I don't see any
>> obvious link),
>>
>
>Computing the original "group by p_brand" over *all* the rows may not
>always be a good idea; if we can apply some filter before the group
>by, the result would be better.
>

Well, I think the vast majority of optimizations depend on the data. The
reason I think these two optimizations are quite different is that
one (bloom filter with hash joins) is fairly localized and does not change
the general plan shape - you simply make the decision at the hash join
level, and that's it (although it's true it does affect row counts on
one side of the join).

The optimization discussed here is very different, because it requires
transforming the query very early, before we can actually judge whether
it's a good idea or not.

>> And in some
>> cases building a bloom filter did result in nice speedups, but in other
>> cases it was just an extra overhead. But it does not require change of
>> plan shape, unlike the optimization discussed here.
>>
>
>I thought we could add a step named "build the filter" and another named
>"apply the filter". If so, the plan shape does change. Anyway, I don't
>think this is a key point.
>

Not sure. Perhaps there are similarities, but I don't see them.

>
>>
>> Ultimately there were discussions about pushing the bloom filter much
>> deeper on the non-hash side, but that was never implemented.
>
>
>Do you still have any plans for this feature? I see you raised the
>idea and it was welcomed.
>

I'm not working on it, and I don't think I'll get to do that any time
soon. So feel free to look into the problem if you wish.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
