From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Claudio Freire <klaussfreire(at)gmail(dot)com> |
Cc: | Peter Geoghegan <pg(at)bowt(dot)ie>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Hash Joins vs. Bloom Filters / take 2 |
Date: | 2018-02-22 21:14:35 |
Message-ID: | 67862a16-e48d-e918-5fac-e988e24d19f7@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 02/22/2018 09:52 PM, Claudio Freire wrote:
> On Thu, Feb 22, 2018 at 5:11 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> On 02/22/2018 08:33 PM, Claudio Freire wrote:
>>> That's kinda slow to do per-item. I tried to "count" distinct items by
>>> checking the BF before adding (don't add redundantly), but that's less
>>> precise than a HLL in my experience.
>>
>> But you don't need to do that for each item you try to add - you know
>> that with M bits and K hash functions you can't reach 50% before at
>> least M/(2*K) additions. And you don't really need to track the number
>> of bits exactly - if it gets 55% full, it's not going to explode.
>>
>> So just wait until M/(2*K) additions, see how many bits remain until the
>> threshold - rinse and repeat. And use some reasonable minimum distance
>> (say, 5% of the capacity), not to do the check too often.
>
> Ok, that works too.
>
> Point is, SBFs help to keep the BF size as small as possible, keep it
> below work_mem, and to avoid caring about the total number of distinct
> items.
>
> You may want the planner to try and estimate that to figure out
> whether it's worth trying BFs or not, but once you decide to try,
> SBFs remove any dependency on the accuracy of that estimate.
>
OK, thanks for reminding me about SBF and for the discussion.
At this point I'll probably focus on the other parts though -
determining selectivity of the join, etc. Which I think is crucial, and
we need to get that right even for accurate estimates. It's good to know
that we have a solution for that part, though.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From | Date | Subject | |
---|---|---|---|
Next Message | Robert Haas | 2018-02-22 21:23:24 | Re: [HACKERS] Partition-wise aggregation/grouping |
Previous Message | Andres Freund | 2018-02-22 21:04:46 | Re: Add PGDLLIMPORT to enable_hashagg |