Re: Hash Joins vs. Bloom Filters / take 2

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash Joins vs. Bloom Filters / take 2
Date: 2018-02-22 21:14:35
Message-ID: 67862a16-e48d-e918-5fac-e988e24d19f7@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 02/22/2018 09:52 PM, Claudio Freire wrote:
> On Thu, Feb 22, 2018 at 5:11 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> On 02/22/2018 08:33 PM, Claudio Freire wrote:
>>> That's kinda slow to do per-item. I tried to "count" distinct items by
>>> checking the BF before adding (don't add redundantly), but that's less
>>> precise than a HLL in my experience.
>>
>> But you don't need to do that for each item you try to add - you know
>> that with M bits and K hash functions you can't reach 50% before at
>> least M/(2*K) additions. And you don't really need to track the number
>> of bits exactly - if it gets 55% full, it's not going to explode.
>>
>> So just wait until M/(2*K) additions, see how many bits remain until the
>> threshold - rinse and repeat. And use some reasonable minimum distance
>> (say, 5% of the capacity), not to do the check too often.
>
> Ok, that works too.
>
> Point is, SBFs help to keep the BF size as small as possible, keep it
> below work_mem, and to avoid caring about the total number of distinct
> items.
>
> You may want the planner to try and estimate that to figure out
> whether it's worth trying BFs or not, but once you decide to try,
> SBFs remove any dependency on the accuracy of that estimate.
>

OK, thanks for reminding me about SBF and for the discussion.

At this point I'll probably focus on the other parts though -
determining selectivity of the join, etc. Which I think is crucial, and
we need to get that right even for accurate estimates. It's good to know
that we have a solution for that part, though.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2018-02-22 21:23:24 Re: [HACKERS] Partition-wise aggregation/grouping
Previous Message Andres Freund 2018-02-22 21:04:46 Re: Add PGDLLIMPORT to enable_hashagg