Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog
Date: 2022-05-25 09:55:40
Message-ID: 76f80a24-fe89-c6f9-5e79-239a02e8a2b7@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 5/25/22 00:16, Bruce Momjian wrote:
> On Mon, May 16, 2022 at 12:09:41AM +0200, Tomas Vondra wrote:
>> I think it's an interesting idea. In principle it allows deducing the
>> multi-column MCV for arbitrary combination of columns, not determined in
>> advance. We'd have the MCV with HLL instead of frequencies for columns
>> A, B and C:
>>
>> (a1, hll(a1))
>> (a2, hll(a2))
>> (...)
>> (aK, hll(aK))
>>
>>
>> (b1, hll(b1))
>> (b2, hll(b2))
>> (...)
>> (bL, hll(bL))
>>
>> (c1, hll(c1))
>> (c2, hll(c2))
>> (...)
>> (cM, hll(cM))
>>
>> and from this we'd be able to build MCV for any combination of those
>> three columns.
>
> Sorry, but I am lost here. I read about HLL here:
>
> https://towardsdatascience.com/hyperloglog-a-simple-but-powerful-algorithm-for-data-scientists-aed50fe47869
>
> However, I don't see how they can be combined for multiple columns.

It's the same as combining multiple HLL filters. HLL is essentially just
an array of counters, and to calculate a union (i.e. HLL for elements in
at least one of the input HLL sketches), you can just do Max() of the
counters. For intersection, you have to use inclusion-exclusion
principle, i.e.

intersection(HLL1, HLL2)
= estimate(HLL1) + estimate(HLL2) - estimate(union(HLL1,HLL2))

which is exactly the same as

P(A & B) = P(A) + P(B) - P(A | B)

There's more in:

https://github.com/citusdata/postgresql-hll

https://agkn.wordpress.com/2012/12/17/hll-intersections-2/

which also mentions the weakness - error is proportional to the size of
the union, while the intersection may be much smaller. Which might be an
issue, especially when combining multiple columns.

> Above, I know A,B,C are columns, but what is a1, a2, etc?

a1 is a value in column A, common enough to make it into the MCV. But
instead of just a frequency, we store a HLL tracking unique rows (by
adding CTID to the HLL).

So for example for a "city" column, you'd have

("NY", HLL of CTIDs for rows with city = NY)
("Philadephia", HLL of CTIDs for rows with city = Philadelphia)
...

etc.

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2022-05-25 10:13:45 Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)
Previous Message Ranier Vilela 2022-05-25 09:22:24 Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)