Re: PoC/WIP: Extended statistics on expressions

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PoC/WIP: Extended statistics on expressions
Date: 2021-03-17 21:30:59
Message-ID: ce50ee26-7135-c435-cae8-073ad59eba60@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 3/17/21 9:58 PM, Dean Rasheed wrote:
> On Wed, 17 Mar 2021 at 20:48, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
>>
>> For reference, here is the test case I was using (which isn't really very good for
>> catching dependence between columns):
>>
>
> And here's a test case with much more dependence between the columns:
>
> DROP TABLE IF EXISTS foo;
> CREATE TABLE foo (a int, b int, c int, d int);
> INSERT INTO foo SELECT x%2, x%5, x%10, x%15 FROM generate_series(1,100000) x;
> SELECT COUNT(DISTINCT a) FROM foo; -- 2
> SELECT COUNT(DISTINCT b) FROM foo; -- 5
> SELECT COUNT(DISTINCT c) FROM foo; -- 10
> SELECT COUNT(DISTINCT d) FROM foo; -- 15
> SELECT COUNT(DISTINCT (a+b)) FROM foo; -- 6
> SELECT COUNT(DISTINCT (c+d)) FROM foo; -- 20
> SELECT COUNT(DISTINCT ((a+b),c)) FROM foo; -- 10
> SELECT COUNT(DISTINCT ((a+b),(c+d))) FROM foo; -- 30
>
> -- First case: stats on [(a+b),c]
> CREATE STATISTICS s1(ndistinct) ON (a+b),c FROM foo;
> ANALYSE foo;
> EXPLAIN ANALYSE
> SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
> -- Estimate = 150, Actual = 30
> -- This estimate is ndistinct((a+b),c) * ndistinct(d) = 10*15,
> -- which is much better than ndistinct((a+b)) * ndistinct(c) *
> ndistinct(d) = 6*10*15 = 900
> -- Estimate with no stats = 1500
>
> -- Second case: stats on (c+d) as well
> CREATE STATISTICS s2 ON (c+d) FROM foo;
> ANALYSE foo;
> EXPLAIN ANALYSE
> SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
> -- Estimate = 120, Actual = 30
> -- This estimate is ndistinct((a+b)) * ndistinct((c+d)) = 6*20
>
> Again, I'd say the current behaviour is pretty good.
>

Thanks!

I agree applying at least the [(a+b),c] stats is probably the right
approach, as it means we're considering at least the available
information about dependence between the columns.

I think to improve this, we'll need to teach the code to use overlapping
statistics, a bit like conditional probability. In this case we might do
something like this:

ndistinct((a+b),c) * (ndistinct((c+d)) / ndistinct(c))

Which in this case would be either, for the "less correlated" case

228 * 24 / 12 = 446 (actual = 478, current estimate = 480)

or, for the "more correlated" case

10 * 20 / 10 = 20 (actual = 30, current estimate = 120)

But that's clearly a matter for a future patch, and I'm sure there are
cases where this will produce worse estimates.

Anyway, I plan to go over the patches one more time, and start pushing
them sometime early next week. I don't want to leave it until the very
last moment in the CF.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2021-03-17 21:43:31 Re: WIP: WAL prefetch (another approach)
Previous Message Dean Rasheed 2021-03-17 20:58:22 Re: PoC/WIP: Extended statistics on expressions