Re: improving GROUP BY estimation

From: Mark Dilger <hornschnorter(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: improving GROUP BY estimation
Date: 2016-02-26 00:19:39
Message-ID: B9765D22-4BDC-4764-935C-2CBE329B0455@gmail.com
Lists: pgsql-hackers


> On Feb 25, 2016, at 3:16 PM, Mark Dilger <hornschnorter(at)gmail(dot)com> wrote:
>
>
>> On Feb 23, 2016, at 5:12 PM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> <snip>
>>
>> So much better. Clearly, there are cases where this will over-estimate the cardinality - for example when the values are somehow correlated.
>>
>
> I applied your patch, which caused a few regression tests to fail. Attached
> is a patch that includes the necessary changes to the expected test results.
>
> It is not hard to write test cases where your patched version overestimates
> the number of rows by roughly the same factor as the old code underestimates
> them. My very first test, which was not specifically designed to demonstrate
> this, happens to be one such example:
>
>
> CREATE TABLE t (a INT, b int);
> INSERT INTO t SELECT sqrt(gs)::int, gs FROM generate_series(1,10000000) gs;
> ANALYZE t;
> EXPLAIN SELECT a FROM t WHERE b < 1000 GROUP BY a;
>                           QUERY PLAN
> ---------------------------------------------------------------
>  HashAggregate  (cost=169250.21..169258.71 rows=850 width=4)
>    Group Key: a
>    ->  Seq Scan on t  (cost=0.00..169247.71 rows=1000 width=4)
>          Filter: (b < 1000)
> (4 rows)
>
> SELECT COUNT(*) FROM (SELECT a FROM t WHERE b < 1000 GROUP BY a) AS ss;
>  count
> -------
>     32
> (1 row)
>
>
>
> So, it estimates 850 rows where only 32 are returned. Without applying your patch,
> it estimates just 1 row where 32 are returned. That's an overestimate of roughly 26 times,
> rather than an underestimate of 32 times.
>
> As a patch review, I'd say that your patch does what you claim it does, applies
> cleanly, and passes the regression tests with my included modifications. I think there
> needs to be some discussion on the list about whether the patch is a good idea.
>
> Mark Dilger
>
>
> <estimate-num-groups-v2.txt>

Assuming the goal is to minimize how inaccurate the estimates are, I get better
results from a kind of averaging of the old value from the current code base and
the new value from Tomas's patch. I tried this and, at least for the few examples
we've been throwing around, the worst-case results were not as wildly inaccurate
as with either of the other two approaches:

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 46c95b0..c83135a 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3414,6 +3414,8 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 		Assert(rel->reloptkind == RELOPT_BASEREL);
 		if (rel->tuples > 0)
 		{
+			double		old_factor;	/* The way it is currently done on master */
+			double		new_factor;	/* The way Tomas Vondra proposes doing it */
 			/*
 			 * Clamp to size of rel, or size of rel / 10 if multiple Vars. The
 			 * fudge factor is because the Vars are probably correlated but we
@@ -3440,7 +3442,10 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 			/*
 			 * Multiply by restriction selectivity.
 			 */
-			reldistinct *= rel->rows / rel->tuples;
+			old_factor = rel->rows / rel->tuples;	/* old way */
+			new_factor = (1 - powl((reldistinct - 1) / reldistinct, rel->rows));	/* Tomas's way */
+
+			reldistinct *= sqrt(old_factor * new_factor);	/* geometric mean of the old and new factors */
 
 			/*
 			 * Update estimate of total distinct groups.
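
To make the trade-off concrete, here is a rough back-of-the-envelope comparison
of what each formula would produce for the test table above. The exact numbers
depend on the statistics ANALYZE actually stored; I'm assuming ndistinct for "a"
came out near 3160 and the "b < 1000" restriction at 1000 rows out of 10,000,000,
which is roughly what the plan above implies:

-- Rough comparison; assumed inputs: reldistinct ~ 3160, rel->rows ~ 1000,
-- rel->tuples = 10000000.
SELECT
    3160 * (1000.0 / 10000000)                        AS old_estimate,      -- ~0.3, shows up as 1 in the plan
    3160 * (1 - power(3159.0 / 3160.0, 1000))         AS patched_estimate,  -- ~857
    3160 * sqrt((1000.0 / 10000000)
                * (1 - power(3159.0 / 3160.0, 1000))) AS blended_estimate;  -- ~16

The actual number of groups is 32, so the blend misses by about a factor of two,
while the current code and the patch each miss by more than an order of magnitude.
I haven't tried this beyond the toy examples in this thread.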
