Re: multivariate statistics (v19)

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, David Steele <david(at)pgmasters(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Álvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multivariate statistics (v19)
Date: 2017-01-04 21:57:09
Message-ID: 6f7ff2aa-b2b8-dbde-b39b-a9099f615466@2ndquadrant.com
Lists: pgsql-hackers

On 01/04/2017 03:21 PM, Dilip Kumar wrote:
> On Wed, Jan 4, 2017 at 8:05 AM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> Attached is v22 of the patch series, rebased to current master and fixing
>> the reported bug. I haven't made any other changes - the issues reported by
>> Petr are mostly minor, so I've decided to wait a bit more for (hopefully)
>> other reviews.
>
> v22 fixes the problem I reported. In my test, I observed that the
> GROUP BY estimate is much better with the ndistinct statistic.
>
> Here is one example:
>
> postgres=# explain analyze select p_brand, p_type, p_size from part
> group by p_brand, p_type, p_size;
> QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------
> HashAggregate (cost=37992.00..38992.00 rows=100000 width=36) (actual
> time=953.359..1011.302 rows=186607 loops=1)
> Group Key: p_brand, p_type, p_size
> -> Seq Scan on part (cost=0.00..30492.00 rows=1000000 width=36)
> (actual time=0.013..163.672 rows=1000000 loops=1)
> Planning time: 0.194 ms
> Execution time: 1020.776 ms
> (5 rows)
>
> postgres=# CREATE STATISTICS s2 WITH (ndistinct) on (p_brand, p_type,
> p_size) from part;
> CREATE STATISTICS
> postgres=# analyze part;
> ANALYZE
> postgres=# explain analyze select p_brand, p_type, p_size from part
> group by p_brand, p_type, p_size;
> QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------
> HashAggregate (cost=37992.00..39622.46 rows=163046 width=36) (actual
> time=935.162..992.944 rows=186607 loops=1)
> Group Key: p_brand, p_type, p_size
> -> Seq Scan on part (cost=0.00..30492.00 rows=1000000 width=36)
> (actual time=0.013..156.746 rows=1000000 loops=1)
> Planning time: 0.308 ms
> Execution time: 1001.889 ms
>
> In the above example:
> Without MV stats -> estimated: 100000, actual: 186607
> With MV stats -> estimated: 163046, actual: 186607
>

Thanks. Those plans match my experiments with the TPC-H data set,
although I've been playing with the smallest scale (1GB).

It's not very difficult to make the estimation error arbitrarily large,
e.g. by using perfectly correlated (identical) columns.
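
For example, a minimal sketch (the table t, its contents, and the
statistics name s3 are hypothetical, not from the reported test; it uses
the patch's CREATE STATISTICS syntax shown above):

CREATE TABLE t AS
  SELECT i % 1000 AS a, i % 1000 AS b
    FROM generate_series(1, 1000000) s(i);
ANALYZE t;

-- Per-column stats see ~1000 distinct values in each column, so the
-- planner estimates about 1000 * 1000 = 1000000 groups, while the
-- actual number of (a, b) groups is only 1000.
EXPLAIN ANALYZE SELECT a, b FROM t GROUP BY a, b;

-- With the multivariate ndistinct statistic, the estimate should drop
-- back to roughly 1000.
CREATE STATISTICS s3 WITH (ndistinct) ON (a, b) FROM t;
ANALYZE t;
EXPLAIN ANALYZE SELECT a, b FROM t GROUP BY a, b;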

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
