Re: WIP: multivariate statistics / proof of concept

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: multivariate statistics / proof of concept
Date: 2016-11-22 03:42:14
Message-ID: 6a0b84cf-24e3-bf28-c4ac-b3ef33c3b55f@2ndquadrant.com
Lists: pgsql-hackers

On 11/21/2016 11:10 PM, Robert Haas wrote:
> [ reviving an old multivariate statistics thread ]
>
> On Thu, Nov 13, 2014 at 6:31 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> On 12 October 2014 23:00, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>>
>>> It however seems to be working sufficiently well at this point, enough
>>> to get some useful feedback. So here we go.
>>
>> This looks interesting and useful.
>>
>> What I'd like to check before a detailed review is that this has
>> sufficient applicability to be useful.
>>
>> My understanding is that Q9 and Q18 of TPC-H have poor plans as a
>> result of multi-column stats errors.
>>
>> Could you look at those queries and confirm that this patch can
>> produce better plans for them?
>
> Tomas, did you ever do any testing in this area? One of my
> colleagues, Rafia Sabih, recently did some testing of TPC-H queries @
> 20 GB. Q18 actually doesn't complete at all right now because of an
> issue with the new simplehash implementation. I reported it to Andres
> and he tracked it down, but hasn't posted the patch yet - see
> http://archives.postgresql.org/message-id/20161115192802.jfbec5s6ougxwicp@alap3.anarazel.de
>
> Of the remaining queries, the slowest are Q9 and Q20, and both of them
> have serious estimation errors. On Q9, things go wrong here:
>
>    ->  Merge Join  (cost=5225092.04..6595105.57 rows=154 width=47)
>                    (actual time=103592.821..149335.010 rows=6503988 loops=1)
>          Merge Cond: (partsupp.ps_partkey = lineitem.l_partkey)
>          Join Filter: (lineitem.l_suppkey = partsupp.ps_suppkey)
>          Rows Removed by Join Filter: 19511964
>          ->  Index Scan using
> [snip]
>
> Rows Removed by Filter: 756627
>
> The estimate for the index scan on partsupp is essentially perfect,
> and the lineitem-part join is off by about 3x. However, the merge
> join is off by about 4000x, which is real bad.
>

The patch only deals with statistics on base relations at this point, not
joins. It's meant to be extended in that direction, and the syntax already
allows for it, but for now that's all there is. No join statistics.
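
For context, the statistics are defined on a set of columns of a single
table, roughly like this (just a sketch of the intended usage; the exact
syntax in the WIP patch may still change):

  CREATE STATISTICS s1 ON l_partkey, l_suppkey FROM lineitem;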

That being said, this estimate should be improved on 9.6, provided you
create a foreign key between the two tables. In fact, that patch was
motivated precisely by Q9.
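
For reference, the foreign key in question is roughly this (a sketch based
on the TPC-H schema; the constraint name is made up):

  ALTER TABLE lineitem ADD CONSTRAINT lineitem_partsupp_fkey
    FOREIGN KEY (l_partkey, l_suppkey)
    REFERENCES partsupp (ps_partkey, ps_suppkey);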

This is how the join estimate looks on scale 1 without the FK between
the two tables:

                               QUERY PLAN
 -----------------------------------------------------------------------
  Merge Join  (cost=19.19..700980.12 rows=2404 width=261)
    Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND
                 (lineitem.l_suppkey = partsupp.ps_suppkey))
    ->  Index Scan using idx_lineitem_part_supp on lineitem
          (cost=0.43..605856.84 rows=6001117 width=117)
    ->  Index Scan using partsupp_pkey on partsupp
          (cost=0.42..61141.76 rows=800000 width=144)
 (4 rows)

and with the foreign key:

                               QUERY PLAN
 -----------------------------------------------------------------------
  Merge Join  (cost=19.19..700980.12 rows=6001117 width=261)
              (actual rows=6001215 loops=1)
    Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND
                 (lineitem.l_suppkey = partsupp.ps_suppkey))
    ->  Index Scan using idx_lineitem_part_supp on lineitem
          (cost=0.43..605856.84 rows=6001117 width=117)
          (actual rows=6001215 loops=1)
    ->  Index Scan using partsupp_pkey on partsupp
          (cost=0.42..61141.76 rows=800000 width=144)
          (actual rows=6001672 loops=1)
  Planning time: 3.840 ms
  Execution time: 21987.913 ms
 (6 rows)
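
Both plans come from a plain two-table join on both key columns, roughly
something like this (not necessarily the exact query text):

  EXPLAIN (ANALYZE)
  SELECT *
    FROM lineitem
    JOIN partsupp ON l_partkey = ps_partkey AND l_suppkey = ps_suppkey;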

> On Q20, things go wrong here:
>
> [snip]
>
> The estimate for the GroupAggregate feeding one side of the merge join
> is quite accurate. The estimate for the part-partsupp join on the
> other side is off by 8x. Then things get much worse: the estimate for
> the merge join is off by 400x.
>

Well, most of the estimation error comes from the join, but sadly the
aggregate makes using the foreign key impossible, at least in the current
version. I don't know whether that can be improved somehow.
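
To illustrate, the problematic part of Q20 boils down to something like
this (a simplified, decorrelated sketch of the inner query, not the exact
text used in the benchmark):

  SELECT ps_suppkey
    FROM partsupp
    JOIN part ON ps_partkey = p_partkey
    JOIN (SELECT l_partkey, l_suppkey, 0.5 * sum(l_quantity) AS half_qty
            FROM lineitem
           GROUP BY l_partkey, l_suppkey) agg
      ON agg.l_partkey = ps_partkey AND agg.l_suppkey = ps_suppkey
   WHERE ps_availqty > agg.half_qty;

The join is against the output of the aggregate, not against the lineitem
table itself, so the foreign-key machinery added in 9.6 has nothing to
match against.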

> I'm not really sure whether the multivariate statistics stuff will fix
> this kind of case or not, but if it did it would be awesome.
>

Join statistics are something I'd like to add eventually, but I don't see
that happening in the first version. Also, the patch received no reviews
in this CF, and making it even larger is unlikely to make it more
attractive.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
