Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H
Date: 2015-06-19 19:27:59
Message-ID: 55846D3F.3070607@2ndquadrant.com

On 06/19/2015 08:32 PM, Jeff Janes wrote:
> On Wed, Jun 17, 2015 at 10:52 AM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> Hi,
>
> I'm currently running some tests on a 3TB TPC-H data set, and I
> tripped over a pretty bad n_distinct underestimate, causing OOM in
> HashAgg (which somehow illustrates the importance of the
> memory-bounded hashagg patch Jeff Davis is working on).
>
> The problem is Q18, particularly this simple subquery:
>
> select l_orderkey
> from lineitem
> group by l_orderkey
> having sum(l_quantity) > 313;
>
> which is planned like this:
>
>                                    QUERY PLAN
> ---------------------------------------------------------------------------------
>  HashAggregate  (cost=598510163.92..598515393.93 rows=418401 width=12)
>    Group Key: l_orderkey
>    Filter: (sum(l_quantity) > '313'::double precision)
>    ->  Seq Scan on lineitem  (cost=0.00..508509923.28
>                               rows=18000048128 width=12)
> (4 rows)
>
> but sadly, in reality the l_orderkey cardinality looks like this:
>
> tpch=# select count(distinct l_orderkey) from lineitem;
>    count
> ------------
>  4500000000
> (1 row)
>
> That's a helluva difference - not the usual one or two orders of
> magnitude, but a 10000x underestimate.
>
>
>
> Is the row order in the table correlated with the value l_orderkey?

The statistics look like this:

 attnum |   attname    | n_distinct  | correlation
--------+--------------+-------------+-------------
      1 | l_orderkey   | 4.27349e+07 |    0.988631

so yes, it's pretty much perfectly correlated.
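
(For reference, those numbers come straight from the stats views;
something like this returns them - pg_stats has attname, n_distinct and
correlation, the attnum above is from pg_attribute:

    select attname, n_distinct, correlation
      from pg_stats
     where tablename = 'lineitem' and attname = 'l_orderkey';
)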

> Could you create copy of the table ordered at random, and see if it
> exhibits the same estimation issue?

Sadly no - this is a 2.5TB table, so it's not really easy to do. But I don't
really see why that should matter, because the sample is supposed to be
random.

But I think you might be on to something, because I manually collected a
random sample with 30k rows (by explicitly generating 30k random TIDs),
and I get this:

tpch=# select cnt, count(*) from (select l_orderkey, count(*) AS cnt
from lineitem_sample group by 1) foo group by 1;

 cnt | count
-----+-------
   1 | 29998
   2 |     1
(2 rows)

That's quite different from what ANALYZE gets, which effectively looks
something like this (derived from the logs, so not perfectly accurate -
I only have f1, ndistinct and nmultiple):

 cnt | count
-----+-------
   1 | 27976
   2 |   976
   3 |    24

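A quick sanity check on that: with ~18e9 rows and ~4.5e9 distinct keys,
each l_orderkey covers about 4 rows, so in a truly random 30k-row sample
the expected number of same-key row pairs is roughly

    select (30000.0 * 29999 / 2)   -- ~4.5e8 pairs of sampled rows
           * 3 / (1.8e10 - 1);     -- P(pair shares a key) = (4-1)/(N-1)
    -- => ~0.075

i.e. usually no duplicates at all (the single duplicate in my manual
sample is already on the lucky side), while ~1000 repeated values seems
essentially impossible for a uniform row sample.
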
Am I wrong or is the sample not that random?
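
FWIW, the reason f1/ndistinct/nmultiple are the interesting numbers is
that compute_scalar_stats() in analyze.c plugs them into the Haas-Stokes
"Duj1" estimator, roughly n*d / (n - f1 + f1*n/N). With nearly every
sampled value in f1, the denominator is dominated by (n - f1), so a
thousand spurious duplicates swing the estimate by orders of magnitude.
Assuming n=30000 (the counts above sum to that) and N=1.8e10:

    -- the distribution ANALYZE saw: d = 28976, f1 = 27976
    select 30000.0 * 28976
           / ((30000 - 27976) + 27976 * (30000.0 / 1.8e10));  -- ~430k

    -- my manual random sample: d = 29999, f1 = 29998
    select 30000.0 * 29999
           / ((30000 - 29998) + 29998 * (30000.0 / 1.8e10));  -- ~439M

The first is in the same ballpark as the 418401 groups in the plan
above; the second would be within ~10x of the real 4.5e9.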

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
