Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

From: Feng Tian <ftian(at)vitessedata(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H
Date: 2015-06-20 15:29:33
Message-ID: CAFWGqnvsVf-XdhoM9kqE7ko9MSXDMT8DAA=8z2Q_jA4uq1nEHA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jun 20, 2015 at 7:56 AM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:

> Hi,
>
> On 06/20/2015 08:54 AM, Feng Tian wrote:
>
>>
>> While better sample/stats is important for choosing a good plan, in
>> this query, hash agg is really the right plan. If a sort agg is
>> chosen, the performance will be really really bad. The patch that
>> Jeff is working on is critical for a decent TPCH number (unless you
>> have unlimited amount of memory).
>>
>
> I do agree that Jeff's memory-bounded hashagg patch is a very important
> feature, and in fact we spent a fair amount of time discussing it in
> Ottawa. So I'm looking forward to getting that soon ;-)
>
> But I don't think hashagg is going to be very good in this particular
> case. With a 3TB dataset, the query runs out of memory on a machine with
> 256GB of RAM. So let's assume a complete hash table has ~256GB. With
> work_mem=1GB that means only ~1/256 of the table can be processed in one
> batch, so we'll process the first 1/256 of the table, and write out the
> remaining 99% into batches. Then we'll read the batches one by one, and
> process those. The table has ~2.5TB, so we'll read 2.5TB, write out ~2.49TB
> into batches, and then read those ~2.49TB again. At least that's how I
> understand Jeff's memory-bounded hashagg proposal.
>
> The sort may perform worse in the general case, but in this case there's
> an index on the column, and the table is almost perfectly correlated by
> that column (due to generating the orders one by one, but it seems
> plausible it'd be the same in reality, assuming the orders are numbered
> using a sequence). So doing the sort by an indexscan seems rather cheap,
> and you only need to scan the table once.
>
> regards
>
>
> --
> Tomas Vondra http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>

I have not read Jeff's patch, but here is how I think hash agg should work:

Hash agg scans the lineitem table and performs the aggregation in memory.
Once work_mem is exhausted, it writes the intermediate states to disk,
bucket by bucket. When the lineitem scan is finished, it reads the tuples
of one bucket back at a time, combines the intermediate states, and
finalizes the aggregation. I saw quite an extensive discussion of combining
aggregates on the dev list, so I assume that will be added.
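
Roughly, a minimal sketch in Python of the scheme I have in mind (this is
not Jeff's patch; keyfn, init, transfn, combinefn and finalfn are
hypothetical stand-ins for the grouping key and the aggregate's transition,
combine and final functions):

import pickle, tempfile

def hash_agg(tuples, keyfn, init, transfn, combinefn, finalfn,
             max_groups_in_mem, npartitions):
    table = {}                                    # in-memory hash table
    spills = [tempfile.TemporaryFile() for _ in range(npartitions)]

    # Pass 1: scan the input once.  Groups that fit in work_mem are
    # aggregated in memory; everything else is written to its spill
    # partition as a per-tuple intermediate state.
    for t in tuples:
        k = keyfn(t)
        if k in table:
            table[k] = transfn(table[k], t)
        elif len(table) < max_groups_in_mem:
            table[k] = transfn(init(), t)
        else:
            pickle.dump((k, transfn(init(), t)),
                        spills[hash(k) % npartitions])

    for k, state in table.items():                # finalize in-memory groups
        yield k, finalfn(state)

    # Pass 2: read one partition back at a time, combine the intermediate
    # states per key, and finalize.
    for f in spills:
        f.seek(0)
        part = {}
        while True:
            try:
                k, state = pickle.load(f)
            except EOFError:
                break
            part[k] = combinefn(part[k], state) if k in part else state
        for k, state in part.items():
            yield k, finalfn(state)

The point is that the second pass only touches already-aggregated state,
which is what makes the I/O arithmetic below work out.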

Assuming that, modulo an efficient I/O chunk size, work_mem is bigger than
the square root of the data size after aggregation, the above algorithm
finishes by writing out and reading back the spilled state only once.
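
As a rough sanity check (my own back-of-envelope; the ~80GB figure comes
from the estimate below, and the 1MB spill chunk size is an assumption):

# With work_mem M and I/O chunk size B you can keep about M/B spill
# partitions open, each ending up around S*B/M bytes; asking that a
# partition fit back into memory gives M >= sqrt(S*B).
S = 80e9            # bytes of aggregated state (see the estimate below)
B = 1e6             # assumed 1MB I/O chunk
M = (S * B) ** 0.5
print(M / 1e6)      # ~283 (MB), comfortably below work_mem = 1GB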

For TPC-H 3T, the lineitem table has about 20 billion rows and 4 to 5
billion orders. For the simple subquery, one needs to:
1. scan the table: ~3TB of I/O
2. write out the intermediate state: 4 billion * sizeof(key column +
intermediate state, ~20 bytes) = ~80GB
3. read back that ~80GB.

If a sort is used, again assuming work_mem is bigger than the square root
of the data size, you need to scan the table, write out about 20 billion *
20 bytes ~ 400GB of run data, and read back 400GB. The sort may need extra
merge passes, but let's ignore that.
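
The same comparison in code form (my arithmetic, just restating the row
counts and ~20-byte state size above):

rows, groups, width = 20e9, 4e9, 20    # lineitem rows, orders, bytes/state
hash_spill = groups * width            # hash agg spills only aggregated state
sort_spill = rows * width              # sort spills every input row
print(hash_spill / 1e9, sort_spill / 1e9)   # 80.0 vs 400.0 GB, each way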

Hash agg has better performance because:
1. it does less I/O, and
2. hashing is a linear-time algorithm, compared to sort at O(n*log(n)).
