Re: Spilling hashed SetOps and aggregates to disk

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Spilling hashed SetOps and aggregates to disk
Date: 2018-06-05 08:05:35
Message-ID: 1d04d545-3d3c-75fd-528b-1e6279fb81a4@2ndquadrant.com
Lists: pgsql-hackers

On 06/05/2018 07:46 AM, Jeff Davis wrote:
> On Tue, 2018-06-05 at 07:04 +0200, Tomas Vondra wrote:
>> I expect the eviction strategy to be the primary design challenge of
>> this patch. The other bits will be mostly determined by this one
>> piece.
>
> Not sure I agree that this is the primary challenge.
>
> The cases that benefit from eviction are probably a minority. I see two
> categories that would benefit:
>
>   * Natural clustering in the heap. This sounds fairly common, but a
> lot of the cases that come to mind are too low-cardinality to be
> compelling; e.g. timestamps grouped by hour/day/month. If someone has
> run into a high-cardinality natural grouping case, let me know.
>   * ARRAY_AGG (or similar): individual state values can be large enough
> that we need to evict to avoid exceeding work_mem even if not adding
> any new groups.
>
> In either case, it seems like a fairly simple eviction strategy would
> work. For instance, we could just evict the entire hash table if
> work_mem is exceeded or if the hit rate on the hash table falls below a
> certain threshold. If there was really something important that should
> have stayed in the hash table, it will go back in soon anyway.
>
> So why should eviction be a major driver for the entire design? I agree
> it should be an area of improvement for the future, so let me know if
> you see a major problem, but I haven't been as focused on eviction.
>

My concern is more about what happens when the input tuple ordering is
inherently incompatible with the eviction strategy, greatly increasing
the amount of data written to disk during evictions.

Say for example that we can fit 1000 groups into work_mem, and that there
are 2000 groups in the input data set. If the input is correlated with the
groups, everything is peachy, because we'll evict the first batch and then
aggregate the remaining 1000 groups (or something like that). But if the
input data is random (which can easily happen, e.g. for IP addresses, UUIDs
and such), we'll hit the limit repeatedly and keep evicting, writing the
same groups to disk over and over.

I know you suggested simply dumping the whole hash table and starting
from scratch when we talked about this at PGCon, but ISTM it has
exactly this issue.
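
To make sure we're talking about the same thing, here's a rough standalone
sketch of that dump-everything strategy (the struct and function names are
made up for illustration, this is not meant as actual executor code):

    #include <stddef.h>

    typedef struct hashagg_state
    {
        size_t      mem_used;       /* memory used by the hash table */
        size_t      mem_limit;      /* work_mem limit, in bytes */
        long        ngroups;        /* groups currently in memory */
    } hashagg_state;

    /* hypothetical helper: write every partial state to disk, empty the table */
    static void
    dump_all_groups(hashagg_state *state)
    {
        /* ... serialize all state->ngroups partial states to a spill file ... */
        state->mem_used = 0;
        state->ngroups = 0;
    }

    /* called for each input tuple, after advancing the transition function */
    static void
    check_limit(hashagg_state *state)
    {
        /*
         * With correlated input this fires roughly once per "batch" of
         * groups; with random input the table keeps refilling with groups
         * we've already dumped, so the same group gets written many times.
         */
        if (state->mem_used > state->mem_limit)
            dump_all_groups(state);
    }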

But I don't know if there actually is a better option - maybe we simply
have to accept this problem. After all, running slowly is still better
than OOM (which may or may not happen now).

I wonder if we can somehow detect this at run-time and maybe fall back
to groupagg. E.g. we could compare the number of groups vs. the number of
input tuples when we first hit the limit. It's a rough heuristic, but maybe
sufficient.
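
Something along these lines is what I had in mind; the function is just a
sketch and the 33% threshold is pulled out of thin air:

    #include <stdbool.h>

    /*
     * Decide whether to abandon hashagg and fall back to sort + groupagg.
     * Called once, at the moment the hash table first exceeds work_mem.
     */
    static bool
    fallback_to_groupagg(long ngroups, long ntuples)
    {
        double  groups_per_tuple = (double) ngroups / (double) ntuples;

        /*
         * If most of the tuples consumed so far created a new group, the
         * hash table isn't collapsing the input much, and we can expect to
         * keep evicting.  Sorting the remaining input may well be cheaper.
         */
        return (groups_per_tuple > 0.33);
    }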

>> While the primary goal of the patch is addressing the OOM risks in
>> hash
>> aggregate, I think it'd be a mistake to see it just that way. I see
>> it
>> could allow us to perform hash aggregate more often, even if we know
>> the
>> groups won't fit into work_mem. If we could estimate how much of the
>> aggregate state we'll have to spill to disk, we could still prefer
>> hashagg over groupagg. We would pay the price for eviction, but on
>> large
>> data sets that can be massively cheaper than having to do sort.
>
> Agreed. I ran some tests of my patch in the last round, and they
> strongly supported choosing HashAgg a lot more often. A lot of sort
> improvements have been made though, so I really need to run some new
> numbers.
>

Yeah, Peter made the sort stuff a lot faster. But I think there are still a
lot of cases where hashagg would greatly reduce the amount of data that
needs to be written to disk, and that's not something the sort improvements
can address. If you need to aggregate a 1TB table, it's still going to be
~1TB of data you have to write to disk during an on-disk sort. But if there
is only a reasonable number of groups, hashagg only has to spill the (much
smaller) aggregate states, which greatly simplifies things.
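
To put some (completely made-up) numbers on it: sorting that 1TB table
means writing roughly 1TB during the external sort, while if the query only
produces, say, 10M groups with ~100 bytes of state each, that's about 1GB
of partial states, and even dumping the whole hash table a handful of times
keeps the spill volume in the low GB range, i.e. orders of magnitude less
I/O.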

>> far away), but it would be unfortunate to make this improvement
>> impossible/more difficult in the future.
>
> If you see anything that would make this difficult in the future, let
> me know.
>

Sure. Sorry for being too hand-wavy in this thread, and for possibly moving
the goal posts further away :-/ I got excited by you planning to work on
this again, and perhaps got a bit carried away ;-)

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
