Re: Spilling hashed SetOps and aggregates to disk

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Spilling hashed SetOps and aggregates to disk
Date: 2018-06-05 13:17:23
Message-ID: 962b2a34-4189-3d98-5d74-aa55381ba759@2ndquadrant.com
Lists: pgsql-hackers

On 06/05/2018 02:49 PM, Andres Freund wrote:
> Hi,
>
> On 2018-06-05 10:05:35 +0200, Tomas Vondra wrote:
>> My concern is more about what happens when the input tuple ordering is
>> inherently incompatible with the eviction strategy, greatly increasing the
>> amount of data written to disk during evictions.
>>
>> Say for example that we can fit 1000 groups into work_mem, and that there
>> are 2000 groups in the input data set. If the input is correlated with the
>> groups, everything is peachy because we'll evict the first batch, and then
>> group the remaining 1000 groups (or something like that). But if the input
>> data is random (which can easily happen, e.g. for IP addresses, UUIDs and
>> such) we'll hit the limit repeatedly and will evict much sooner.
>
>> I know you suggested simply dumping the whole hash table and starting from
>> scratch while we talked about this at pgcon, but ISTM it has exactly this
>> issue.
>
> Yea, that's the case I was thinking of where going to sorting will very
> likely have better performance.
>
> I think it'd even be sensible to have a "skew tuple" like
> optimization. When detecting getting closer to memory exhaustion, move
> new groups to the tuplesort, but already hashed tuples stay in the
> hashtable. That'd still need tuples being moved to the sort in the
> cases where the transition values get too big (say array_agg), but that
> should be comparatively rare. I'm sure we could do better in selecting
> the hash-tabled values than just taking the first incoming ones, but
> that shouldn't be too bad.
>
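(To make sure we're talking about the same thing, here is roughly how I
read that idea, as a stand-alone toy in plain C: int keys instead of
tuples, COUNT(*) instead of real transition functions, a tiny fixed-size
array instead of the hash table, and a counter instead of an actual
tuplesort. None of the names are meant to match the executor code.)

/*
 * Stand-alone toy, not executor code: "tuples" are just int keys, the
 * aggregate is COUNT(*), and a tiny fixed-size array stands in for the
 * hash table.  Once the table is full, tuples belonging to groups that
 * are not already in the table get diverted (here merely counted); in
 * the real thing they'd be fed to a tuplesort for a second pass.
 */
#include <stdio.h>

#define TABLE_CAPACITY 4        /* pretend work_mem fits only 4 groups */

typedef struct { int key; long count; } Group;

static Group table[TABLE_CAPACITY];
static int   ngroups = 0;

/* find the group, creating it only while there is room in the "hash table" */
static Group *lookup(int key)
{
    for (int i = 0; i < ngroups; i++)
        if (table[i].key == key)
            return &table[i];

    if (ngroups < TABLE_CAPACITY)
    {
        table[ngroups].key = key;
        table[ngroups].count = 0;
        return &table[ngroups++];
    }

    return NULL;                /* table full: caller must divert the tuple */
}

int main(void)
{
    /* random-ish keys, more groups than the table can hold */
    int keys[] = {7, 3, 7, 9, 1, 3, 5, 7, 2, 9, 8, 1, 7, 6, 3, 4};
    int ndiverted = 0;

    for (int i = 0; i < (int) (sizeof(keys) / sizeof(keys[0])); i++)
    {
        Group *g = lookup(keys[i]);

        if (g)
            g->count++;         /* advance the transition state in memory */
        else
            ndiverted++;        /* would go to the tuplesort instead */
    }

    for (int i = 0; i < ngroups; i++)
        printf("group %d stayed hashed, count = %ld\n",
               table[i].key, table[i].count);
    printf("%d tuples would have been diverted to the sort\n", ndiverted);

    return 0;
}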

Not sure. I'd imagine the "compression" due to aggregating many tuples
into a much smaller amount of memory would be a clear win, so I don't
find the "let's just dump all input tuples into a file" approach very
attractive.

I think evicting a fraction of the aggregate state (say ~25%) would
work better - choosing the "oldest" items seems OK, as those likely
represent many input tuples (and thus have a high "compaction" ratio).
Or we could evict items representing rare groups, to make space for the
common ones. Perhaps a combined eviction strategy (10% of each type)
would work well. Of course, I'm conveniently ignoring the fact that we
don't currently have the information needed to make these decisions.
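
Roughly the kind of thing I have in mind, again as a stand-alone toy
(plain C, an array ordered by first appearance instead of the hash
table, COUNT(*) as the only aggregate, and a temp file instead of the
real spill machinery; all the names are mine):

/*
 * Stand-alone toy of "evict the oldest ~25% of the groups": groups are
 * kept in order of first appearance, and when the table fills up the
 * oldest quarter is written out and the rest compacted.  An evicted
 * group that shows up again simply starts a fresh in-memory state and
 * gets merged with the spilled one in a later pass.
 */
#include <stdio.h>
#include <string.h>

#define MAX_GROUPS     8        /* pretend this is the work_mem limit */
#define EVICT_FRACTION 4        /* evict 1/4 of the groups when full  */

typedef struct { int key; long count; } Group;

static Group groups[MAX_GROUPS];    /* ordered by first appearance */
static int   ngroups = 0;

static void evict_oldest(FILE *spill)
{
    int nevict = MAX_GROUPS / EVICT_FRACTION;

    /*
     * The oldest entries have typically absorbed the most input tuples,
     * so writing them out gives the best "compaction" ratio per byte.
     */
    for (int i = 0; i < nevict; i++)
        fprintf(spill, "%d %ld\n", groups[i].key, groups[i].count);

    memmove(groups, groups + nevict, (ngroups - nevict) * sizeof(Group));
    ngroups -= nevict;
}

static void advance(int key, FILE *spill)
{
    for (int i = 0; i < ngroups; i++)
    {
        if (groups[i].key == key)
        {
            groups[i].count++;
            return;
        }
    }

    if (ngroups == MAX_GROUPS)
        evict_oldest(spill);        /* make room for the new group */

    groups[ngroups].key = key;
    groups[ngroups].count = 1;
    ngroups++;
}

int main(void)
{
    int   keys[] = {1, 2, 3, 1, 4, 5, 6, 2, 7, 8, 9, 1, 10, 11, 3, 12};
    FILE *spill = tmpfile();

    for (int i = 0; i < (int) (sizeof(keys) / sizeof(keys[0])); i++)
        advance(keys[i], spill);

    printf("%d groups still in memory, the rest was spilled for a later pass\n",
           ngroups);

    fclose(spill);
    return 0;
}

Keeping the array ordered by first appearance is just the cheapest way
to approximate "oldest first" in a toy; the real thing would need some
other way to track age (or frequency, for the rare-groups variant).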

I'm sure it's possible to make better decisions based on some cost
estimates, both at plan time and then during execution.

That being said, I don't want to over-think / over-engineer this.
Getting something that reduces the risk of OOM as a first step is a
good enough improvement; if we can do something better later, great.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
