Re: Spilling hashed SetOps and aggregates to disk

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Spilling hashed SetOps and aggregates to disk
Date: 2018-06-11 14:25:45
Message-ID: CA+Tgmoan8fvokcx5ZmYoLnbD7XSh1pBBQsx6MdYY1MNCwBn7PA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jun 5, 2018 at 1:46 AM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> On Tue, 2018-06-05 at 07:04 +0200, Tomas Vondra wrote:
>> I expect the eviction strategy to be the primary design challenge of
>> this patch. The other bits will be mostly determined by this one
>> piece.
>
> Not sure I agree that this is the primary challenge.
>
> The cases that benefit from eviction are probably a minority. I see two
> categories that would benefit:
>
> * Natural clustering in the heap. This sounds fairly common, but a
> lot of the cases that come to mind are too low-cardinality to be
> compelling; e.g. timestamps grouped by hour/day/month. If someone has
> run into a high-cardinality natural grouping case, let me know.
> * ARRAY_AGG (or similar): individual state values can be large enough
> that we need to evict to avoid exceeding work_mem even if not adding
> any new groups.
>
> In either case, it seems like a fairly simple eviction strategy would
> work. For instance, we could just evict the entire hash table if
> work_mem is exceeded or if the hit rate on the hash table falls below a
> certain threshold. If there was really something important that should
> have stayed in the hash table, it will go back in soon anyway.
>
> So why should eviction be a major driver for the entire design? I agree
> it should be an area of improvement for the future, so let me know if
> you see a major problem, but I haven't been as focused on eviction.
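The flush-everything policy Jeff sketches above could look roughly like the following. This is an illustrative sketch only; the struct and field names are hypothetical stand-ins, not PostgreSQL internals, and the hit-rate threshold is an arbitrary tuning knob.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for a hash aggregate (illustrative names,
 * not PostgreSQL internals). */
typedef struct HashAggState
{
    size_t mem_used;        /* bytes currently held by the hash table */
    size_t work_mem_bytes;  /* memory budget */
    long   lookups;         /* probes since the last flush */
    long   hits;            /* probes that found an existing group */
} HashAggState;

/* Flush-everything policy: dump the whole table when the memory budget
 * is exceeded or the hit rate drops below a threshold.  Groups that
 * really mattered will simply be re-created by later input. */
static bool
should_flush(const HashAggState *st, double min_hit_rate)
{
    if (st->mem_used > st->work_mem_bytes)
        return true;
    if (st->lookups > 1000 &&
        (double) st->hits / (double) st->lookups < min_hit_rate)
        return true;
    return false;
}
```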

Hmm, I think the eviction strategy is pretty important. For example,
as we discussed in the hallway at PGCon, if we're flushing the whole
hash table (or one of several hash tables), we can account for memory
usage chunk-by-chunk instead of palloc-by-palloc, which as you pointed
out is more accurate and substantially less expensive. If we are
evicting individual tuples, though, it's quite possible that evicting
even a large number of tuples might not free up any chunks, so we'd
probably have to do accounting palloc-by-palloc. Working harder to
get a less accurate answer isn't really appealing even if the absolute
overhead is low.
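The fragmentation hazard behind that point can be seen in a toy model (all names hypothetical, not PostgreSQL's memory-context code): tuples live inside allocator chunks, and a chunk can be returned only when every tuple in it is gone, so evicting tuples scattered across chunks may free no chunk at all.

```c
#include <stdbool.h>
#include <string.h>

#define NCHUNKS 4
#define TUPLES_PER_CHUNK 8

/* Toy model of hash-table memory: each tuple occupies a slot in a
 * fixed allocator chunk.  Illustrative only. */
static bool live[NCHUNKS][TUPLES_PER_CHUNK];

static void
fill_table(void)
{
    memset(live, true, sizeof(live));
}

/* Evict a single tuple, as a per-tuple eviction policy would. */
static void
evict_tuple(int chunk, int slot)
{
    live[chunk][slot] = false;
}

/* A chunk stays pinned in memory while any tuple in it survives. */
static int
chunks_in_use(void)
{
    int n = 0;

    for (int c = 0; c < NCHUNKS; c++)
    {
        for (int t = 0; t < TUPLES_PER_CHUNK; t++)
        {
            if (live[c][t])
            {
                n++;
                break;
            }
        }
    }
    return n;
}
```

With per-tuple eviction, half the tuples can be gone while every chunk is still pinned; only a whole-table flush reliably releases chunks, which is why chunk-level accounting pairs naturally with the flush-everything strategy.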

As Andres mentioned, it also matters what kind of thing we evict. If
we push raw input tuples to a spool file in lieu of creating new
groups and deal with those in a second pass with a new hash table, we
don't need anything extra at all. If the second pass uses sort +
group, we need the aggregates to be both hashable and sortable. If we
evict transition tuples rather than input tuples, we need
serialization + deserialization functions. If we do that in a way
that might create multiple transition values for the same combination
of grouping values, then we also need combine functions.
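The ladder of requirements in that paragraph could be summarized as a small lookup, shown here with hypothetical names purely to make the dependencies explicit; this is not an actual PostgreSQL data structure.

```c
#include <stdbool.h>

/* Hypothetical enumeration of the spill strategies discussed above. */
typedef enum SpillStrategy
{
    SPILL_RAW_TUPLES_REHASH,    /* spool input tuples, second hash pass */
    SPILL_RAW_TUPLES_SORT,      /* spool input tuples, sort + group pass */
    SPILL_TRANS_VALUES,         /* evict per-group transition values */
    SPILL_TRANS_VALUES_SPLIT    /* may duplicate a group across spills */
} SpillStrategy;

/* Extra per-aggregate infrastructure a strategy requires. */
typedef struct AggSupportNeeded
{
    bool sortable;      /* grouping keys must also be sortable */
    bool serialize;     /* serialization + deserialization functions */
    bool combine;       /* combine function for partial transitions */
} AggSupportNeeded;

static AggSupportNeeded
support_needed(SpillStrategy s)
{
    AggSupportNeeded need = {false, false, false};

    switch (s)
    {
        case SPILL_RAW_TUPLES_REHASH:
            break;                      /* nothing extra at all */
        case SPILL_RAW_TUPLES_SORT:
            need.sortable = true;
            break;
        case SPILL_TRANS_VALUES_SPLIT:
            need.combine = true;
            /* fall through: split transition values still need serialization */
        case SPILL_TRANS_VALUES:
            need.serialize = true;
            break;
    }
    return need;
}
```

Reading the table off the function: spooling raw tuples for a rehash pass needs nothing new, which is part of its appeal, while evicting transition values pulls in progressively more of the aggregate-support machinery.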

Maybe that's not exactly what Tomas (or you) meant by eviction
strategy, though. Not sure. But it does seem to me that we need to
pick the algorithm we're going to use, more or less, in order to
decide what infrastructure we need, and at least to some extent, we
ought to let our choice of algorithm be informed by the desire not to
need too much infrastructure.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
