Re: Spilling hashed SetOps and aggregates to disk

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: David Gershuni <dgershun(at)cs(dot)cmu(dot)edu>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Spilling hashed SetOps and aggregates to disk
Date: 2018-06-21 23:57:05
Message-ID: 1529625425.3820.81.camel@j-davis.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 2018-06-21 at 13:44 -0700, David Gershuni wrote:
> To handle hash collisions, we can do the following:
>
> 1) We track the current hash code we’re processing, in ascending
> order.
>
> 2) Instead of combining one group at at time, we’ll maintain a list
> of
> all groups we’ve seen that match the current hash code.
>
> 3) When we peak at a spill file, if its next group’s hash code
> matches
> our current hash code, we’ll check to see if that group matches any
> of the
> groups in our list. If so, we’ll combine it with the matching group.
> If not,
> we’ll add this group to our list.
>
> 4) Once the head of each spill file exceeds the current hash code,
> we’ll
> emit our list as output, empty it, and advance to the next hash code.
>
> Does that seem reasonable?

It seems algorithmically reasonable but awfully complex. I don't think
that's a good path to take.

I still only see two viable approaches: 

(1) Spill tuples into hash partitions: works and is a reasonably simple
extension of our existing code. This is basically what the last patch I
posted does (posted before grouping sets, so needs to be updated).

(2) Spill tuples into a sort: I like this approach because it can be
done simply (potentially less code than #1), and could be further
improved without adding a ton of complexity. It may even eliminate the
need for the planner to choose between hashagg and sort. The problem
here is data types that have a hash opclass but no btree opclass. I
might try to sidestep this problem by saying that data types with no
btree opclass will not obey work_mem.

Additionally, there are two other ideas that we might want to do
regardless of which approach we take:

* support evicting transition states from the hash table, so that we
can handle clustered input better

* refactor grouping sets so that phases are separate executor nodes so
that we can undo some of the complexity in nodeAgg.c

Regards,
Jeff Davis

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Craig Ringer 2018-06-22 00:24:45 Re: PATCH: backtraces for error messages
Previous Message Bruce Momjian 2018-06-21 23:46:35 Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)