Re: Hash support for grouping sets

From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Mark Dilger <hornschnorter(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash support for grouping sets
Date: 2017-03-23 18:22:10
Message-ID: 87y3vw7z9g.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

>>>>> "Mark" == Mark Dilger <hornschnorter(at)gmail(dot)com> writes:

Mark> You define DiscreteKnapsack to take integer weights and double
Mark> values, and perform the usual Dynamic Programming algorithm to
Mark> solve. But the only place you call this, you pass in NULL for
Mark> the values, indicating that all the values are equal. I'm
Mark> confused why in this degenerate case you bother with the DP at
Mark> all. Can't you just load the knapsack from lightest to heaviest
Mark> after sorting, reducing the runtime complexity to O(nlogn)? It's
Mark> been a long day. Sorry if I am missing something obvious.

That's actually a very good question.

(Though in passing I should point out that the runtime cost is going to
be negligible in all practical cases. Even an 8-way cube (256 grouping
sets) has only 70 rollups, and a 4-way cube has only 6; the limit of
4096 grouping sets was only chosen because it was a nice round number
that was larger than comparable limits in other database products. Other
stuff in the grouping sets code has worse time bounds; the
bipartite-match code used to minimize the number of rollups is
potentially O(n^2.5) in the number of grouping sets.)

Thinking about it, it's not at all obvious what we should be optimizing
for here in the absence of accurate sort costs. Right now what matters
is saving as many sort steps as possible, since that's a saving likely
to be many orders of magnitude greater than the differences between two
sorts. The one heuristic that might be useful is to prefer the smaller
estimated size if other factors are equal, just on memory usage grounds,
but even that seems a bit tenuous.

I'm inclined to leave things as they are in the current patch, and maybe
revisit it during beta if we get any relevant feedback from people
actually using grouping sets?

Mark> I'm guessing you intend to extend the code at some later date to
Mark> have different values for items. Is that right?

Well, as I mentioned in a reply to Andres, we probably should eventually
optimize for greatest reduction in cost, and the code as it stands would
allow that to be added easily, but that would require having more
accurate cost info than we have now. cost_sort doesn't take into
consideration the number or types of sort columns or the costs of their
comparison functions, so all sorts of the same input data are costed
equally.

--
Andrew (irc:RhodiumToad)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2017-03-23 18:26:04 Re: createlang/droplang deprecated
Previous Message Jesper Pedersen 2017-03-23 18:19:12 Re: Page Scan Mode in Hash Index