From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Greg Sabino Mullane <greg(at)turnstep(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Subject: Re: Final Patch for GROUPING SETS
Date: 2014-12-13 04:37:48
Message-ID: 87fvcks1og.fsf@news-spur.riddles.org.uk
Lists: pgsql-hackers

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

>> I'd already explained in more detail way back when we posted the
>> patch. But to reiterate: the ChainAggregate nodes pass through
>> their input data unchanged, but on group boundaries they write
>> aggregated result rows to a tuplestore shared by the whole
>> chain. The top node returns the data from the tuplestore after its
>> own output is completed.
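
(To restate that mechanism in code form: here is a toy, standalone C
model of the dataflow -- not the patch code, and all names here are
invented -- with one chain node grouping by (a) below a top node
computing the grand total:)

/*
 * Toy model of the ChainAggregate dataflow (not the patch code).
 * A chain node passes its input through unchanged; at each group
 * boundary it writes one aggregated row into a store shared by the
 * whole chain.  The top node emits its own result first, then
 * returns the contents of the shared store.
 */
#include <stdio.h>

typedef struct { int a; int v; } Row;

/* stand-in for the shared tuplestore of aggregated result rows */
static Row store[16];
static int store_n = 0;

int
main(void)
{
    /* input already sorted by a, as the planner would arrange */
    Row input[] = {{1, 10}, {1, 20}, {2, 5}, {2, 5}, {3, 7}};
    int nrows = sizeof(input) / sizeof(input[0]);

    int cur_a = input[0].a;
    int group_sum = 0;          /* chain node: GROUP BY (a) */
    int grand_sum = 0;          /* top node: GROUP BY () */

    for (int i = 0; i < nrows; i++)
    {
        /* chain node: on a group boundary, spill one result row
           to the shared store, then keep passing rows through */
        if (input[i].a != cur_a)
        {
            store[store_n].a = cur_a;
            store[store_n++].v = group_sum;
            cur_a = input[i].a;
            group_sum = 0;
        }
        group_sum += input[i].v;

        /* the same row continues up the chain unchanged, and the
           top node folds it into its own grouping set */
        grand_sum += input[i].v;
    }

    /* chain node: flush the final group */
    store[store_n].a = cur_a;
    store[store_n++].v = group_sum;

    /* top node: emit its own output first ... */
    printf("():    sum = %d\n", grand_sum);

    /* ... then return everything from the shared store */
    for (int i = 0; i < store_n; i++)
        printf("(a=%d): sum = %d\n", store[i].a, store[i].v);

    return 0;
}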

Tom> That seems pretty grotty from a performance+memory consumption
Tom> standpoint. At peak memory usage, each one of the Sort nodes
Tom> will contain every input row,

Has this objection ever been raised for WindowAgg, which has the same
issue?

Tom> and the shared tuplestore will contain every output row.

Every output row except those produced by the top node, and since this
is after grouping, that's expected to be smaller than the input.

Tom> That will lead to either a lot of memory eaten, or a lot of
Tom> temp-file I/O, depending on how big work_mem is.

Yes. Though note that this code only kicks in when dealing with
grouping sets more complex than a simple rollup. A CUBE of two
dimensions uses only one Sort node above whatever is needed to produce
sorted input, and a CUBE of three dimensions uses only two. (The
number of extra Sorts does grow quite a lot for large cubes, though.)
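
To spell that out: CUBE(a,b) expands to the grouping sets (a,b), (a),
(b) and (), which are covered by two rollup chains:

    input sorted by (a,b):  (a,b) -> (a) -> ()
    one extra Sort on (b):  (b)

CUBE(a,b,c) expands to eight sets, coverable by three chains -- for
example (a,b,c) -> (a,b) -> (a) -> () from the input sort, (b,c) ->
(b) under a Sort on (b,c), and (c,a) -> (c) under a Sort on (c,a) --
hence only two extra Sorts.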

Tom> In principle, with the CTE+UNION approach I was suggesting, the
Tom> peak memory consumption would be one copy of the input rows in
Tom> the CTE's tuplestore plus one copy in the active branch's Sort
Tom> node. I think a bit of effort would be needed to get there (ie,
Tom> shut down one branch's Sort node before starting the next,
Tom> something I'm pretty sure doesn't happen today).

Correct, it doesn't.

However, I notice that having ChainAggregate shut down its input would
also have the effect of bounding the memory usage (to two copies,
which is as good as the append+sorts+CTE case).
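
(Concretely: with N input rows and k sorted chains, the current code
can have all k Sorts populated at once, i.e. roughly k*N rows held in
sort memory or temp files; shutting each Sort down as soon as its
ChainAggregate has exhausted it caps that at about 2N -- the full copy
being read plus the one Sort above it that is still filling.)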

Is shutting down and reinitializing parts of the plan really feasible
here? Or would it be a case of forcing a rescan?

>> Second was to generate a CTE for the input data. This didn't get
>> very far because everything that already exists to handle CTE
>> nodes assumes that they are explicit in the planner's input (that
>> they have their own Query node, etc.) and I was not able to
>> determine a good solution.

Tom> Seems like restructuring that wouldn't be *that* hard. We
Tom> probably don't want it to be completely like a CTE for planning
Tom> purposes anyway --- that would foreclose passing down any
Tom> knowledge of desired sort order, which we don't want. But it
Tom> seems like we could stick a variant of CtePath atop the chosen
Tom> result path of the scan/join planning phase. If you like I can
Tom> poke into this a bit.

Please do.

That seems to cover the high-priority issues from our point of view.

We will continue working on the other issues, on the assumption that
when we have some idea how to do it your way, we will rip out the
ChainAggregate stuff in favour of an Append-based solution.
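
(For the archives, the plan shape we understand that to mean -- a
sketch only, node names approximate -- looks like this for CUBE(a,b):

    Append
      -> GroupAggregate  [sets (a,b), (a), ()]
           -> Sort (a,b)
                -> scan of the shared input tuplestore
      -> GroupAggregate  [set (b)]
           -> Sort (b)
                -> scan of the shared input tuplestore

with each branch's Sort shut down before the next branch starts, so
that the peak is one copy of the input in the tuplestore plus one in
the active branch's Sort.)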

--
Andrew (irc:RhodiumToad)
