Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

From: Олег Царев <zabivator(at)gmail(dot)com>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Subject: Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)
Date: 2009-05-10 15:41:27
Message-ID: 54f48e4f0905100841l79595f2ekbfa166e2d586b98a@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I will write separated mail about rollup.
Can you send me some links or information about "CTE"? What is it?
Also, I need some deep knownledge about postgresql aggregation
calculation (executor part) - can you help me (links, books, name of
source files, etc)?.
After get additional information i will can continue discussion.
Thanls
2009/5/10 Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>:
> 2009/5/10 Олег Царев <zabivator(at)gmail(dot)com>:
>> Hello, Pavel.
>>
>> I read you letter, and found some points:
>>
>> a) Simple transformation any GROUPING SETS in GROUP BY + UNION ALL
>> require clone source for every group.
>> It's so slow.
>> My point: we can make for start simple implementation.
>> b) Your sollution it's so good, IMHO
>> WITH q AS (SELECT * FROM some)
>>  SELECT * FROM q GROUP BY a
>>  UNION ALL
>>  SELECT * FROM q GROUP BY b
>> But can you describe me how it's used on planner (calculate cost) and exector.
>
> look on CTE source code or ask to CTE author.
>
>> Sharing source - it's mind - we don't process tree, we process
>> DIRECTED GRAPH. Hard, many bugs, so on, not?
>
> for example - code for DISTINCT is shared with code for aggregation.
> My idea is based on similarity
>
> GROUPING SETS is subset of CTE, nonrecursive CTE is subset of UNION ALL
>
>> PostgreSQL support sharing nodes of executor? How? Where i can read?
>
> I don't know, what do you thing exactly sharing nodes? Before CTE I
> had to write special holder node, but it isn't necessary now.
>
>> Next. Ok, we spool source. After we use simple hash group/sort group +
>> union all? Or more advanced techniques?  If different group by - it's
>> not different from my idea, it's logic additional (spool source for
>> fast rewind)
>
> I thing so trivial implementation via UNION ALL should work, but any
> optimisations means lot of work - and when you add rewind
> functionality, you will got trivial nonrecursive CTE implementation.
> When I wrote patch, there wasn't possible an use CTE, so I played wit
> own nodes. Now, the more clean solution is probably based on current
> CTE base.
>
>>
>> c) Don't forget also about node requiments - if we use hash table as
>> container for source -  we reduce some sorting properties - and query
>> select A,B from TABLE group by ((A,B),(B)) order by a,b require
>> sorting on output of grouping sets.
>> It's mind - we need insert sort.
>
> Yes, probably there should be possible some techniques - that should
> to eliminate final sort op. I am not sure if it is really important.
> Now I thinking what is the bigger devil a) patch complexity b) some
> suboptimality  (mainly redundant call of agg nodes). For me a.
> Aggregates are not gratis, but significantly more expensive is
> repeated seq data scan. So it is first goal. Later we should to thing
> about next optimalizations.
>
>>
>> d) We can make SORT ROLLUP and SORT REDUCE ROLLUP.
>> first - sort group by use sorting for grouping data and calculate aggregations.
>> So, 'order by A,B,C' also 'order by A,B' also order by 'A' also order by ''
>> What it's mind? It's mind we can use sort features for implement SORT
>> ROLLUP. For every sub group we calculate self aggregates... oh, i bad
>> say. Look on "select a,sum(b) from table group by a".  Sort group
>> grouping by a, and for every a-group calculate aggregations (sum), for
>> ROLLUP A,B we can make three aggregations ; for A,B than A than () -
>> and use one path on source calculate aggregations on every step, and
>> for split source on different subsets.
>> It's more perfomance, than different group by and union all.
>
> Maybe, I don't know. What I know:
>
> * UNION ALL will have better result on indexed sets and MIN, MAX
> aggregates. CTE isn't optimized now for these cases.
> * UNION ALL will be slow for large sets (over cache),
> * CTE needs some more optimalizations,
> * pg core hackers dislike big and complex patches,
> * pg core hackers like any improved current code base.
>
> I am thinking so Grouping Sets based on CTE should be more commitable
> code. It doesn't mean so your ideas are wrong, but these
> optimalization should to work on CTE too.
>
> select * from table group by rollup(a,b,c)
>
> have to have generate same plan as
>
> with q as (select * from table)
>  select * from q group by a,b,c
>  union all
>  select * from q group by a,b
>  union all
>  select * from q group by a
>  union all
>  select * from q;
>
> and CTE is more general then Grouping Sets, so it is better do
> optimalization over CTE than Grouping Sets.
>
> Regards
> Pavel Stehule
>>
>> Look for MS SQL:
>> http://www.kodyaz.com/articles/sql-server-2005-cube-rollup-cannot-compute-distinct-aggregates.aspx
>> Why MS SQL don't support distinct aggregations? I think - because
>> every distinct aggregations in MS SQL require hash, and many
>> aggregations - it's so slow.
>>
>> Thank you!
>> 2009/5/10 Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>:
>>> Hello
>>>
>>> some other info is on  http://wiki.postgresql.org/wiki/Grouping_Sets
>>>
>>> Maybe in e few weak I'll have some a prototype based on CTE. My older
>>> technique based on hashed tables should be well, but it carry lot of
>>> unshared code. Using CTE means so we can optimize CTE and GROUPING
>>> SETS together. I plan to transform query
>>>
>>> SELECT * FROM some GROUP BY GROUPING SETS(a, b)
>>>
>>> to
>>>
>>> WITH q AS (SELECT * FROM some)
>>>  SELECT * FROM q GROUP BY a
>>>  UNION ALL
>>>  SELECT * FROM q GROUP BY b
>>>
>>>
>>> 2009/5/10 Олег Царев <zabivator(at)gmail(dot)com>:
>>>> Hello all.
>>>> Please, approve my ideas for implementation.
>>>>
>>>> Standart has feature T431: Extended grouping capabilities.
>>>> This feature i found in TODO-list:
>>>> http://wiki.postgresql.org/wiki/Todo -> SQL Commands -> TO DO
>>>>
>>>> MS SQL 2005 partial support this feature:
>>>> http://www.kodyaz.com/articles/sql-server-2005-cube-rollup-cannot-compute-distinct-aggregates.aspx
>>>> http://blogs.msdn.com/craigfr/archive/2007/09/21/aggregation-with-rollup.aspx
>>>>
>>>> MS SQL 2008 support this feature:
>>>> http://blogs.msdn.com/craigfr/archive/2007/10/11/grouping-sets-in-sql-server-2008.aspx
>>>>
>>>> Oracle support this feature:
>>>> http://www.compshack.com/sql/oracle-group-rollup
>>>>
>>>> So, it's short notes about GROUPING SETS, but more complete
>>>> information have in a official documentation of MS SQL and Oracle
>>>> (copyright limited for send as attach).
>>>>
>>>> First. GROUPG SETS.
>>>>
>>>> select A,B,C,SUM(D) from table group by GROUPING SETS( (A,B,C), (A),
>>>> () ) - it's example of use grouping sets.
>>>> Semantic of this construction - make group by over source more, than
>>>> one group of column.
>>>> It's very wide key - A,B C. In result set of this example we can find
>>>> result set of select   select A,B,C,SUM(D) from table group by A,B,C -
>>>> as subset. It's mind: "GROUP BY A,B,C" - subset of "GROUP BY GROUPING
>>>> SETS( (A,B,C), (A), () )
>>>> Two subset - is GROUP BY A B, and instead C column we look NULL.
>>>> Third subset - GROUP BY (), instead A,B,C - NULL, one row - it's name
>>>> "GRAND TOTAL". - calculate over all subset without grouping
>>>>
>>>> Also have function "GROUPING"  it's function say about null - "real
>>>> null" (from table) or generated by "GROUP BY GROUPING SETS"
>>>>
>>>> My point: this feature can implement over GROUP BY and UNION ALL
>>>> We can make decomposition of "GROUP BY GROUPING SETS( (A,B,C),(A),()
>>>> )" to  select A,B,C fron table GROUP BY A,B,C .UNION  ALL select
>>>> A,B,NULL from table group BY A,B UNION ALL NUll,NUll,NULL from table
>>>> group by();
>>>>
>>>> So, it's very simple, don't require modification of executor and
>>>> callibrate cost - only parser and semantic anylysis,
>>>> '
>>>> So, ROLLUP(A1,...An) is alias to "GROUP BY GROUPING SETS(
>>>> (A1,...,An),(A1,...,An-1),... (An-1,An),(An),() ),
>>>> CUBE - analogue.
>>>>
>>>> If this idea it's good -  i can write code base on old patch
>>>> http://archives.postgresql.org/pgsql-hackers/2008-10/msg00838.php or
>>>> from clean list (as you wish).
>>>
>>> This is suboptimal. When SELECT * FROM X GROUP BY GROUPING SETS ...,
>>> where X is some joined data, then you repeat JOIN on every grouping
>>> set. So your solution is simple for implementation, but it should be
>>> really slow.
>>>
>>> Regards
>>> Pavel Stehule
>>>
>>>>
>>>> In future i know how to implement ROLLUP more optimal (executor
>>>> iterator) and use this ROLLUP for optimisation another GROUP BY,
>>>> GROUPING SETS.
>>>>
>>>> Thanks.
>>>>
>>>> --
>>>> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
>>>> To make changes to your subscription:
>>>> http://www.postgresql.org/mailpref/pgsql-hackers
>>>>
>>>
>>
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2009-05-10 23:23:45 Re: [BUGS] BUG #4721: All sub-tables incorrectly included in search plan for partitioned table
Previous Message Pavel Stehule 2009-05-10 15:11:39 Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)