Re: Windowing Function Patch Review -> Standard Conformance

From: "David Rowley" <dgrowley(at)gmail(dot)com>
To: "'Hitoshi Harada'" <umi(dot)tanuki(at)gmail(dot)com>
Cc: "'Heikki Linnakangas'" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Windowing Function Patch Review -> Standard Conformance
Date: 2008-12-05 21:44:44
Message-ID: 233A00BB1F264BB3B18EC9E89CC4CD14@amd64
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hitoshi Harada wrote:
> I tested the spool performance with David's earlier bigtable:
>
> CREATE TABLE bigtable (
> id SERIAL NOT NULL PRIMARY KEY,
> timestamp TIMESTAMP NOT NULL
> );
>
> -- about 383MB of data
> INSERT INTO bigtable (timestamp)
> SELECT NOW() + (CAST(RANDOM() * 10 AS INT) || ' secs')::INTERVAL
> FROM generate_series(1,10000000);
>
> CREATE INDEX bigtable_timestamp_idx ON bigtable (timestamp);
>
> VACUUM ANALYZE bigtable;
>
> sample=# SELECT COUNT(*) FROM bigtable;
> count
> ----------
> 10000000
> (1 row)
>
> sample=# SELECT LEAD(timestamp) OVER (ORDER BY id) FROM bigtable LIMIT 1;
> lead
> ----------------------------
> 2008-12-02 00:15:10.288461
> (1 row)
>
> sample=# EXPLAIN ANALYZE SELECT LEAD(timestamp) OVER (ORDER BY id)
> FROM bigtable LIMIT 1;
>
> QUERY PLAN
>
> --------------------------------------------------------------------------
> --------------------
> ---------------------------------------------------
> Limit (cost=0.00..0.04 rows=1 width=12) (actual time=0.038..0.039
> rows=1 loops=1)
> -> Window (cost=0.00..386612.13 rows=10000000 width=12) (actual
> time=0.036..0.036 rows=1
> loops=1)
> -> Index Scan using bigtable_pkey on bigtable
> (cost=0.00..286612.13 rows=10000000 w
> idth=12) (actual time=0.018..0.021 rows=2 loops=1)
> Total runtime: 0.071 ms
> (4 rows)
>
>
> shows quite good result. Great work.

Amazing improvement!

Old patch:
david=# select timestamp,lag(timestamp,1) over (order by id) from bigtable
order by id limit 1;
timestamp | lag
----------------------------+-----
2008-11-10 21:55:16.498458 |
(1 row)

Time: 105205.055 ms

New patch:
david=# select timestamp,lag(timestamp,1) over (order by id) from bigtable
order by id limit 1;
timestamp | lag
----------------------------+-----
2008-12-04 22:05:22.687975 |
(1 row)

Time: 1.640 ms

>
> The following query works on my build:
>
> > SELECT depname,SUM(SUM(salary)) OVER (ORDER BY depname) FROM empsalary
> GROUP
> > BY depname;
> > ERROR: variable not found in subplan target list
>

This works fine on my build now too.

>
> Now, I am thinking about refactoring around aggregate common code, and
> renaming WFunc to WinFunc, which leads pg_proc.proiswfunc be
> pg_proc.proiswinfunc and so on if no objections come.
>

I've spent last night and tonight trying to break the patch and I've not
managed it.

I spent 2 and a half hours on the train last night reading over the patch
mainly for my own interest. I also went over the documentation and I have a
few suggestions for improvement:

My modifications to the lead and lag syntax can be improved. Currently the
optional parameters make it look like DEFAULT can be specified without
OFFSET. This is not the case:

+ <type>any, [integer], [any]</type>

Should be:

+ <type>any [,integer [,any] ]</type>

And:

+ lag(<replaceable class="parameter">value</replaceable>,
[<replaceable
+ class="parameter">offset</replaceable>], [<replaceable
+ class="parameter">default</replaceable>])

Should be:

+ lag(<replaceable class="parameter">value</replaceable> [,
<replaceable
+ class="parameter">offset</replaceable> [,<replaceable
+ class="parameter">default</replaceable>] ])

Same for LEAD()

+ <para>
+ After <literal>WHERE</> and <literal>GROUP BY</> process,
+ rows might be windowed table, using the <literal>WINDOW</>
+ clause.
+ </para>

I think I know what you mean here. My re-write seems to have turned the
sentence into a paragraph. Please tell me if I've assumed the meaning
wrongly:

"After the <literal>WHERE</>, <literal>GROUP BY</> and <literal>HAVING</>
clauses one or more <literal>WINDOW</> clauses can be specified. This will
allow window functions to be specified in the <literal>SELECT</> clause.
These window functions can make use of the <literal>WINDOW</> clauses by
making reference to the alias name of the window rather than explicitly
specifying the properties of the window in each <literal>OVER</> clause."

+ <para>
+ Another expample shows different capability of window functions
+ from above.

Small typo: example instead of eapample

+ <para>
+ The same window definitions can be named and put togather into one
+ definition using <xref linkend="queries-window">.

Small typo: together instead of togather.

+ <para>
+ Window functions doesn't accept DISTINCT and ALL syntax, even though
+ the function is an aggregate function.
+ </para>

Small grammar error: doesn't should be replaced with don't. Or perhaps
change to:

"Window functions, unlike normal aggregate functions, do not allow DISTINCT
or ALL to be used within the function argument list."

+ Window functions are not placed in any of GROUP BY, HAVING and
+ WHERE clauses, which process values before any of the windows. If
+ there is need to qualify rows by the result of window functions,
+ whole of the query must be nested and append WHERE clause outer of
+ the current query.

I think this one maybe needs an example to back it up. It's quite an
important thing and I'm sure lots of people will need to do this. I'm not
100% happy with my new paragraph either but can't see how to word it any
better.

"Window functions cannot be used in the WHERE, GROUP BY or HAVING clauses
of the query. If there is a need to filter rows, group results or filter
rows after aggregation takes place (HAVING) then the query must be nested.
The query should contain the window functions in the inner query and apply
the additional clauses that contain the results from the window function in
the outer query, such as:

SELECT depname,
empno,
salary,
enroll_date
FROM (SELECT depname,
empno,
salary,
enroll_date,
ROW_NUMBER() OVER (PARTITION BY depname ORDER BY salary,empno)
AS pos
FROM empsalary
) AS e
WHERE pos = 1;

In the above query the we're filtering and only showing the results from the
inner query where the ROW_NUMBER() value is equal to 1."

But of course the above query would be more simple using DISTINCT ON. Maybe
there is a better example... My previous marathon getting the person in 2nd
place might be better but that's introducing another previously unknown
table to the manual.

+ * A window function is defined as a function matked as wfunc in pg_proc.
By
+ * this mark, it means the function can handle window function APIs that
allow
+ * it to access arbitrary random rows within the window.

Small typo: marked instead of matked

Maybe this fragment should read:

"A window function is a function that can be used by the window function
API. This allows access arbitrary random rows within the window. These
functions are identified by the proiswfunc column in pg_proc."

Or pg_proc.proiswfunc if you've renamed.

David.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2008-12-05 22:05:41 Re: Windowing Function Patch Review -> Standard Conformance
Previous Message Alex Hunsaker 2008-12-05 21:32:57 Re: new libpq SSL connection option