From: | Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Implementation of median in PostgreSQL - questions |
Date: | 2010-07-05 05:51:56 |
Message-ID: | AANLkTilOSxIlxfMHldZcgAhFGz9CB_1tjgF4MTdfVEnT@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hello
I am planning to start to implement median function. I wrote some
array based implementation - it is fast, but I hope, so can be much
faster.
The basic question is method of implementation. It can be implemented
via a) custum aggregate functions or b) executor node.
Adventage of @a variant is simplicity - we need to teach aggregates to
ensure ordered input and ensure to use index only (maybe add flag
ORDERED INPUT [DESC|ASC]). Now PostgreSQL doesn't use a index for scan
to orderd aggregate - it can be a problem for large datasets. I found
some missing info in EXPLAIN about ordered aggregates - there are
showed nothing about sort
pavel(at)postgres:5432=# explain analyze verbose select array_agg(a order
by a) from omega;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1643.00..1643.01 rows=1 width=4) (actual
time=555.091..555.092 rows=1 loops=1)
Output: array_agg(a ORDER BY a)
-> Seq Scan on public.omega (cost=0.00..1393.00 rows=100000
width=4) (actual time=0.050..177.547 rows=100000 loops=1)
Output: a
Total runtime: 555.839 ms
(5 rows)
Probably we have to access a tuple store inside sfunc - when the data
size is out of work memory. And for effective evaluating it needs
patch https://commitfest.postgresql.org/action/patch_view?id=292
variant @b is more complex - but allows more possibilities - idea:
median is one from aggregate executor nodes (theoretically it can call
some custom final function in future - but I don't think about it
now). It has a few advantages:
a) we don't need to modify current aggregates
b) for datasets smaller than working_mem can be used a quickselect
algorithms - www-stat.stanford.edu/~ryantibs/papers/median.pdf
c) for larger datasets we can use integrated external sort with direct
reading - we don't need to stack result to array
I prefer a variant b. It offers a more possibilities - and there are
less chance to break some existing.
comments are welcome
Pavel Stehule
From | Date | Subject | |
---|---|---|---|
Next Message | Dimitri Fontaine | 2010-07-05 09:52:39 | Re: pg_archive_bypass |
Previous Message | Pavel Stehule | 2010-07-05 03:31:34 | Re: proof concept: do statement parametrization |