Implementation of median in PostgreSQL - questions

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Implementation of median in PostgreSQL - questions
Date: 2010-07-05 05:51:56
Message-ID: AANLkTilOSxIlxfMHldZcgAhFGz9CB_1tjgF4MTdfVEnT@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello

I am planning to start to implement median function. I wrote some
array based implementation - it is fast, but I hope, so can be much
faster.

The basic question is method of implementation. It can be implemented
via a) custum aggregate functions or b) executor node.

Adventage of @a variant is simplicity - we need to teach aggregates to
ensure ordered input and ensure to use index only (maybe add flag
ORDERED INPUT [DESC|ASC]). Now PostgreSQL doesn't use a index for scan
to orderd aggregate - it can be a problem for large datasets. I found
some missing info in EXPLAIN about ordered aggregates - there are
showed nothing about sort

pavel(at)postgres:5432=# explain analyze verbose select array_agg(a order
by a) from omega;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1643.00..1643.01 rows=1 width=4) (actual
time=555.091..555.092 rows=1 loops=1)
Output: array_agg(a ORDER BY a)
-> Seq Scan on public.omega (cost=0.00..1393.00 rows=100000
width=4) (actual time=0.050..177.547 rows=100000 loops=1)
Output: a
Total runtime: 555.839 ms
(5 rows)

Probably we have to access a tuple store inside sfunc - when the data
size is out of work memory. And for effective evaluating it needs
patch https://commitfest.postgresql.org/action/patch_view?id=292

variant @b is more complex - but allows more possibilities - idea:
median is one from aggregate executor nodes (theoretically it can call
some custom final function in future - but I don't think about it
now). It has a few advantages:

a) we don't need to modify current aggregates
b) for datasets smaller than working_mem can be used a quickselect
algorithms - www-stat.stanford.edu/~ryantibs/papers/median.pdf
c) for larger datasets we can use integrated external sort with direct
reading - we don't need to stack result to array

I prefer a variant b. It offers a more possibilities - and there are
less chance to break some existing.

comments are welcome

Pavel Stehule

Browse pgsql-hackers by date

  From Date Subject
Next Message Dimitri Fontaine 2010-07-05 09:52:39 Re: pg_archive_bypass
Previous Message Pavel Stehule 2010-07-05 03:31:34 Re: proof concept: do statement parametrization