Skip site navigation (1) Skip section navigation (2)

Implementation of median in PostgreSQL - questions

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Implementation of median in PostgreSQL - questions
Date: 2010-07-05 05:51:56
Message-ID: AANLkTilOSxIlxfMHldZcgAhFGz9CB_1tjgF4MTdfVEnT@mail.gmail.com (view raw or flat)
Thread:
Lists: pgsql-hackers
Hello

I am planning to start to implement median function. I wrote some
array based implementation - it is fast, but I hope, so can be much
faster.

The basic question is method of implementation. It can be implemented
via a) custum aggregate functions or b) executor node.

Adventage of @a variant is simplicity - we need to teach aggregates to
ensure ordered input and ensure to use index only (maybe add flag
ORDERED INPUT [DESC|ASC]). Now PostgreSQL doesn't use a index for scan
to orderd aggregate - it can be a problem for large datasets. I found
some missing info in EXPLAIN about ordered aggregates - there are
showed nothing about sort

pavel(at)postgres:5432=# explain analyze verbose select array_agg(a order
by a) from omega;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1643.00..1643.01 rows=1 width=4) (actual
time=555.091..555.092 rows=1 loops=1)
   Output: array_agg(a ORDER BY a)
   ->  Seq Scan on public.omega  (cost=0.00..1393.00 rows=100000
width=4) (actual time=0.050..177.547 rows=100000 loops=1)
         Output: a
 Total runtime: 555.839 ms
(5 rows)

Probably we have to access a tuple store inside sfunc - when the data
size is out of work memory. And for effective evaluating it needs
patch  https://commitfest.postgresql.org/action/patch_view?id=292

variant @b is more complex - but allows more possibilities - idea:
median is one from aggregate executor nodes (theoretically it can call
some custom final function in future - but I don't think about it
now). It has a few advantages:

a) we don't need to modify current aggregates
b) for datasets smaller than working_mem can be used a quickselect
algorithms - www-stat.stanford.edu/~ryantibs/papers/median.pdf
c) for larger datasets we can use integrated external sort with direct
reading - we don't need to stack result to array

I prefer a variant b. It offers a more possibilities - and there are
less chance to break some existing.

comments are welcome

Pavel Stehule

pgsql-hackers by date

Next:From: Dimitri FontaineDate: 2010-07-05 09:52:39
Subject: Re: pg_archive_bypass
Previous:From: Pavel StehuleDate: 2010-07-05 03:31:34
Subject: Re: proof concept: do statement parametrization

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group