Performance of aggregates over set-returning functions

From: "John Smith" <sodgodofall(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Performance of aggregates over set-returning functions
Date: 2008-01-09 02:21:46
Message-ID: b88f0d670801081821r70fb10b4j4c176eed96918434@mail.gmail.com
Lists: pgsql-performance

My set-up:

Postgres 8.2.5 on AMD x86_64 compiled with GCC 3.4.4 on Gentoo Linux 2.6.17
4 GB of RAM
shared_buffers = 1000
work_mem = 1024

This is regarding performance of set-returning functions in queries. I
use generate_series() in the following as an example. The true
motivation is a need for a custom set-returning function that returns
large result sets, on which I'd like to use Postgres to compute groups
and aggregates.

Consider the following query:

postgres=# select count(*) from generate_series(1,100000000000000000);

Watching vmstat while this query ran suggested that the
generate_series() result was being materialized to disk first and the
count computed afterwards (blocks were being written to disk even
though plenty of memory was still available).

A small variation of the query (placing generate_series() in the
select list, with no from clause) exhibited different behaviour:

postgres=# select count(*) from (select
generate_series(1,100000000000000000)) as A;

This time, Postgres seemed to be building the entire
generate_series() result set in memory; it eventually started to swap,
hit the swap space limit, and crashed:

server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

Interestingly, though, when the range of the generate_series() was
small enough to fit in a 4-byte integer (e.g.
generate_series(1,1000000000)), the above query completed while
consuming only a negligible amount of memory. So it looked like the
aggregate computation was being pipelined with the tuples returned
from generate_series().

postgres=# select count(*) from (select
generate_series(1,1000000000)) as A;
count
------------
1000000000
(1 row)

It seems the only difference between the queries is that in the former
case, the generate_series(bigint, bigint) overload is selected,
whereas in the latter case, the generate_series(int, int) overload is
selected. A little investigation seemed to suggest that in the former
case (where the generate_series() was returning bigint), Postgres was
using palloc to store the numbers so that it could pass them by
reference. By contrast, in the latter case, the numbers are small
enough (4-byte int) to be passed by value.
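
One quick way to confirm which overload is in play: the upper bound in
the first query simply does not fit in a 4-byte integer, so only the
bigint variant can be chosen. For example:

postgres=# select 100000000000000000::int4;
ERROR:  integer out of range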

Assuming the aggregate computation is pipelined with the reading of
the tuples from the function (as in the 4-byte case), the aggregate
was probably updated immediately. But the memory so allocated wasn't
freed, and that is what was eating up memory. Is this right? If not,
could you explain why this behaviour was being observed?

First question: why isn't the aggregate computation pipelined with the
read of the tuples from the set-returning function in the first
query's case (i.e., when generate_series appears in the from list)?

Second question: is there a recommended way of using set-returning
functions in queries that enables such pipelining, i.e., without
materializing the result set and without consuming huge amounts of
memory?
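
The only partial workaround I have found, which I'll note for
completeness, is to force the int4 overload by casting the arguments
explicitly when the range fits in 32 bits. This is just a sketch, and
it obviously cannot help for genuinely 64-bit ranges:

postgres=# select count(*) from
postgres-# (select generate_series(1::int4, 1000000000::int4)) as A;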

Third question: more generally, is there a way I can ask Postgres to
report an out-of-memory error the moment it tries to consume more than
a certain percentage of memory (instead of letting it hit swap and
eventually die or thrash)?
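
The only knob I've found so far is at the OS level: capping the
server's address space with ulimit before starting it, so large
allocations fail with an out-of-memory error rather than swapping.
A sketch only; the 2 GB value is arbitrary and the data directory
path is hypothetical:

```shell
# Sketch: cap the server's virtual memory before launching it, so a
# runaway query fails with an out-of-memory error instead of thrashing.
# The 2 GB value and the data directory path below are hypothetical.
ulimit -v 2097152                 # limit address space to 2 GB (value in kB)
ulimit -v                         # verify the new limit
# pg_ctl start -D /var/lib/postgresql/data
```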

Thanks!
- John
