Re: Strategy for doing number-crunching

From: Sean Davis <sdavis2(at)mail(dot)nih(dot)gov>
To: Matthew Foster <matthew(dot)foster(at)noaa(dot)gov>
Cc: "pgsql-novice(at)postgresql(dot)org" <pgsql-novice(at)postgresql(dot)org>
Subject: Re: Strategy for doing number-crunching
Date: 2012-01-04 16:55:38
Message-ID: CANeAVBmy0dQARTi6F8Fqu_DMtWii-s7rCQrOyEY5Qy9NFur1pA@mail.gmail.com
Lists: pgsql-novice

On Wed, Jan 4, 2012 at 11:36 AM, Matthew Foster <matthew(dot)foster(at)noaa(dot)gov> wrote:
> We have a database with approximately 130M rows, and we need to produce
> statistics (e.g. mean, standard deviation, etc.) on the data.  Right now,
> we're generating these stats via a single SELECT, and it is extremely
> slow...like it can take hours to return results.  The queries are variable,
> so there is no way I can think of to "pre-calculate" the stats.
>
> Here is an example query...
>
> SELECT count(mae) AS count,
>        to_char(avg(mae), 'FM990.999') AS mae,
>        to_char(avg(sqrt(rms)), 'FM990.999') AS rms,
>        to_char(avg(bias), 'FM990.999') AS bias,
>        to_char(max(mae), 'FM990.999') AS mae_max,
>        to_char(min(mae), 'FM990.999') AS mae_min,
>        to_char(max(bias), 'FM990.999') AS bias_max,
>        to_char(min(bias), 'FM990.999') AS bias_min,
>        to_char(max(sqrt(rms)), 'FM990.999') AS rms_max,
>        to_char(min(sqrt(rms)), 'FM990.999') AS rms_min,
>        to_char(stddev(sqrt(rms)), 'FM990.999') AS rms_stddev,
>        to_char(stddev(mae), 'FM990.999') AS mae_stddev,
>        to_char(stddev(bias), 'FM990.999') AS bias_stddev
> FROM verify_data.verification_data
> WHERE model_name='foo'
> AND ...several other conditions...
>
> As one might imagine, this query pounds the CPU relentlessly for hours.
>
> I'm thinking there must be a better approach for doing large amounts of
> calculations on the data.  Any pointers would be greatly appreciated!

I'm assuming that someone will point out a quicker way with postgres,
so I'll mention another solution.

It may be quicker to pull the data out of the database and do the
calculations in memory using something like R (take a look at the ff
package if memory is an issue). In R, a numeric vector of length 130M
uses about 1 GB (a little less). Calculating both the mean and the
standard deviation of such a vector takes a total of 750 ms on my Mac
laptop.
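
For concreteness, here is a rough sketch of that approach. It assumes
the RPostgreSQL package; the connection settings and the WHERE clause
are placeholders you would fill in for your setup, and the column and
table names are taken from the query above:

library(RPostgreSQL)

## Placeholder connection settings -- adjust for your database.
con <- dbConnect(PostgreSQL(), dbname = "verify", host = "localhost",
                 user = "user", password = "secret")

## Pull only the columns the statistics need.
dat <- dbGetQuery(con, "
  SELECT mae, rms, bias
  FROM verify_data.verification_data
  WHERE model_name = 'foo'
  -- ...several other conditions...
")
dbDisconnect(con)

## Compute the same summaries in memory.
rms_sqrt <- sqrt(dat$rms)
c(count   = nrow(dat),
  mae     = mean(dat$mae),  mae_sd  = sd(dat$mae),
  rms     = mean(rms_sqrt), rms_sd  = sd(rms_sqrt),
  bias    = mean(dat$bias), bias_sd = sd(dat$bias))

## Quick check of the in-memory timing on synthetic data:
x <- rnorm(130e6)                # ~1 GB numeric vector
system.time({ mean(x); sd(x) })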

Sean
