From: | "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu> |
---|---|
To: | Sean Davis <sdavis2(at)mail(dot)nih(dot)gov> |
Cc: | Matthew Foster <matthew(dot)foster(at)noaa(dot)gov>, "pgsql-novice(at)postgresql(dot)org" <pgsql-novice(at)postgresql(dot)org> |
Subject: | Re: Strategy for doing number-crunching |
Date: | 2012-01-04 16:58:47 |
Message-ID: | 20120104165847.GD28556@staff-mud-56-27.rice.edu |
Lists: pgsql-novice
On Wed, Jan 04, 2012 at 11:55:38AM -0500, Sean Davis wrote:
> On Wed, Jan 4, 2012 at 11:36 AM, Matthew Foster <matthew(dot)foster(at)noaa(dot)gov> wrote:
> > We have a database with approximately 130M rows, and we need to produce
> > statistics (e.g. mean, standard deviation, etc.) on the data. Right now,
> > we're generating these stats via a single SELECT, and it is extremely
> > slow...like it can take hours to return results. The queries are variable,
> > so there is no way I can think of to "pre-calculate" the stats.
> >
> > Here is an example query...
> >
> > SELECT count(mae) AS count,
> >        to_char(avg(mae), 'FM990.999') AS mae,
> >        to_char(avg(sqrt(rms)), 'FM990.999') AS rms,
> >        to_char(avg(bias), 'FM990.999') AS bias,
> >        to_char(max(mae), 'FM990.999') AS mae_max,
> >        to_char(min(mae), 'FM990.999') AS mae_min,
> >        to_char(max(bias), 'FM990.999') AS bias_max,
> >        to_char(min(bias), 'FM990.999') AS bias_min,
> >        to_char(max(sqrt(rms)), 'FM990.999') AS rms_max,
> >        to_char(min(sqrt(rms)), 'FM990.999') AS rms_min,
> >        to_char(stddev(sqrt(rms)), 'FM990.999') AS rms_stddev,
> >        to_char(stddev(mae), 'FM990.999') AS mae_stddev,
> >        to_char(stddev(bias), 'FM990.999') AS bias_stddev
> > FROM verify_data.verification_data
> > WHERE model_name='foo'
> >   AND ...several other conditions...
> >
> > As one might imagine, this query pounds the CPU relentlessly for hours.
> >
> > I'm thinking there must be a better approach for doing large amounts of
> > calculations on the data. Any pointers would be greatly appreciated!
>
> I'm assuming that someone will point out a quicker way with postgres,
> so I'll mention another solution.
>
> It may be quicker to pull the data out of the database and calculate
> in-memory using something like R (take a look at the ff package if
> memory is an issue). In R, a numeric vector of 130 million doubles uses
> about 1 GB of memory (a little less). Computing both the mean and the
> standard deviation of that vector takes about 750 ms in total on my Mac
> laptop.
>
> Sean
>
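For Sean's export-and-compute approach, a minimal sketch might look like
this (the output path is a placeholder, and the WHERE clause is
abbreviated from the query above):

    COPY (
        SELECT mae, rms, bias
        FROM verify_data.verification_data
        WHERE model_name = 'foo'   -- ...plus the other conditions...
    ) TO '/tmp/verif.csv' WITH CSV HEADER;

    -- then, on the R side, something like:
    --   v <- read.csv('/tmp/verif.csv')
    --   mean(v$mae); sd(v$mae)

Note that COPY ... TO writes a file on the server and needs the
appropriate privileges; psql's \copy is the client-side equivalent.
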
Maybe pl/R would be an option as well.
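Something like this minimal sketch, assuming the plr extension is
installed (the function name r_mean_sd is made up for illustration):

    CREATE OR REPLACE FUNCTION r_mean_sd(float8[]) RETURNS float8[] AS $$
        # arg1 is the input array; return its mean and standard deviation
        c(mean(arg1), sd(arg1))
    $$ LANGUAGE plr;

    -- usage: aggregate the column into an array inside the database
    SELECT r_mean_sd(array_agg(mae))
    FROM verify_data.verification_data
    WHERE model_name = 'foo';

Of course, array_agg over 130M rows is itself memory-hungry, so this
only sketches the mechanism; querying with pg.spi.exec inside the
function body would avoid building the array in SQL.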
Ken