Re: 10 TB database

From: "Dann Corbit" <DCorbit(at)connx(dot)com>
To: "Artur" <a_wronski(at)gazeta(dot)pl>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: 10 TB database
Date: 2009-06-15 18:51:09
Message-ID: D425483C2C5C9F49B5B7A41F89441547029622A7@postal.corporate.connx.com

> -----Original Message-----
> From: pgsql-general-owner(at)postgresql(dot)org [mailto:pgsql-general-
> owner(at)postgresql(dot)org] On Behalf Of Artur
> Sent: Monday, June 15, 2009 5:00 AM
> To: pgsql-general(at)postgresql(dot)org
> Subject: [GENERAL] 10 TB database
>
> Hi!
>
> We are thinking of creating a stocks-related search engine.
> It is an experimental project, just for fun.
>
> The problem is that we expect more than 250 GB of new data every
> month. This data would go into two tables: about 50,000,000 new
> rows every month.
>
> We want to have access to all the data, mostly for generating
> user-requested reports (aggregation).
> We would have about 10 TB of data in three years.
>
> Do you think it is possible to build this with PostgreSQL, and do
> you have any idea how to start? :)

Consider summarizing this data into a data warehouse. Most of the
data will be historical, so the vast majority of it will be
read-mostly (the rare write operations will probably consist mostly
of corrections). You won't want to scan the whole 10 TB every time
you run a summarization query.
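
For instance, a rough sketch of what I mean (the "trades" table and
its columns are invented for illustration):

-- Hypothetical raw table: trades(symbol, traded_at, price, volume).
-- Keep a daily rollup so that reports rarely touch the raw rows.
CREATE TABLE trades_daily_summary (
    symbol    text    NOT NULL,
    trade_day date    NOT NULL,
    row_count bigint  NOT NULL,
    total_vol bigint  NOT NULL,
    avg_price numeric NOT NULL,
    PRIMARY KEY (symbol, trade_day)
);

-- Refresh one day's worth of summary, e.g. from a nightly job:
INSERT INTO trades_daily_summary
SELECT symbol, traded_at::date, count(*), sum(volume), avg(price)
FROM   trades
WHERE  traded_at >= date '2009-06-14'
AND    traded_at <  date '2009-06-15'
GROUP  BY symbol, traded_at::date;

Reports can then aggregate the small summary table instead of
scanning the raw history.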

I have an idea that might make an interesting experiment:

- Create one table per year/month combination.
- Create a view per year that combines its 12 monthly tables.
- Create a global view that combines all the yearly views.

The reason I think this suggestion may have some merit is that the
historical trends will not need to be recalculated on a daily basis
(though it would be nice to be able to run calculations against the
whole pile on rare occasions). Keeping separate tables per month also
keeps the average depth of the B-trees down. I would guess that, for
the most part, the active calculations will run only against recent
data (e.g. the past 6 months to one year or so). It could also be
interesting to expose the N most recent months of data, where N is
supplied on the fly; I do not know how hard that would be as a plain
view, but one possible approach is sketched below.
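
The monthly tables and the stacked views might look roughly like this
(names are invented, and only two months are written out):

-- One physical table per month; the rest follow the same pattern.
CREATE TABLE trades_2009_01 (
    symbol    text,
    traded_at timestamp,
    price     numeric,
    volume    bigint
);
CREATE TABLE trades_2009_02 (LIKE trades_2009_01);

-- A yearly view stitching its months together:
CREATE VIEW trades_2009 AS
    SELECT * FROM trades_2009_01
    UNION ALL
    SELECT * FROM trades_2009_02;  -- UNION ALL the remaining months

-- A global view over the yearly views:
CREATE VIEW trades_all AS
    SELECT * FROM trades_2009;     -- UNION ALL later years as they arrive

-- A plain view cannot take a parameter, but a set-returning SQL
-- function gives the "N most recent months" behavior:
CREATE FUNCTION trades_recent(integer)  -- $1 = number of months back
RETURNS SETOF trades_2009_01 AS $$
    SELECT * FROM trades_all
    WHERE  traded_at >= date_trunc('month', now()) - $1 * interval '1 month';
$$ LANGUAGE sql STABLE;

You would then run, say, SELECT count(*) FROM trades_recent(6). Note
that PostgreSQL's table inheritance combined with the
constraint_exclusion setting is the more usual way to partition by
month, and it lets the planner skip months a query cannot touch.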

If you are going to collect a terrific volume of data like this, I
suggest coupling a mathematics package (SAS, R, Octave, Scilab,
Maxima, etc.) to the database, so that you can effectively support
decisions derived from the data.

You are also going to need high-end hardware to support a database of
this size. These are just some ideas you might like to test when you
start fooling around with the data.

IMO-YMMV
