Re: When should I worry?

From: Robert Treat <robert(at)omniti(dot)com>
To: pgsql-general(at)postgresql(dot)org
Cc: Tom Allison <tom(at)tacocat(dot)net>, Joe Conway <mail(at)joeconway(dot)com>, Bill Moran <wmoran(at)potentialtech(dot)com>
Subject: Re: When should I worry?
Date: 2007-06-11 05:23:35
Message-ID: 200706110123.36331.robert@omniti.com
Lists: pgsql-general

On Sunday 10 June 2007 18:25, Tom Allison wrote:
> On Jun 10, 2007, at 2:14 PM, Joe Conway wrote:
> > Bill Moran wrote:
> >> Tom Allison <tom(at)tacocat(dot)net> wrote:
> >>> If the user base gets to 100 or more, I'll be hitting a billion
> >>> rows before too long. I add about 70,000 rows per user per day.
> >>> At 100 users this is 7 million rows per day. I'll hit a billion
> >>> in 142 days, call it six months for simplicity.
> >>>
> >>> The table itself is small (two columns: bigint, int) but I'm
> >>> wondering when I'll start to hit a knee in performance and how I
> >>> can monitor that. I know where I work (day job) they have Oracle
> >>> tables with a billion rows that just plain suck. I don't know
> >>> if a billion is bad or if the DBA's were not given the right
> >>> opportunity to make their tables work.
> >>>
> >>> But if they are any indication, I'll be feeling some hurt when I
> >>> exceed a billion rows. Am I going to just fold up and die in six
> >>> months?
> >
> > A lot depends on your specific use case.
> >
> > - Will you be just storing the data for archival purposes, or
> > frequently querying the data?
> >
> > - If you need to run queries, are they well bounded to certain
> > subsets of the data (e.g. a particular range of time for a
> > particular user) or are they aggregates across the entire billion
> > rows?
> >
> > - Is the data temporal in nature, and if so do you need to purge it
> > after some period of time?
> >
> > As an example, I have an application with temporal data, that needs
> > periodic purging, and is typically queried for small time ranges
> > (tens of minutes). We have set up partitioned tables (partitioned
> > by date range and data source -- akin to your users) using
> > constraint exclusion that contain 3 or 4 billion rows (total of all
> > partitions), and we have no problem at all with performance. But I
> > suspect that if we needed to do an aggregate across the entire
> > thing it would not be particularly fast ;-)
> >
> >> Why not just create a simulation of 100 users and run it as hard
> >> as you can until it starts to degrade? Then you'll have some
> >> real-world experience to tell you how much you can handle.
> >
> > This is good advice. Without much more detail, folks on the list
> > won't be able to help much, but with a simulation such as this you
> > can answer your own question...
> >
> > Joe
>
> Good questions. I guess there are two answers. There are times when
> I will want aggregate data and I'm not as concerned about the
> execution time.
> But there are other queries that are part of the application design.
> These are always going to be of a type where I know a single specific
> primary key value and I want to find all the rows that are related.
>
>
> First table has a column:
> idx serial primary key
> Third table has a column:
> idx bigserial primary key
>
> and a second table (the billion-row table) consists of two columns:
> first_idx integer not null references first(idx) on delete cascade,
> third_idx bigint not null references third(idx) on delete cascade,
> constraint pkey_first_third primary key (first_idx, third_idx)
>
> The common query will be:
>
> select t.string
> from first f, second s, third t
> where f.idx = s.first_idx
> and s.third_idx = t.idx
> and f.idx = 4 (or whatever...).
>
> So, I think the answer is that the data isn't going to be temporal or
> otherwise segregated into subsets.
> I'll assume this is a lead in for partitions?
> The data will be queried very frequently. Probably plan on a query
> every 10 seconds and I don't know what idx ranges will be involved.
> Would it be possible to partition this by the first_idx value? An
> improvement?
>

It sounds like it should be possible to partition on the first_idx value
(either by range or maybe a modulo-type operation). It will certainly be an
improvement over a single billion-row table (we have a couple, and queries
across large portions of them are painful).

The other thing the test will tell you is how well your hardware will hold up.
Speccing proper hardware to handle these types of loads can be tricky, so
verifying that what you have will hold up is also of some value before you
have huge amounts of data you have to move around.
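
If you want to rough out the simulation Bill suggested, something as crude as
generate_series can get you in the ballpark (the numbers below are invented,
and it assumes first and third already hold matching ids -- or that you drop
the foreign keys for the duration of the test):

  -- one simulated "day": ~100 users x ~70,000 rows each
  INSERT INTO second (first_idx, third_idx)
  SELECT u.id, d.id
  FROM generate_series(1, 100) AS u(id),
       generate_series(1, 70000) AS d(id);

  -- then watch the plan and timing of the common query as the table grows
  EXPLAIN ANALYZE
  SELECT t.string
  FROM first f, second s, third t
  WHERE f.idx = s.first_idx
    AND s.third_idx = t.idx
    AND f.idx = 4;

Repeat the insert for each simulated day (shifting the third_idx range so you
don't trip over the primary key) and keep an eye on index sizes, cache hit
rates, and vacuum times as you approach the row counts you expect.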

--
Robert Treat
Database Architect
http://www.omniti.com/
