Re: Thousands of tables versus on table?

From: Scott Marlowe <smarlowe(at)g2switchworks(dot)com>
To: david(at)lang(dot)hm
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, Thomas Andrews <tandrews(at)soliantconsulting(dot)com>, Mark Lewis <mark(dot)lewis(at)mir3(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Thousands of tables versus on table?
Date: 2007-06-06 15:07:19
Message-ID: 4666CDA7.2030205@g2switchworks.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

david(at)lang(dot)hm wrote:
> On Tue, 5 Jun 2007, Tom Lane wrote:
>
>> david(at)lang(dot)hm writes:
>>> however I really don't understand why it is more efficiant to have a 5B
>>> line table that you do a report/query against 0.1% of then it is to
>>> have
>>> 1000 different tables of 5M lines each and do a report/query against
>>> 100%
>>> of.
>>
>> Essentially what you are doing when you do that is taking the top few
>> levels of the index out of the database and putting it into the
>> filesystem; plus creating duplicative indexing information in the
>> database's system catalogs.
>>
>> The degree to which this is a win is *highly* debatable, and certainly
>> depends on a whole lot of assumptions about filesystem performance.
>> You also need to assume that constraint-exclusion in the planner is
>> pretty doggone cheap relative to the table searches, which means it
>> almost certainly will lose badly if you carry the subdivision out to
>> the extent that the individual tables become small. (This last could
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> what is considered 'small'? a few thousand records, a few million
> records?

I would say small is when the individual tables are in the 10 to 20
Megabyte range. How many records that is depends on record width, of
course. Basically, once the tables get small enough that you don't
really need indexes much, since you tend to grab 25% or more of each one
that you're going to hit in a query.

> what multiplication factor would there need to be on the partitioning
> to make it worth while? 100 tables, 1000 tables, 10000 tables?
Really depends on the size of the master table I think. If the master
table is about 500 Megs in size, and you partition it down to about 1
meg per child table, you're probably ok. Walking through 500 entries
for constraint exclusion seems pretty speedy from the tests I've run on
a 12M row table that was about 250 Megs, split into 200 to 400 or so
equisized child tables. The time to retrieve 85,000 rows that were all
neighbors went from 2 to 6 seconds, to about 0.2 seconds, and we got rid
of indexes entirely since they weren't really needed anymore.

> the company that I'm at started out with a seperate database per
> customer (not useing postgres), there are basicly zero cross-customer
> queries, with a large volume of updates and lookups.
>
> overall things have now grown to millions of updates/day (some
> multiple of this in lookups), and ~2000 customers, with tens of
> millions of rows between them.
>
> having each one as a seperate database has really helped us over the
> years as it's made it easy to scale (run 500 databases on each server
> instead of 1000, performance just doubled)
I think that for what you're doing, partitioning at the database level
is probably a pretty good compromise solution. Like you say, it's easy
to put busy databases on a new server to balance out the load. Hardware
is cheap.

> various people (not database experts) are pushing to install Oracle
> cluster so that they can move all of these to one table with a
> customerID column.
Have these people identified a particular problem they're trying to
solve, or is this a religious issue for them? From your description it
sounds like a matter of dogma, not problem solving.
> the database folks won't comment much on this either way, but they
> don't seem enthusiastic to combine all the data togeather.
I think they can see the fecal matter heading towards the rotational
cooling device on this one. I can't imagine this being a win from the
perspective of saving the company money.

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Scott Marlowe 2007-06-06 15:17:29 Re: Thousands of tables versus on table?
Previous Message Tom Lane 2007-06-06 14:40:41 Re: weird query plan