Re: pg_dump and thousands of schemas

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Craig James <cjames(at)emolecules(dot)com>
Cc: Hugo <hugo(dot)tech(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: pg_dump and thousands of schemas
Date: 2012-05-25 03:20:34
Message-ID: CAMkU=1zgUUqT6ZPAFAMR6cFx01Bbk+y05Fmwk63PBHOMrBPPoQ@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Thu, May 24, 2012 at 8:21 AM, Craig James <cjames(at)emolecules(dot)com> wrote:
>
>
> On Thu, May 24, 2012 at 12:06 AM, Hugo <Nabble> <hugo(dot)tech(at)gmail(dot)com> wrote:
>>
>> Hi everyone,
>>
>> We have a production database (postgresql 9.0) with more than 20,000
>> schemas
>> and 40Gb size. In the past we had all that information in just one schema
>> and pg_dump used to work just fine (2-3 hours to dump everything). Then we
>> decided to split the database into schemas, which makes a lot of sense for
>> the kind of information we store and the plans we have for the future. The
>> problem now is that pg_dump takes forever to finish (more than 24 hours)
>> and
>> we just can't have consistent daily backups like we had in the past. When
>> I
>> try to dump just one schema with almost nothing in it, it takes 12
>> minutes.

Sorry, your original did not show up here, so I'm piggy-backing on
Craig's reply.

Is dumping just one schema out of thousands an actual use case, or is
it just an attempt to find a faster way to dump all the schemata
through a back door?

pg_dump itself seems to have a lot of quadratic portions (plus another
one on the server which it hits pretty heavily), and it is hard to know
where to start addressing them. It seems like addressing the overall
quadratic nature might be a globally better option, but addressing
just the problem with dumping one schema might be easier to kluge
together.
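
One quick sanity check to see whether the time is going into that
per-object bookkeeping rather than into the data itself: time a
schema-only dump ("mydb" is again a placeholder):

    # no table contents at all, output discarded; this measures only
    # the catalog-processing side of pg_dump
    time pg_dump --schema-only mydb > /dev/null

If that alone runs for hours, the data is not the problem.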

>> When I try to dump a big schema with lots of information, it takes 14
>> minutes. So pg_dump is clearly lost in the middle of so many schemas. The
>> load on the machine is low (it is a hot standby replica db) and we have
>> good
>> configurations for memory, cache, shared_buffers and everything else. The
>> performance of the database itself is good, it is only pg_dump that is
>> inefficient for the task. I have found an old discussion back in 2007 that
>> seems to be quite related to this problem:
>>
>>
>> http://postgresql.1045698.n5.nabble.com/5-minutes-to-pg-dump-nothing-tp1888814.html
>>
>> It seems that pg_dump hasn't been tested with a huge number of schemas
>> like
>> that. Does anyone have a solution or suggestions? Do you know if there are
>> patches specific for this case?
>
>
> How many total relations do you have?  I don't know if there is a limit to
> the number of schemas, but I suspect when you went from one schema to 20,000
> schemas, you also went from N relations to 20000*N relations.

Yes, that might be important to know--whether the total number of
relations changed, or just their distribution amongst the schemata.
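
If it is easy to check, something like the following would show both
the total and how it is spread across schemas ("mydb" is a
placeholder; the queries are plain catalog reads, so the hot standby
is fine):

    psql -d mydb <<'SQL'
    -- total relations: tables, indexes, sequences, views
    SELECT count(*) FROM pg_class WHERE relkind IN ('r', 'i', 'S', 'v');

    -- top 20 schemas by relation count
    SELECT n.nspname, count(*) AS relations
      FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace
     GROUP BY n.nspname
     ORDER BY relations DESC
     LIMIT 20;
    SQL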

Cheers,

Jeff
