Re: big database resulting in small dump

From: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
To: Ilya Ivanov <forn(at)ngs(dot)ru>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: big database resulting in small dump
Date: 2012-07-21 01:41:09
Message-ID: 500A08B5.7050000@ringerc.id.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 07/21/2012 02:05 AM, Ilya Ivanov wrote:
> I have a 8.4 database (installed on ubuntu 10.04 x86_64). It holds
> Zabbix database. The database on disk takes 10Gb. SQL dump takes only
> 2Gb. I've gone through
> http://archives.postgresql.org/pgsql-general/2008-08/msg00316.php and
> got some hints. Naturally, the biggest table is history (the second
> biggest is history_uint. Together they make about 95% of total size).
> I've tried to perform CLUSTER on it, but seemed to be taking forever
> (3 hours and still not completed). So I cancelled it and went with
> database drop and restore. It resulted in database taking up 6.4Gb
> instead of 10Gb. This is a good improvement, but still isn't quite
> what I expect. I would appreciate some clarification.

To elaborate on the answers already posted:

Plain text dumps only contain the data its self. In many databases the
table contents are a small part of the overall database size.
Additionally, data is stored on disk in a structure optimised for speed
of access, not disk space consumption, so the same data can be much more
compact in the dump format.

Finally, data in dumps is much, much more efficiently compressed than it
is in the tables. In tables the main rows aren't compressed at all, and
TOASTed values like big text fields are individually compressed, which
is immensely less space efficient than compressing them all together. On
the other hand, it allows random access where the dump format just
doesn't - kind of important for a database!

The rest of the space is used by:

- Indexes, which can get quite big.

- Free space in tables from deleted rows that haven't yet been replaced
by a new inserted row.
If you have a non-default FILLFACTOR there can be lots of this.

- "bloat" - wasted space in tables and indexes, typically caused by
insufficiently frequent autovacuum

- ... probably more I've forgotten

When you dump and reload you not only get rid of any bloat in your
tables and indexes, but you effectively REINDEX your database.
PostgreSQL can often create much more compact and efficient index
structures when it does a CREATE INDEX on a full table (like when
restoring a dump) than when it does a CREATE INDEX on an empty table
followed by lots of inserts.

--
Craig Ringer

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Chris Angelico 2012-07-21 01:55:04 Re: Select Rows With Only One of Two Values
Previous Message David Johnston 2012-07-20 23:38:29 Re: A Better Way? (Multi-Left Join Lookup)