Chapter 10. Disk Storage

Estimating disk storage requirements for a database.

The following advice is for a machine running Linux and PostgreSQL 7.1 on an ext2 filesystem.

There is no easy way to estimate precisely the storage requirements for a given database. If an accurate size estimate is needed, set up a dummy database with a suitable number of tables, fields and queries, fill it with suitable random data, and operate it as a database for a short time before measuring the disk storage needed to store and manage all the information. Even this method provides only an estimate, and the figures should be reviewed carefully.

Some general guidelines can be provided to assist the assessment. The following figures are taken from the FAQ. They assume the database contains one table of 100,000 rows, each with a 20-character text field and one integer. Including end-of-line bytes, a flat text file containing the same information would be about 2.8 megabytes (MB).

The User's Guide, Chapter 3 "Data Types", gives the storage requirement for each data type. An integer is held in four bytes; with the 20-byte text field, the data in each row occupies 24 bytes. Each row also carries a header of approximately 36 bytes and a 4-byte item pointer on its page, so the total storage required for each row is 64 bytes.

Each data page in PostgreSQL is 8192 bytes (8K), so a page can hold up to 128 of our 64-byte rows, filling the page completely. If, for example, the row size were 129 bytes, only 63 rows would fit, leaving 65 bytes of the page unused. Similarly, our 100,000 rows need 781.25 pages, so the 782nd page is three-quarters empty. As another example, the same 64-byte row could hold a date, an integer and four real (floating point) numbers together with the header and pointer, since the User's Guide shows that each of those fields takes four bytes. Either way, the calculation tells us that the data will occupy 782 pages of 8K each, or about 6.4 MB.

This is not the whole story. One table of 100,000 rows is relatively inexpensive for PostgreSQL to manage. We might reasonably expect that a database of 10,000 tables, each of 10 rows of 64 bytes, would also need 6.4 MB of storage. For comparison, we will assume that our second database has rows containing a date, an integer and four real numbers. We should suspect, however, that PostgreSQL needs more resources to manage a database with a large number of tables, and if there are hidden overheads in managing tables, our database of 10,000 tables should reveal them.

To discover these hidden costs we need some tools. If the database is already in operation, disk usage (man du) is what we want to know. Try

$ du -s * | sort -r -n | head -10

in $PGDATA. That should list the ten largest users of disk space in the PostgreSQL data directory. Further examination (under the base subdirectory) will reveal a directory for each database. Unfortunately, the directories are named with the database OID instead of the database name; the contrib program oid2name translates these numbers into database names.

The xlog (write-ahead log) segments can also take up a lot of space; each segment is usually 16 MB. We can consider ourselves fortunate if our first database of one table consumes "only" 16 MB of xlog, giving a total of about 22 MB.
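If oid2name is not at hand, the same mapping can be read directly from the pg_database system catalog. A minimal sketch, run from psql connected to any database (template1 will do):

template1=> select oid, datname from pg_database order by datname;

Each oid in the result corresponds to a directory $PGDATA/base/<oid>, so the figures reported by du can be attributed to databases by name.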
We can expect our database of 10,000 tables to range in size from 22 MB to 38 MB, and consider ourselves unfortunate if disk usage rises to 54 MB.

That, however, is still not the complete picture. We begin our next search by logging in to template1 and asking for the list of databases:

template1=> \l

This should show all the databases in the system; the list should include template0, template1, lo_demo and any other databases created on the system and still active. We are looking for those bits of information (parts of indexes) that remain behind even after

$ vacuumdb database

is run to clean "dead" information out of a database. Creating and deleting tables within a database is one activity known to cause this problem. To see the extent of the problem, connect to each database and look at relpages, like so:

select relname, relpages from pg_class order by relpages desc limit 10;

This should list the top ten offenders within each database. Each relpage is 8K in size. You need to do a lot of creating and deleting of tables to make this problem noticeable; if your database seems to be growing exponentially and you are adding and deleting tables very frequently, this may be the cause. The only way at present (March 2002) to fix the problem is to dump, delete, re-create and restore the database. It is reasonable to omit this wastage from estimates of the disk space the database will use, but if index bloat is suspected, regular checks and measures to control it should be put in place.

Summary

The crucial question to ask when preparing any estimate is: "How accurate does this need to be?" If your estimate needs to be closer than +100%/-50%, you will need some form of historical information to improve its accuracy. The most cost-effective way of obtaining this is to generate a model of the database, operate it for a short time, and measure the storage needed; a short sketch of such a measurement follows.
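As a closing sketch of that measurement (the database name model_db is illustrative, and relpages is only refreshed by a vacuum, so vacuum first):

$ vacuumdb model_db
$ psql model_db
model_db=> select sum(relpages) * 8192 as bytes_used from pg_class;

The total includes the database's own system catalogs, which is part of the overhead we set out to measure, but it excludes the shared xlog segments discussed above.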