|PostgreSQL 8.2.23 Documentation|
|Prev||Fast Backward||Chapter 52. Database Physical Storage||Fast Forward||Next|
This section provides an overview of TOAST (The Oversized-Attribute Storage Technique).
PostgreSQL uses a fixed page size (commonly 8 kB), and does not allow tuples to span multiple pages. Therefore, it is not possible to store very large field values directly. To overcome this limitation, large field values are compressed and/or broken up into multiple physical rows. This happens transparently to the user, with only small impact on most of the backend code. The technique is affectionately known as TOAST (or "the best thing since sliced bread").
Only certain data types support TOAST — there is no need to impose the
overhead on data types that cannot produce large field values. To
support TOAST, a data type
must have a variable-length (varlena)
representation, in which the first 32-bit word of any stored
value contains the total length of the value in bytes (including
itself). TOAST does not
constrain the rest of the representation. All the C-level
functions supporting a TOAST-able data type must be careful to
handle TOASTed input values.
(This is normally done by invoking
PG_DETOAST_DATUM before doing anything with an
input value, but in some cases more efficient approaches are
TOAST usurps the high-order two bits of the varlena length word, thereby limiting the logical size of any value of a TOAST-able data type to 1 GB (230 - 1 bytes). When both bits are zero, the value is an ordinary un-TOASTed value of the data type. One of these bits, if set, indicates that the value has been compressed and must be decompressed before use. The other bit, if set, indicates that the value has been stored out-of-line. In this case the remainder of the value is actually just a pointer, and the correct data has to be found elsewhere. When both bits are set, the out-of-line data has been compressed too. In each case the length in the low-order bits of the varlena word indicates the actual size of the datum, not the size of the logical value that would be extracted by decompression or fetching of the out-of-line data.
If any of the columns of a table are TOAST-able, the table will have an associated TOAST table, whose OID is stored in the table's pg_class.reltoastrelid entry. Out-of-line TOASTed values are kept in the TOAST table, as described in more detail below.
The compression technique used is a fairly simple and very fast member of the LZ family of compression techniques. See src/backend/utils/adt/pg_lzcompress.c for the details.
Out-of-line values are divided (after compression if used) into chunks of at most TOAST_MAX_CHUNK_SIZE bytes (this value is a little less than BLCKSZ/4, or about 2000 bytes by default). Each chunk is stored as a separate row in the TOAST table for the owning table. Every TOAST table has the columns chunk_id (an OID identifying the particular TOASTed value), chunk_seq (a sequence number for the chunk within its value), and chunk_data (the actual data of the chunk). A unique index on chunk_id and chunk_seq provides fast retrieval of the values. A pointer datum representing an out-of-line TOASTed value therefore needs to store the OID of the TOAST table in which to look and the OID of the specific value (its chunk_id). For convenience, pointer datums also store the logical datum size (original uncompressed data length) and actual stored size (different if compression was applied). Allowing for the varlena header word, the total size of a TOAST pointer datum is therefore 20 bytes regardless of the actual size of the represented value.
The TOAST code is triggered only when a row value to be stored in a table is wider than BLCKSZ/4 bytes (normally 2 kB). The TOAST code will compress and/or move field values out-of-line until the row value is shorter than BLCKSZ/4 bytes or no more gains can be had. During an UPDATE operation, values of unchanged fields are normally preserved as-is; so an UPDATE of a row with out-of-line values incurs no TOAST costs if none of the out-of-line values change.
The TOAST code recognizes four different strategies for storing TOAST-able columns:
PLAIN prevents either compression or out-of-line storage. This is the only possible strategy for columns of non-TOAST-able data types.
EXTENDED allows both compression and out-of-line storage. This is the default for most TOAST-able data types. Compression will be attempted first, then out-of-line storage if the row is still too big.
EXTERNAL allows out-of-line storage but not compression. Use of EXTERNAL will make substring operations on wide text and bytea columns faster (at the penalty of increased storage space) because these operations are optimized to fetch only the required parts of the out-of-line value when it is not compressed.
MAIN allows compression but not out-of-line storage. (Actually, out-of-line storage will still be performed for such columns, but only as a last resort when there is no other way to make the row small enough.)
Each TOAST-able data type specifies a default strategy for columns of that data type, but the strategy for a given table column can be altered with ALTER TABLE SET STORAGE.
This scheme has a number of advantages compared to a more straightforward approach such as allowing row values to span pages. Assuming that queries are usually qualified by comparisons against relatively small key values, most of the work of the executor will be done using the main row entry. The big values of TOASTed attributes will only be pulled out (if selected at all) at the time the result set is sent to the client. Thus, the main table is much smaller and more of its rows fit in the shared buffer cache than would be the case without any out-of-line storage. Sort sets shrink also, and sorts will more often be done entirely in memory. A little test showed that a table containing typical HTML pages and their URLs was stored in about half of the raw data size including the TOAST table, and that the main table contained only about 10% of the entire data (the URLs and some small HTML pages). There was no run time difference compared to an un-TOASTed comparison table, in which all the HTML pages were cut down to 7 kB to fit.
The scheme has disadvantages too, and the documentation is a bit light on cases where it is a good idea *not* to use TOAST...
Suppose you have a table with a text column, where you frequently run regexp searches over the text column (but not frequently enough to merit upgrading the server to store the entire table in RAM, or refactoring the query or schema in some way that allows an index to be useful). Non-trivial regexp searches almost always result in sequential scans, and since we really need to examine the entire TOASTed value for the regexp, the cost of fetching TOASTed tuples is unavoidable.
Using the HTML example on this page, it might be worthwhile to set storage to main for the HTML page column if:
1. The pages of the table do not fit in the shared buffer cache,
2. A common query involves sequentially scanning the entire value of a TOASTed column (e.g. regexp search),
3. The underlying table storage device has a high random page cost (like most disks),
4. Most of the row values are smaller than a page (otherwise you must use TOAST so SET STORAGE MAIN gains nothing),
5. Row updates are infrequent, or vacuums are frequent (otherwise the losses due to updates outweigh the gains during sequential scans).
In these cases doubling the size of the table is a small price to pay if we can read it sequentially from disk with an order of magnitude less seeking.