I've been looking into Pavan Deolasee's recent discovery that when
storing a maximum-length toast tuple, heap_insert uselessly recurses
to toast_insert_or_update, wasting a nontrivial number of cycles.
It turns out there are several interrelated mistakes here, which are
wasting space as well as cycles.
First off, as to the exact nature of what's happening: the toast code
is designed so that when breaking down a large datum, it's divided into
rows with data payloads of exactly TOAST_MAX_CHUNK_SIZE bytes each.
On a 4-byte-MAXALIGN machine, this means the rows have total t_len of
exactly TOAST_TUPLE_THRESHOLD, which is what was intended. However,
that value is not a multiple of 4. Hence when heapam.c compares
MAXALIGN(tup->t_len) > TOAST_TUPLE_THRESHOLD
it decides the tuple needs re-toasting.
I noted before that this does not happen on an 8-byte-MAXALIGN machine,
but had not understood exactly why. The reason is the outer MAXALIGN
call in the definition
#define TOAST_MAX_CHUNK_SIZE (TOAST_TUPLE_THRESHOLD - \
MAXALIGN(offsetof(HeapTupleHeaderData, t_bits)) + \
sizeof(Oid) + \
sizeof(int32) + \
On a 4-byte machine that call doesn't do anything, but on an 8-byte
machine it causes the value of TOAST_MAX_CHUNK_SIZE to be reduced by
4, which means that t_len of a toast row comes out 4 bytes smaller
than on a 4-byte machine, which makes it smaller than
TOAST_TUPLE_THRESHOLD even after maxalign'ing. Hence no recursion.
That MAXALIGN is actually *wrong* now that I look at it: it's
effectively supposing that there is padding alignment after the varlena
length word for the chunk data, which of course there is not. But we
can't change TOAST_MAX_CHUNK_SIZE without forcing an initdb.
Instead we can fix the recursion by removing the MAXALIGN() operations in
heapam.c and tuptoaster.c that compare tuple lengths to the thresholds.
This effectively moves the threshold for tuple compression up a couple
bytes, which is a safe change to make, and makes the comparisons
slightly cheaper to boot. I propose doing that in 8.2 (and maybe older
branches after we get a bit more testing of it).
But the real problem is that we've got sloppy choices of the thresholds
and sizes. In the first place, TOAST_MAX_CHUNK_SIZE is being set at a
value that makes every toast row have two wasted padding bytes after it
(turns out it's the same on both 4- and 8-byte machines, though the
specific size of the rows differs). This is silly, we should be using a
TOAST_MAX_CHUNK_SIZE that makes the actual row length come out at
exactly a MAXALIGN multiple. In the second place, examination of toast
tables will show you that on a page with four maximum-length toast rows,
there are 12 free bytes on a 4-byte machine and 28 free on an 8-byte
machine (not counting the aforementioned padding bytes after each row).
That's fine at first glance; because of alignment considerations it's
actually the best we can do. The trouble is that TOAST_TUPLE_THRESHOLD
is derived from MaxTupleSize, which is derived on the assumption that we
should leave 32 bytes for "special space" on heap pages. If we actually
had such special space, it wouldn't fit. This happens because the
threshold calculation is just
#define TOAST_TUPLE_THRESHOLD (MaxTupleSize / 4)
which fails to account for the "line pointers" needed for all but the
first tuple. These errors cancel out at the moment, but wouldn't if we
changed anything about the page header or special space layout.
What I suggest we do about this in HEAD is:
1. Rename MaxTupleSize to MaxHeapTupleSize, and get rid of the
MaxSpecialSpace allotment in its calculation. We don't use special
space on heap pages and we shouldn't be artificially restricting tuple
length to allow for something that's unlikely to appear in the future.
(Note: yes, I know it's been suggested to keep free-space maps in some
heap pages, but that need not factor into a MaxHeapTupleSize limit: big
tuples can simply go into a page without any free-space map.)
2. Fix TOAST_TUPLE_THRESHOLD and TOAST_TUPLE_TARGET to be correctly
calculated (properly allowing for line pointers) and to be MAXALIGN
multiples. The threshold value should be exactly the size of the
largest tuple that you can put four of onto one page. Fix
TOAST_MAX_CHUNK_SIZE so that it is *not* necessarily a MAXALIGN
multiple, but rather causes the total length of a toast tuple to come
out that way. This guarantees minimum space wastage on toast pages.
This will force initdb due to changing chunk sizes in toast tables, but
unless we're going to reject Heikki's patch to merge cmin/cmax, there is
no hope of an in-place upgrade for 8.3 anyway.
BTW, while I was looking at this I noticed that BTMaxItemSize is
incorrectly calculated as well: it's coming out a couple bytes smaller
than it could safely be. And with a different page header size it
could come out a couple bytes larger, instead :-(. Perhaps this is
related to Heikki's recent observation that there always seemed to be
some extra free space on btree pages? I think the correct calculation
#define BTMaxItemSize(page) \
MAXALIGN_DOWN((PageGetPageSize(page) - \
MAXALIGN(sizeof(PageHeaderData) + 2 * sizeof(ItemIdData)) -
MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
where MAXALIGN_DOWN rounds down to the nearest maxalign multiple,
instead of up.
regards, tom lane
pgsql-hackers by date
|Next:||From: Simon Riggs||Date: 2007-02-02 20:19:44|
|Subject: Re: Referential Integrity and SHARE locks|
|Previous:||From: Simon Riggs||Date: 2007-02-02 19:41:26|
|Subject: Re: Performance penalty of visibility info in indexes?|