On-disk Tuple Size

From: Curt Sampson <cjs(at)cynic(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: jtp <john(at)akadine(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: On-disk Tuple Size
Date: 2002-04-20 08:07:17
Message-ID: Pine.NEB.4.43.0204201608060.467-100000@angelic.cynic.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

[I've moved this discussion about changing the line pointer from four
bytes to two from -general to -hackers, since it's fairly technical.
The entire message Tom is responding to is appended to this one.]

On Sat, 20 Apr 2002, Tom Lane wrote:

> Curt Sampson <cjs(at)cynic(dot)net> writes:
> > ... Then we could declare that all tuples must be aligned on a
> > four-byte boundary, use the top 14 bits of a 16-bit line pointer as the
> > address, and the bottom two bits for the LP_USED and LP_DELETED flag.
> > This would slightly simplify the code for determining the flags, and
> > incidently boost the maximum page size to 64K.
>
> Hmm. Maybe, but the net effect would only be to reduce the minimum row
> overhead from 36 to 34 bytes. Not sure it's worth worrying about.

Well, unless the implementation is hideously complex, I'd say that
every byte is worth worrying about, given the amount of overhead that's
currently there. 36 to 34 bytes could give something approaching a 5%
performance increase for tables with short rows. (Actually, do we prefer
the tables/rows or relations/tuples terminology here? I guess I kinda
tend to use the latter for physical stuff.)

If we could drop the OID from the tuple when it's not being used,
that would be another four bytes, bringing the performance increase
up towards 15% on tables with short rows.

Of course I understand that all this is contingent not only on such
changes being acceptable, but someone actually caring enough to
write them.

While we're at it, would someone have the time to explain to me
how the on-disk CommandIds are used? A quick look at the code
indicates that this is used for cursor consistency, among other
things, but it's still a bit mysterious to me.

> > ... I don't see why we would then
> > need the LP_DELETED flag at all.
>
> I believe we do want to distinguish three states: live tuple, dead
> tuple, and empty space. Otherwise there will be cases where you're
> forced to move data immediately to collapse empty space, when there's
> not a good reason to except that your representation can't cope.

I don't understand this. Why do you need to collapse empty space
immediately? Why not just wait until you can't find an empty fragment
in the page that's big enough, and then do the collapse?

Oh, on a final unrelated note, <john(at)akadine(dot)com>, you're bouncing
mail from my host for reasons not well explained ("550 Access
denied.") I tried postmaster at your site, but that bounces mail
too. If you want to work out the problem, drop me e-mail from some
address at which you can be responded to.

cjs
--
Curt Sampson <cjs(at)cynic(dot)net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC

------- Previous Message --------
>From cjs(at)cynic(dot)net Sat Apr 20 16:56:29 2002
Date: Sat, 20 Apr 2002 13:55:38 +0900 (JST)
From: Curt Sampson <cjs(at)cynic(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: jtp <john(at)akadine(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: [GENERAL] general design question

On Fri, 19 Apr 2002, Tom Lane wrote:

> Right. The *minimum* row overhead in Postgres is 36 bytes (32-byte
> tuple header plus 4-byte line pointer).

Ah, right! The line pointer is four bytes because it includes the length
of the tuple.

But I'm not sure why we need this length, possibly because I don't
understand the function of the LP_USED and LP_DELETED flags in the line
pointer. (I'm guessing that if LP_USED is not set, the line pointer does
not point to any data, and that if LP_DELETED is set, it points to a
chunk of free space.)

Why could we not just make all unallocated space be pointed to by
LP_DELETED pointers, and then when we need space, use it from those
(splitting and joining as necessary)? That gets rid of the need for
a length. Then we could declare that all tuples must be aligned on a
four-byte boundary, use the top 14 bits of a 16-bit line pointer as the
address, and the bottom two bits for the LP_USED and LP_DELETED flag.
This would slightly simplify the code for determining the flags, and
incidently boost the maximum page size to 64K.

If you're willing to use a mask and shift to determine the address,
rather than just a mask, you could make the maximum page size 128K,
use the top 15 bits of the line pointer as the address, and use the
remaining bit as the LP_USED flag, since I don't see why we would then
need the LP_DELETED flag at all.

Or am I smoking crack here?

> AFAIK, all databases have nontrivial per-row overheads; PG might be
> a bit worse than average, but this is a significant issue no matter
> which DB you use.

For certain types of tables, such the sort of table joining two
others for which I forget the proper term:

CREATE TABLE folder_contents (
folder_id int NOT NULL,
item_id int NOT NULL,
PRIMARY KEY (folder_id, item_id))

some databases are much better. In MS SQL server, for example, since
there are no variable length columns, the tuple format will be:

1 byte status bits A
1 byte status bits B
2 bytes fixed-length columns data length
4 bytes DATA: folder_id
4 bytes DATA: item_id
2 bytes number of columns
1 byte null bitmap (unfortunately doesn't go away in SQL
server even when there are no nullable columns)

(If there were variable length columns, you would have after this:
two bytes for the number of columns, 2 bytes per column for the
data offsets within the tuple, and then the variable data.)

So in Postgres this would take, what, 44 bytes per tuple? But in
SQL Server this takes 17 bytes per tuple (including the two byte
line pointer in what they call the page's "row offset array), or
about 40% of the space.

Needless to say, in my last job, where I was dealing with a table
like this with 85 million rows, I was happy for this to be a 1.3
GB table instead of a 3.5 GB table. Not that this made much
performance difference in that application anyway, since, with a
clustered index and typical folder sizes at a couple of dozen to
a hundred or so items, I was basically never going to read more
than one or two pages from disk to find the contents of a folder.

Hm. I guess this really should be on hackers, shouldn't it?

cjs
--
Curt Sampson <cjs(at)cynic(dot)net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Curt Sampson 2002-04-20 08:22:20 Re: On-Disk Tuple Size
Previous Message Martijn van Oosterhout 2002-04-20 07:11:13 Re: general design question

Browse pgsql-hackers by date

  From Date Subject
Next Message Curt Sampson 2002-04-20 08:22:20 Re: On-Disk Tuple Size
Previous Message Martijn van Oosterhout 2002-04-20 07:11:13 Re: general design question