Getting rid of cmin and cmax

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Manfred Koizar <mkoi-pg(at)aon(dot)at>
Subject: Getting rid of cmin and cmax
Date: 2006-09-19 13:34:14
Message-ID: 450FF1D6.5070500@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

We currently use 4 int32s to store xmin, xmax, cmin, cmax and xvac on
every heap tuple. That's a lot of overhead, especially on tables with
narrow rows. Reduction in header size would give us considerable space
and I/O savings.

I'm thinking of removing cmin and cmax, and keeping that information in
backend-private memory instead. cmin and cmax are only interesting to
the inserting/deleting transaction, so using precious tuple header space
for that is a waste. This has been discussed before, and Manfred Koizar
even had a patch for 7.4 but didn't submit it because it was incomplete
(http://archives.postgresql.org/pgsql-hackers/2005-09/msg00172.php).
BTW: Manfred, do you still have the patch? It'd be interesting to look
at, even if it's not finished.

This reduces the tuple header size by 4 bytes, or 8 bytes if we can
later get rid of xvac as well.

There's some interesting properties we can exploit in implementing the
backend-private storage:

1. in small OLTP transactions that touch few rows, the cmin/cmax
information will easily fit in memory.

2. if current commandid == 1, every tuple with xmin == current xid is
not visible. Similarly, every tuple with xmax = current xid is visible.
So we don't need to store anything for the first command in a transaction.

3. we can forget modifications by command X as soon as there's no live
snapshots with curcid <= X.

4. we don't need to record the information, if there's no live
snapshots, and we know that the current command is not going to read
existing rows.

These optimizations take care of bulk inserts nicely. In particular,
pg_restore and similar applications wouldn't need to keep track of
inserted records.

By choosing a clever data structure, we might be able to get away
without spilling to disk. For example, a hash table works great for
small transactions. If a transaction does bulk modifications, a bitmap
per relation could be used. And to optimize for the cases when the
information is not actually used, we can just collect the information to
an array, and turn it into a more lookup-friendly data structure the
first time it's needed.

Even if we do have to spill to disk on large transactions that do
massive updates and selects, it would still be a win for most databases.
And it penalizes the transactions that do massive updates, instead of
every transaction in the system as we do now.

Comments?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alban Hertroys 2006-09-19 13:34:47 Re: vista
Previous Message Tomi NA 2006-09-19 13:30:15 Re: vista