Re: I/O on select count(*)

From: Matthew Wakeling <matthew(at)flymine(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: I/O on select count(*)
Date: 2008-05-15 12:37:34
Message-ID: Pine.LNX.4.64.0805151245360.16756@aragorn.flymine.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Wed, 14 May 2008, Alvaro Herrera wrote:
> Hint bits are used to mark tuples as created and/or deleted by
> transactions that are know committed or aborted. To determine the
> visibility of a tuple without such bits set, you need to consult pg_clog
> and possibly pg_subtrans, so it is an expensive check.

So, as I understand it, Postgres works like this:

1. You begin a transaction. Postgres writes an entry into pg_clog.
2. You write some tuples. Postgres writes them to the WAL, but doesn't
bother fsyncing.
3. At some point, the bgwriter or a checkpoint may write the tuples to the
database tables, and fsync the lot.
4. You commit the transaction. Postgres alters pg_clog again, writes that
to the WAL, and fsyncs the WAL.
5. If the tuples hadn't already made it to the database tables, then a
checkpoint or bgwriter will do it later on, and fsync the lot.
6. You read the tuples. Postgres reads them from the database table, looks
in pg_clog, notices that the transaction has been committed, and
writes the tuples to the database table again with the hint bits set.
This write is not WAL protected, and is not fsynced.

This seems like a good architecture, with some cool characteristics,
mainly that at no point does Postgres have to hold vast quantities of data
in memory. I have two questions though:

Is it really safe to update the hint bits in place? If there is a power
cut in the middle of writing a block, is there a guarantee from the disc
that the block will never be garbled?

Is there a way to make a shortcut and have the hint bits written the first
time the data is written to the table? One piece of obvious low-hanging
fruit would be to enhance step five above, so that the bgwriter or
checkpoint that writes the data to the database table checks the pg_clog
and writes the correct hint bits. In fact, if the tuple's creating
transaction has aborted, then the tuple can be vacuumed right there and
then before it is even written. For OLTP, almost all the hint bits will be
written first time, and also the set of transactions that will be looked
up in the pg_clog will be small (the set of transactions that were active
since the last checkpoint), so its cache coherency will be good.

However, this idea does not deal well with bulk data loads, where the data
is checkpointed before transaction is committed or aborted.

Matthew

--
Now, you would have thought these coefficients would be integers, given that
we're working out integer results. Using a fraction would seem really
stupid. Well, I'm quite willing to be stupid here - in fact, I'm going to
use complex numbers. -- Computer Science Lecturer

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Heikki Linnakangas 2008-05-15 12:52:31 Re: I/O on select count(*)
Previous Message Matthew Wakeling 2008-05-15 12:23:57 Re: which ext3 fs type should I use for postgresql