Skip site navigation (1) Skip section navigation (2)

Re: I/O on select count(*)

From: Matthew Wakeling <matthew(at)flymine(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: I/O on select count(*)
Date: 2008-05-15 12:37:34
Message-ID: (view raw, whole thread or download thread mbox)
Lists: pgsql-performance
On Wed, 14 May 2008, Alvaro Herrera wrote:
> Hint bits are used to mark tuples as created and/or deleted by
> transactions that are know committed or aborted.  To determine the
> visibility of a tuple without such bits set, you need to consult pg_clog
> and possibly pg_subtrans, so it is an expensive check.

So, as I understand it, Postgres works like this:

1. You begin a transaction. Postgres writes an entry into pg_clog.
2. You write some tuples. Postgres writes them to the WAL, but doesn't
     bother fsyncing.
3. At some point, the bgwriter or a checkpoint may write the tuples to the
     database tables, and fsync the lot.
4. You commit the transaction. Postgres alters pg_clog again, writes that
     to the WAL, and fsyncs the WAL.
5. If the tuples hadn't already made it to the database tables, then a
     checkpoint or bgwriter will do it later on, and fsync the lot.
6. You read the tuples. Postgres reads them from the database table, looks
     in pg_clog, notices that the transaction has been committed, and
     writes the tuples to the database table again with the hint bits set.
     This write is not WAL protected, and is not fsynced.

This seems like a good architecture, with some cool characteristics, 
mainly that at no point does Postgres have to hold vast quantities of data 
in memory. I have two questions though:

Is it really safe to update the hint bits in place? If there is a power 
cut in the middle of writing a block, is there a guarantee from the disc 
that the block will never be garbled?

Is there a way to make a shortcut and have the hint bits written the first 
time the data is written to the table? One piece of obvious low-hanging 
fruit would be to enhance step five above, so that the bgwriter or 
checkpoint that writes the data to the database table checks the pg_clog 
and writes the correct hint bits. In fact, if the tuple's creating 
transaction has aborted, then the tuple can be vacuumed right there and 
then before it is even written. For OLTP, almost all the hint bits will be 
written first time, and also the set of transactions that will be looked 
up in the pg_clog will be small (the set of transactions that were active 
since the last checkpoint), so its cache coherency will be good.

However, this idea does not deal well with bulk data loads, where the data 
is checkpointed before transaction is committed or aborted.


Now, you would have thought these coefficients would be integers, given that
we're working out integer results. Using a fraction would seem really
stupid. Well, I'm quite willing to be stupid here - in fact, I'm going to
use complex numbers.                    -- Computer Science Lecturer

In response to


pgsql-performance by date

Next:From: Heikki LinnakangasDate: 2008-05-15 12:52:31
Subject: Re: I/O on select count(*)
Previous:From: Matthew WakelingDate: 2008-05-15 12:23:57
Subject: Re: which ext3 fs type should I use for postgresql

Privacy Policy | About PostgreSQL
Copyright © 1996-2017 The PostgreSQL Global Development Group