Skip site navigation (1) Skip section navigation (2)

Re: I/O on select count(*)

From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Matthew Wakeling" <matthew(at)flymine(dot)org>
Cc: <pgsql-performance(at)postgresql(dot)org>
Subject: Re: I/O on select count(*)
Date: 2008-05-15 15:08:36
Message-ID: 482C51F4.9070809@enterprisedb.com (view raw or flat)
Thread:
Lists: pgsql-performance
Matthew Wakeling wrote:
> On Thu, 15 May 2008, Heikki Linnakangas wrote:
>> > Is it really safe to update the hint bits in place? If there is a > 
>> power cut in the middle of writing a block, is there a guarantee from 
>> > the disc that the block will never be garbled?
>>
>> Don't know, to be honest. We've never seen any reports of corrupted 
>> data that would suggest such a problem, but it doesn't seem impossible 
>> to me that some exotic storage system might do that.
> 
> Hmm. That problem is what WAL full-page-writes is meant to handle, isn't 
> it? So basically, if you're telling people that WAL full-page-writes is 
> safer than partial WAL, because it avoids updating pages in-place, then 
> you shouldn't be updating pages in-place for the hint bits either. You 
> can't win!

Full-page-writes protect from torn pages, that is, when one half of an 
update hits the disk but the other one doesn't. In particular, if the 
beginning of the page where the WAL pointer (XLogRecPtr) is flushed to 
disk, but the actual changes elsewhere in the page aren't, you're in 
trouble. WAL replay will look at the WAL pointer, and think that the 
page doesn't need to be replayed, while other half of the update is 
still missing.

Hint bits are different. We're only updating a single bit, and it 
doesn't matter from correctness point of view whether the hint bit 
update hits the disk or not. But what would spell trouble is if the disk 
controller/whatever garbles the whole sector, IOW changes something else 
than the changed bit, while doing the update.

>>> In fact, if the tuple's creating transaction has aborted, then the 
>>> tuple can be vacuumed right there and then before it is even written. 
>>
>> Not if you have any indexes on the table. To vacuum, you'll have to 
>> scan all indexes to remove pointers to the tuple.
> 
> Ah. Well, would that be so expensive? After all, someone has to do it 
> eventually, and these are index entries that have only just been added 
> anyway.

Scanning all indexes? Depends on your table of course, but yes it would 
be expensive in general.

> An alternative would be to build a "list of changes" in the WAL without 
> actually changing the underlying index at all. When reading the index, 
> you would read the "list" first (which would be in memory, and in an 
> efficient-to-search structure), then read the original index and add the 
> two. Then when checkpointing, vet all the changes against known aborted 
> transactions before making all the changes to the index together. This 
> is likely to speed up index writes quite a bit, and also allow you to 
> effectively vacuum aborted tuples before they get written to the disc.

There's not much point optimizing something that only helps with aborted 
transactions.

The general problem with any idea that involves keeping a list of 
changes made in a transaction is that that list will grow big during 
bulk loads, so you'll have to overflow to disk or abandon the list 
approach. Which means that it won't help with bulk loads.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

In response to

Responses

pgsql-performance by date

Next:From: Heikki LinnakangasDate: 2008-05-15 15:15:50
Subject: Re: I/O on select count(*)
Previous:From: Guillaume CottenceauDate: 2008-05-15 15:08:16
Subject: Re: which ext3 fs type should I use for postgresql

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group