Re: I/O on select count(*)

From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Matthew Wakeling" <matthew(at)flymine(dot)org>
Cc: <pgsql-performance(at)postgresql(dot)org>
Subject: Re: I/O on select count(*)
Date: 2008-05-15 15:08:36
Message-ID: 482C51F4.9070809@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Matthew Wakeling wrote:
> On Thu, 15 May 2008, Heikki Linnakangas wrote:
>> > Is it really safe to update the hint bits in place? If there is a >
>> power cut in the middle of writing a block, is there a guarantee from
>> > the disc that the block will never be garbled?
>>
>> Don't know, to be honest. We've never seen any reports of corrupted
>> data that would suggest such a problem, but it doesn't seem impossible
>> to me that some exotic storage system might do that.
>
> Hmm. That problem is what WAL full-page-writes is meant to handle, isn't
> it? So basically, if you're telling people that WAL full-page-writes is
> safer than partial WAL, because it avoids updating pages in-place, then
> you shouldn't be updating pages in-place for the hint bits either. You
> can't win!

Full-page-writes protect from torn pages, that is, when one half of an
update hits the disk but the other one doesn't. In particular, if the
beginning of the page where the WAL pointer (XLogRecPtr) is flushed to
disk, but the actual changes elsewhere in the page aren't, you're in
trouble. WAL replay will look at the WAL pointer, and think that the
page doesn't need to be replayed, while other half of the update is
still missing.

Hint bits are different. We're only updating a single bit, and it
doesn't matter from correctness point of view whether the hint bit
update hits the disk or not. But what would spell trouble is if the disk
controller/whatever garbles the whole sector, IOW changes something else
than the changed bit, while doing the update.

>>> In fact, if the tuple's creating transaction has aborted, then the
>>> tuple can be vacuumed right there and then before it is even written.
>>
>> Not if you have any indexes on the table. To vacuum, you'll have to
>> scan all indexes to remove pointers to the tuple.
>
> Ah. Well, would that be so expensive? After all, someone has to do it
> eventually, and these are index entries that have only just been added
> anyway.

Scanning all indexes? Depends on your table of course, but yes it would
be expensive in general.

> An alternative would be to build a "list of changes" in the WAL without
> actually changing the underlying index at all. When reading the index,
> you would read the "list" first (which would be in memory, and in an
> efficient-to-search structure), then read the original index and add the
> two. Then when checkpointing, vet all the changes against known aborted
> transactions before making all the changes to the index together. This
> is likely to speed up index writes quite a bit, and also allow you to
> effectively vacuum aborted tuples before they get written to the disc.

There's not much point optimizing something that only helps with aborted
transactions.

The general problem with any idea that involves keeping a list of
changes made in a transaction is that that list will grow big during
bulk loads, so you'll have to overflow to disk or abandon the list
approach. Which means that it won't help with bulk loads.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Heikki Linnakangas 2008-05-15 15:15:50 Re: I/O on select count(*)
Previous Message Guillaume Cottenceau 2008-05-15 15:08:16 Re: which ext3 fs type should I use for postgresql