Re: Performance and WAL on big inserts/updates

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: sailesh(at)cs(dot)berkeley(dot)edu
Cc: Marty Scholes <marty(at)outputservices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Performance and WAL on big inserts/updates
Date: 2004-03-12 04:32:44
Message-ID: 4931.1079065964@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Sailesh Krishnamurthy <sailesh(at)cs(dot)berkeley(dot)edu> writes:
> (Just a note: my comments are not pg-specific .. indeed I don't know
> much about pg recovery).
> ...
> BTW, logging raw datafile blocks would be pretty gross (physical
> logging) and so ARIES logs the changes to each tuple in "logical"
> fashion .. so if only one column changes only that value (before and
> after images) are logged. This is what's called "physiological
> logging".

What PG currently uses is sort of a hybrid approach. The basic WAL
entry normally gives tuple-level detail: a tuple image and location
for an INSERT, a new tuple image and old and new locations for UPDATE,
a tuple location for DELETE, etc. This might be a bit bulkier than
necessary for UPDATEs that change few columns, but it's consistent and
easy to replay.

However, as I mentioned to Marty, the very first change to any disk page
after a checkpoint will dump an entire (post-change) image of that page
into the log, in place of the tuple image or other detail that would
allow us to redo the change to that page. The purpose of this is to
ensure that if a page is only partially written during a power failure,
we have the ability to recreate the entire correct page contents from
WAL by replaying everything since the last checkpoint. Replay is thus
a write-only operation as far as the data files are concerned --- we
only write full pages or modify previously written pages.

(You may be thinking "well, what about a partial write to WAL" --- but
if that happens, that's where we stop replaying WAL. The checksums on
WAL entries allow us to detect the invalid data before we use it.
So we will still have a consistent valid state on disk, showing the
effects of everything up through the last undamaged WAL record.)

BTW this is a one-directional log. We do not care about UNDO, only
replay. There is nothing that need be undone from a failed transaction;
it can only have littered the database with dead tuples, which will
eventually be reclaimed by VACUUM.

> Having said that, page-oriented undo logging can be a pain when
> B-tree pages split.

We don't bother to undo index page splits either ...

regards, tom lane

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Marc G. Fournier 2004-03-12 04:34:03 Re: [HACKERS] The Name Game: postgresql.net vs.
Previous Message Marc G. Fournier 2004-03-12 04:29:50 Re: [HACKERS] The Name Game: postgresql.net vs.