Re: Mount options for Ext3?

From: Kevin Brown <kevin(at)sysexperts(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Mount options for Ext3?
Date: 2003-01-25 04:13:19
Message-ID: 20030125041319.GE28252@filer
Lists: pgsql-hackers pgsql-performance

Tom Lane wrote:
> Kevin Brown <kevin(at)sysexperts(dot)com> writes:
> > I was presuming that when a savepoint occurs, a marker is written to
> > the log indicating which transactions had been committed to the data
> > files, and that this marker was paid attention to during database
> > startup.
>
> Not quite. The marker says that all datafile updates described by
> log entries before point X have been flushed to disk by the checkpoint
> --- and, therefore, if we need to restart we need only replay log
> entries occurring after the last checkpoint's point X.
>
> This has nothing directly to do with which transactions are committed
> or not committed. If we based checkpoint behavior on that, we'd need
> to maintain an indefinitely large amount of WAL log to cope with
> long-running transactions.
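
To check my understanding of "point X", here is a toy model in
Python. Everything in it is made up for illustration: the "data
files" are a dict, and each WAL record is a plain (key, value)
assignment that is safe to apply more than once. It is not how the
real WAL is laid out. It only shows that replaying from point X gives
the same result as replaying the whole log, and that nothing in it
depends on which transactions have committed.

```python
def replay(datafiles, records):
    """Apply WAL records in order to a copy of the data files."""
    state = dict(datafiles)
    for key, value in records:
        state[key] = value
    return state

wal = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]

# A checkpoint at point X = 3 means the effects of wal[:3] are on disk.
point_x = 3
flushed = replay({}, wal[:point_x])

# Recovery only needs the records after point X ...
from_checkpoint = replay(flushed, wal[point_x:])

# ... and ends up in the same state as replaying the entire log.
from_scratch = replay({}, wal)
```

In this model the only cost of ignoring the checkpoint is the extra
records that have to be read and applied again.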

Ah. My apologies for my imprecise wording. I should have said
"...indicating which transactions had been written to the data files"
instead of "...had been committed to the data files", and meant to say
"checkpoint" but instead said "savepoint". I'll try to do better
here.

> The actual checkpoint algorithm is
>
>     take note of current logical end of WAL (this will be point X)
>     write() all dirty buffers in shared buffer arena
>     sync() to ensure that above writes, as well as previous ones,
>         are on disk
>     put checkpoint record referencing point X into WAL; write and
>         fsync WAL
>     update pg_control with new checkpoint record, fsync it
>
> Since pg_control is what's examined after restart, the checkpoint is
> effectively committed when the pg_control write hits disk. At any
> instant before that, a crash would result in replaying from the
> prior checkpoint's point X. The algorithm is correct if and only if
> the pg_control write hits disk after all the other writes mentioned.
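
For my own benefit, the ordering of those steps can be sketched
roughly as follows. This is only a sketch under loud assumptions: the
file names, the record formats and the checkpoint() function are
invented here, and a per-file os.fsync() stands in for the
system-wide sync() in the real algorithm. The point is just that
pg_control is written and fsynced last.

```python
import os
import tempfile

def checkpoint(datadir, dirty_buffers, wal_end):
    """Follow the quoted steps in order (hypothetical file layout)."""
    # 1. Take note of the current logical end of WAL: this is point X.
    point_x = wal_end

    # 2. write() all dirty buffers, then
    # 3. force them to disk (fsync here is a stand-in for sync()).
    with open(os.path.join(datadir, "datafile"), "ab") as f:
        for buf in dirty_buffers:
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())

    # 4. Put a checkpoint record referencing point X into WAL; fsync it.
    with open(os.path.join(datadir, "wal"), "ab") as f:
        f.write(b"CHECKPOINT %d\n" % point_x)
        f.flush()
        os.fsync(f.fileno())

    # 5. Update pg_control last, and fsync it. Only once this write is
    #    on disk does a restart begin replaying from the new point X.
    with open(os.path.join(datadir, "pg_control"), "wb") as f:
        f.write(b"%d\n" % point_x)
        f.flush()
        os.fsync(f.fileno())

    return point_x

with tempfile.TemporaryDirectory() as d:
    x = checkpoint(d, [b"page1", b"page2"], wal_end=42)
    with open(os.path.join(d, "pg_control"), "rb") as f:
        recorded = int(f.read().decode())
```

A crash anywhere before step 5 completes leaves the old pg_control in
place, so recovery falls back to the prior checkpoint's point X.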

[...]

> > So suppose the marker makes it to the log but not all of the data the
> > marker refers to makes it to the data files. Then the system crashes.
>
> I think that this analysis is not relevant to what we're doing.

Agreed. That analysis applies only when synchronous writes by the
database are turned off and one is left to rely on the operating
system to do the right thing; clearly it doesn't apply when
synchronous writes are enabled. As long as only one process handles a
checkpoint, two guarantees would be sufficient to preserve the
integrity of the database even if all synchronous writes by the
database were turned off: an operating system that commits a
process's writes to disk in the same order that they were requested,
and a journalling filesystem that at least writes all data before
committing the associated metadata transactions. This would hold even
if the operating system reordered writes from different processes
relative to one another. It suggests an operating system feature that
could be considered highly desirable (and relates to the discussion
elsewhere about trading off shared buffers against OS file cache:
it's often better to rely on the abilities of the OS than to roll
your own mechanism).
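
To make the ordering argument concrete, here is a small model of it
in Python. It is a sketch with invented names, and it covers only the
single-process ordering guarantee, not the journalling filesystem
part. The assumption is that a crash leaves on disk some prefix of
the sequence in which the writes actually reached the disk.

```python
from itertools import permutations

# Writes issued by the single checkpointing process, in program order.
issued = ["data buffers", "WAL checkpoint record", "pg_control"]

def consistent(on_disk):
    """pg_control may be on disk only if all earlier writes are too."""
    if "pg_control" not in on_disk:
        return True  # recovery falls back to the prior checkpoint
    earlier = issued[:issued.index("pg_control")]
    return set(earlier) <= set(on_disk)

# Ordered writes: a crash can only leave a prefix of the issued order.
ordered_states = [issued[:n] for n in range(len(issued) + 1)]
ordered_ok = all(consistent(s) for s in ordered_states)

# Reordered writes: a crash can leave a prefix of any permutation.
reordered_states = [
    list(p[:n])
    for p in permutations(issued)
    for n in range(len(issued) + 1)
]
reordered_ok = all(consistent(s) for s in reordered_states)
```

With ordered writes every crash state is consistent. Once reordering
is allowed, states such as pg_control alone on disk become possible,
which is exactly the case the sync() and fsync calls rule out today.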

One question I have is: in the event of a crash, why not simply replay
all the transactions found in the WAL? Is the startup time of the
database that badly affected if pg_control is ignored?

If there exists somewhere a reasonably succinct description of the
reasoning behind the current transaction management scheme (including
an analysis of the pros and cons), I'd love to read it and quit
bugging you. :-)

--
Kevin Brown kevin(at)sysexperts(dot)com
