Re: Index Scans become Seq Scans after VACUUM ANALYSE

From: "J(dot) R(dot) Nield" <jrnield(at)usol(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Curt Sampson <cjs(at)cynic(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Loftis <mloftis(at)wgops(dot)com>, mlw <markw(at)mohawksoft(dot)com>, PostgreSQL Hacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Index Scans become Seq Scans after VACUUM ANALYSE
Date: 2002-06-24 01:29:23
Message-ID: 1024882167.1793.733.camel@localhost.localdomain
Lists: pgsql-hackers

On Sun, 2002-06-23 at 15:36, Bruce Momjian wrote:
> Yes, I don't see writing to two files vs. one to be any win, especially
> when we need to fsync both of them. What I would really like is to
> avoid the double I/O of writing to WAL and to the data file; improving
> that would be a huge win.
>

It is impossible to do what you want. You cannot protect against
partial writes without writing pages twice and calling fdatasync between
them while going through a generic filesystem. The best disk array will
not protect you if the operating system does not align block writes to
the structure of the underlying device. Even with raw devices, you need
special support or knowledge of the operating system and/or the disk
device to ensure that each write request will be atomic to the
underlying hardware.
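To make the "write pages twice with a sync between them" idea concrete, here is a minimal sketch in Python (the function name, the checksum header layout, and the single-slot scratch area are my own illustration, not any system's actual format):

```python
import os, struct, zlib

PAGE_SIZE = 8192
fdatasync = getattr(os, "fdatasync", os.fsync)   # fall back where fdatasync is absent

def careful_write(data_fd, dw_fd, page_no, page):
    """Write `page` twice with a sync between: first into a scratch
    (double-write) area, then in place. A torn write can then hit at
    most one of the two copies, and recovery keeps whichever copy
    still checksums clean."""
    assert len(page) == PAGE_SIZE
    # copy 1: page plus a checksum header so recovery can detect a torn slot
    header = struct.pack("<IQ", zlib.crc32(page), page_no)
    os.pwrite(dw_fd, header + page, 0)
    fdatasync(dw_fd)                 # sequence point: copy 1 is durable
    # copy 2: only now is it safe to overwrite the real page in place
    os.pwrite(data_fd, page, page_no * PAGE_SIZE)
    fdatasync(data_fd)               # slot in the scratch area is reusable
```

The fdatasync between the two writes is the whole point: without it, the filesystem is free to let the in-place overwrite reach the platter first, and a crash then leaves you with no intact copy at all.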

All other systems rely on the fact that you can recover a damaged file
using the log archive. This means downtime in the rare case, but no data
loss. Until PostgreSQL can do this, it will not be acceptable for
truly critical production use. This is not to knock PostgreSQL, because
it is a very good database system, and clearly the best open-source one.
It even has feature advantages over the commercial systems. But at the
end of the day, unless you have complete understanding of the I/O system
from write(2) through to the disk system, the only sure ways to protect
against partial writes are by "careful writes" (in the WAL log or
elsewhere, writing pages twice), or by requiring (and allowing) users to
do log-replay recovery when a file is corrupted by a partial write. As
long as there is a UPS and the operating system doesn't crash, there
should still be no partial writes.

If we log pages to WAL, they are useless when archived (after a
checkpoint). So either we have a separate "log" for them (the ping-pong
file), or we should at least remove them when archived, which makes log
archiving more complex but is perfectly doable.
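The ping-pong file mentioned above can be sketched as two fixed slots that we alternate between, tagged with a sequence number (an LSN would do) so recovery can tell which slot holds the newest intact copy. All names and the on-disk layout here are hypothetical:

```python
import os, struct, zlib

PAGE_SIZE = 8192
SLOT = 16 + PAGE_SIZE                        # per-slot header + page
fdatasync = getattr(os, "fdatasync", os.fsync)

def pingpong_write(fd, page, seq):
    """Alternate between two fixed slots so the previous good copy is
    never overwritten; `seq` decides the slot and marks recency."""
    header = struct.pack("<QI4x", seq, zlib.crc32(page))
    os.pwrite(fd, header + page, (seq % 2) * SLOT)
    fdatasync(fd)

def pingpong_read(fd):
    """Return (seq, page) from the newest slot whose checksum is intact,
    or None if neither slot survives."""
    best = None
    for slot in (0, 1):
        raw = os.pread(fd, SLOT, slot * SLOT)
        if len(raw) < SLOT:
            continue                          # slot never written
        seq, crc = struct.unpack("<QI4x", raw[:16])
        page = raw[16:]
        if zlib.crc32(page) == crc and (best is None or seq > best[0]):
            best = (seq, page)
    return best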

Finally, I would love to hear why we are using the operating system
buffer manager at all. The OS is acting as a secondary buffer manager
for us. Why is that? What flaw in our I/O system does this reveal? I
know that:

>We sync only WAL, not the other pages, except for the sync() call we do
> during checkpoint when we discard old WAL files.

But this is probably not a good thing. We should only be writing blocks
when they need to be on disk. We should not be expecting the OS to write
them "sometime later" and avoid blocking (as long) for the write. If we
need that, then our buffer management is wrong and we need to fix it.
The reason we are doing this is because we expect the OS buffer manager
to do asynchronous I/O for us, but then we don't control the write
order. That is why we have to call fdatasync(): to create "sequence
points".

The reason we have performance problems with either O_DSYNC or fdatasync
on the normal relations is because we have no dbflush process. This
causes an unacceptable amount of I/O blocking by other transactions.
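The blocking is easy to see from the semantics of the flag. With O_DSYNC set on the descriptor, each write() returns only after the data has reached stable storage, so without a background flusher every dirty-page write stalls the foreground process for a full device round-trip. A small illustration (O_DSYNC is POSIX but not universally exposed, so this sketch falls back to a plain write where it is absent):

```python
import os

# Fall back to 0 (an ordinary buffered write) where O_DSYNC is unavailable.
O_DSYNC = getattr(os, "O_DSYNC", 0)

def dsync_write(path, buf):
    """Write `buf` through an O_DSYNC descriptor: the write() call itself
    does not return until the data is durable, which is the source of
    the foreground blocking a dbflush process would hide."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | O_DSYNC, 0o600)
    try:
        return os.write(fd, buf)
    finally:
        os.close(fd)
```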

The ORACLE people were not kidding when they said that they could not
certify Linux for production use until it supported O_DSYNC. Can you
explain why that was the case?

Finally, let me apologize if the above comes across as somewhat
belligerent. I know very well that I can't compete with you guys for
knowledge of the PostgreSQL system. I am still at a loss when I look at
the optimizer and executor modules, and it will take some time before I
can follow discussion of that area. Even then, I doubt my ability to
compare with people like Mr. Lane and Mr. Momjian in experience and
general intelligence, or in the field of database programming and
software development in particular. However, this discussion and a
search of the pgsql-hackers archives reveal this problem to be the KEY
area where PostgreSQL falls short of its commercial competitors, and
the one most often misunderstood.

Sincerely,

J. R. Nield

--
J. R. Nield
jrnield(at)usol(dot)com
