Re: LVM snapshots

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Murthy Kambhampaty <murthy(dot)kambhampaty(at)goeci(dot)com>
Cc: Matt Clark <matt(at)ymogen(dot)net>, "'Peter Eisentraut'" <peter_e(at)gmx(dot)net>, pgsql-admin(at)postgresql(dot)org
Subject: Re: LVM snapshots
Date: 2003-04-04 18:02:18
Message-ID: 7901.1049479338@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Murthy Kambhampaty <murthy(dot)kambhampaty(at)goeci(dot)com> writes:
> If somebody wants to try to replicate this:
> 1. Create a table with several hundred thousand records
> 2. Start an "insert into tbl2 from select ..." (multiple ones in parallel
> maximize your chance of hitting this)
> 3. Freeze the filesystem, create a snapshot, unfreeze the filesystem, mount
> the snapshot and copy $PGDATA somewhere else
> 4. Start a postmaster on the copy of the database cluster; run VACUUM
> ANALYZE on tlb2
> you should see a series of
> NOTICE: Rel tbl2: Uninitialized page 'nnnnn' - fixing
> messages

Hmm. This amounts to testing what happens after a Postgres crash (the
copied PGDATA will look exactly like the system had gone down at the
instant of the freeze). In that scenario zeroed pages aren't too
improbable, although I wouldn't expect to see a large number of them.
Are you running with a nondefault WAL_BUFFERS setting?

The way it could happen is like this: you have backend(s) busily
extending the table and filling each page with new tuples. Each time we
extend a table, we actually write a page of zeroes to the kernel ---
this is so that we'll find out right away if we're out of disk space.
Then, *inside the Postgres shared-buffer arena*, the page is initialized
with correct header data and filled with tuples. Unless you are short
on shared buffers, it will probably not be written out to the kernel
until the next checkpoint. So there is an interval where the kernel
thinks that this page of the file exists but contains zeroes.

Meanwhile, WAL entries for the new tuples are being written into the
WAL buffers arena. But they aren't going to be forced to disk until
there's a transaction commit (or the WAL buffers become full).

If the first WAL entry describing a tuple insertion into the new page
gets out to disk (or out to the kernel at least) before the filesystem
freeze or Postgres crash, then WAL replay will initialize the zeroed
page and you won't see any complaints. But there is clearly an interval
where a zeroed page can be added to a table and not initialized by WAL
replay.

This is all okay and no data loss is possible through this mechanism.
We can only lose uncommitted tuples, since a commit would have forced
out the WAL entries. But it's hard to see how you could have more than
approximately WAL_BUFFERS uninitialized pages present due to this
scenario --- a WAL entry can remain unwritten only as long as less than
WAL_BUFFERS worth of subsequent WAL traffic has occurred, so the
physical extensions of the file can't get all that far ahead of what WAL
replay will be able to see.

If you're just seeing a few of these notices per try, then I don't think
there's anything to worry about (especially if they reference pages very
near the end of their tables). Otherwise, there may be some other
behavior involved, and we'd better look more closely.

regards, tom lane

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Robert Treat 2003-04-04 18:03:53 Re: Sizes for all databases
Previous Message Peter Eisentraut 2003-04-04 16:46:25 Re: Speed & Memory Management