RE: [CORE] WAL & RC1 status

From: Vadim Mikheev <vadim4o(at)email(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-core(at)postgreSQL(dot)org
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: RE: [CORE] WAL & RC1 status
Date: 2001-03-03 18:46:06
Message-ID: 386541213.983645166958.JavaMail.root@web274-ec
Lists: pgsql-hackers

> I've reported the major problems to the mailing lists
> but gotten almost no feedback about what to do.

I can't comment without access to the code :-(

> commit: 2001-02-26 17:19:57
> 0/0059996C: prv 0/00599948; xprv 0/00000000; xid 0;
> RM 0 info 00 len 32
> checkpoint: redo 0/0059996C; undo 0/00000000; sui 29;
> nextxid 18903; nextoid 35195; online
> -- this is the last normal-looking checkpoint record.
> -- Judging from the commit timestamps surrounding prior
> -- checkpoints, checkpoints were happening every five
> -- minutes approximately on the 5-minute mark, so

You can't count on this: the postmaster starts the
checkpoint "maker" 5 minutes *after* the previous
checkpoint was created, not 5 minutes after the previous
"maker" started. And a checkpoint can take *minutes*.
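Below is a rough model of the scheduling described above. It assumes the 5-minute interval is measured from the end of the previous checkpoint. The function name and the 300-second constant are illustrative and are not taken from the postmaster source. The model shows why checkpoint times drift off a fixed 5-minute grid once checkpoints take non-trivial time:

```python
from typing import List, Tuple

CHECKPOINT_INTERVAL = 300  # seconds; assumed 5-minute setting (illustrative)


def checkpoint_times(first_start: int, durations: List[int],
                     interval: int = CHECKPOINT_INTERVAL) -> List[Tuple[int, int]]:
    """Return (start, end) pairs for successive checkpoints.

    Assumption: each checkpoint starts `interval` seconds after the
    previous one *finished*, so time spent inside a checkpoint pushes
    every later checkpoint back.
    """
    times = []
    start = first_start
    for duration in durations:
        end = start + duration
        times.append((start, end))
        start = end + interval  # next "maker" run is measured from this end
    return times


print(checkpoint_times(0, [60, 120, 30]))
```

With checkpoints taking 60, 120 and 30 seconds, they start at 0, 360 and 780 seconds rather than at 0, 300 and 600. Inferring when a checkpoint happened from a fixed 5-minute mark is therefore unreliable.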

> -- this one happened about 17:20.
> -- (There really should be a timestamp
> -- in the checkpoint records...)

Agreed.
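Here is a sketch of a checkpoint record that carries its own timestamp, written in Python for brevity. The fields mirror those in the dump quoted above (redo, undo, sui, nextxid, nextoid, online/shutdown). The `time` field, the class and helper names, and the output format are hypothetical and do not describe the actual 7.1 on-disk layout:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Tuple

XLogPtr = Tuple[int, int]  # (xlogid, xrecoff), printed as "%X/%08X" in the dump


def fmt_ptr(ptr: XLogPtr) -> str:
    xlogid, xrecoff = ptr
    return f"{xlogid:X}/{xrecoff:08X}"


@dataclass
class CheckpointRecord:
    redo: XLogPtr
    undo: XLogPtr
    sui: int        # startup id
    nextxid: int
    nextoid: int
    online: bool    # False would mean a shutdown checkpoint
    time: int       # HYPOTHETICAL: Unix time (UTC) when the checkpoint was taken

    def describe(self) -> str:
        kind = "online" if self.online else "shutdown"
        stamp = datetime.fromtimestamp(self.time, tz=timezone.utc)
        return (
            f"checkpoint: redo {fmt_ptr(self.redo)}; undo {fmt_ptr(self.undo)}; "
            f"sui {self.sui}; nextxid {self.nextxid}; nextoid {self.nextoid}; "
            f"{kind}; time {stamp:%Y-%m-%d %H:%M:%S}"
        )


# Values from the quoted dump; the timestamp itself is made up.
example = CheckpointRecord((0, 0x59996C), (0, 0), 29, 18903, 35195, True,
                           int(datetime(2001, 2, 26, 17, 20, tzinfo=timezone.utc).timestamp()))
print(example.describe())
```

With a field like this, a dump tool could print the checkpoint time directly instead of guessing it from nearby commit records.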

> commit: 2001-02-26 17:26:02
> ReadRecord: record with zero len at 0/005A4B4C
> -- My dump program is unhappy here because the rest
> -- of the page is zero. Given that there is a
> -- continuation record at the start of the next
> -- page, there certainly should have been record(s)
> -- here. But it's worse than that: check the commit
> -- timestamps and the xid numbers before and after the
> -- discontinuity. Did time go backwards here?

Commit timestamps are taken *before* the XLogInsert call,
which can suspend the backend for some time (in a
multi-user environment). Out-of-order xids are also OK,
generally.
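The following toy example shows why commit timestamps need not increase in WAL order when the timestamp is taken before XLogInsert. The two transactions, their xids and the times are all made up. Each commit carries the time it was stamped and the later time its record actually reached the log:

```python
from typing import List, Tuple

# (xid, time the commit timestamp was taken, time the record reached the WAL)
Commit = Tuple[int, int, int]


def wal_order(commits: List[Commit]) -> List[Tuple[int, int]]:
    """Return (xid, commit_timestamp) pairs in the order they appear in the log."""
    in_log = sorted(commits, key=lambda c: c[2])  # log order = insertion order
    return [(xid, stamped) for xid, stamped, _inserted in in_log]


def timestamp_regressions(records: List[Tuple[int, int]]):
    """Adjacent pairs where the later log record carries an earlier timestamp."""
    return [(a, b) for a, b in zip(records, records[1:]) if b[1] < a[1]]


# Made-up example: xid 19040 is stamped at t=100 but waits inside the
# insert until t=105; xid 19041 is stamped at t=102 and inserts at t=103.
log = wal_order([(19040, 100, 105), (19041, 102, 103)])
print(log)
print(timestamp_regressions(log))
```

In this example, xid 19041 lands in the log before xid 19040. A dump read in log order therefore shows both the timestamp and the xid going backwards, and nothing is actually wrong.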

> -- Also notice the back-pointers in the first valid
> -- record on the next page; they point not into the
> -- zeroed space, which would suggest a mere failure
> -- to write a buffer after filling it, but into the
> -- middle of one of the valid records on the prior
> -- page. It almost looks like page 5A6000 came from
> -- a completely different run than page 5A4000.
> Unexpected page info flags 0001 at offset 5A6000
> Skipping unexpected continuation record at offset 5A6000
> 0/005A6904: prv 0/005A48B4(?); xprv 0/005A48B4; xid 19047;
                  ^^^^^^^^^^          ^^^^^^^^^^
Same value in both. So TX 19047 really did insert a
record at position 0/005A48B4.
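This small sketch shows the page arithmetic behind the back-pointer observation. It assumes 8 KB WAL pages, which is consistent with the 0x5A4000 and 0x5A6000 page boundaries in the dump. It uses only the pointer values quoted above; the helper names are hypothetical:

```python
XLOG_PAGE_SIZE = 0x2000  # assumed 8 KB WAL pages


def parse_ptr(text: str):
    """Parse an "xlogid/xrecoff" pointer (both hex) into a tuple of ints."""
    xlogid, xrecoff = text.split("/")
    return int(xlogid, 16), int(xrecoff, 16)


def page_start(xrecoff: int) -> int:
    return xrecoff - xrecoff % XLOG_PAGE_SIZE


def back_pointer_report(record: str, prv: str, xprv: str, zeroed_from: str) -> dict:
    # All pointers in this dump have xlogid 0, so only offsets are compared.
    rec_off = parse_ptr(record)[1]
    prv_off = parse_ptr(prv)[1]
    zero_off = parse_ptr(zeroed_from)[1]
    return {
        "record_page": page_start(rec_off),
        "prv_page": page_start(prv_off),
        "prv_on_previous_page": page_start(prv_off) == page_start(rec_off) - XLOG_PAGE_SIZE,
        "prv_equals_xprv": parse_ptr(prv) == parse_ptr(xprv),
        "prv_before_zeroed_tail": prv_off < zero_off,
    }


report = back_pointer_report("0/005A6904", "0/005A48B4", "0/005A48B4", "0/005A4B4C")
for key, value in report.items():
    print(key, value if isinstance(value, bool) else hex(value))
```

The first record on page 0x5A6000 points back to 0x5A48B4 on the previous page. That address lies before the zeroed tail starting at 0x5A4B4C, so the back-pointer targets the valid part of the prior page rather than the zeroed space.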

> -- What's even nastier (and the immediate cause of
> -- Scott's inability to restart) is that the pg_control
> -- file's checkPoint pointer points to 0/005AF9F0, which
> -- is *not* the location of this checkpoint, but of
> -- the record after it.

Well, well. The checkpoint position is taken from
MyLastRecord; I wonder how this internal variable could
pick up "invalid" data from a concurrent backend.

OK, we're leaving Krasnoyarsk in 8 hours and should
arrive in SF on Feb 5 at ~10pm.

Vadim
