Re: Problems starting up postgres

From: "Mikheev, Vadim" <vmikheev(at)SECTORBASE(dot)COM>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Denis Perchine <dyp(at)perchine(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Problems starting up postgres
Date: 2001-09-06 17:05:52
Message-ID: 3705826352029646A3E91C53F7189E3201676C@sectorbase2.sectorbase.com

> > Sep 6 02:09:30 mx postgres[13468]: [9] FATAL 2:
> > XLogFlush: request(1494286336, 786458) is not satisfied --
> > flushed to (23, 2432317444)

First, note that Denis could just restart with wal_debug = 1
to see the bad request, without any code change. (We should ask
people to set wal_debug on whenever there is a WAL problem...)
Denis, could you provide us with the debug output?

> Yeek. Looks like you have a page somewhere in the database
> with a bogus LSN value (xlog pointer) ... and, most likely,
> other corruption as well.

We got the error during the checkpoint, when the backend flushes pages
changed by REDO (and *only those pages*). So page X (the one with the
bad LSN) was "recovered" from the log. We didn't see CRC errors,
so the log is OK physically. We should find out which page X is
(by setting a breakpoint as suggested by Tom) and then look
into the debug output to see where we got the bad LSN.
Maybe it comes from restored pages or from the checkpoint LSN,
due to errors in XLogCtl initialization, but it certainly looks
like a bug in the WAL code.

> Vadim, what do you think of reducing this elog from STOP to a notice
> on a permanent basis? ISTM we saw cases during 7.1 beta where this

And increase the probability that people will just miss/ignore the
NOTICE, and the bug in WAL will continue to harm others?

> STOP prevented people from recovering, so I'm thinking it does more

And we fixed the bug in WAL that time...

> harm than good to overall system reliability.

There is no reliability with bugs in the WAL code, so I object. But
I'd move the check into the XLogWrite code, to STOP if the flush
request is beyond the write point.
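
To illustrate the kind of check meant here, below is a minimal sketch of
comparing a flush request against the position already written. It is
not the actual PostgreSQL source: the WalPos type and both function
names are hypothetical, and the two-field layout is only assumed from
the (number, number) pairs printed in the log message above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical two-part WAL position, modeled on the pairs printed
 * in the XLogFlush message: (log file number, offset in that file). */
typedef struct {
    uint32_t xlogid;   /* log file number (assumed meaning) */
    uint32_t xrecoff;  /* byte offset within that log file (assumed meaning) */
} WalPos;

/* Returns true when position a is strictly before position b. */
static bool wal_pos_before(WalPos a, WalPos b)
{
    if (a.xlogid != b.xlogid)
        return a.xlogid < b.xlogid;
    return a.xrecoff < b.xrecoff;
}

/* Hypothetical sanity check: a flush request is acceptable only if it
 * does not point past what has been written so far. */
static bool flush_request_is_sane(WalPos request, WalPos written)
{
    return !wal_pos_before(written, request);
}
```

With the values from Denis's log, request (1494286336, 786458) against
(23, 2432317444), this check fails because the request's first field is
far beyond 23, which is why the LSN looks bogus.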

Denis, please help us to fix this bug. Some GDB-ing will probably be
required. If you don't have enough time/disk resources but are able to
give us a copy of the data-dir, that would be great (I have RedHat 7.?
and Solaris 2.6 hosts, Tom ?). In any case, the debug output is the
first thing I'd like to see. If it's big, please send it to Tom and me
only. And of course you can contact me in Russian -:)

Vadim
