data loss due to improper handling of postmaster ....

From: "Rajesh Kumar Mallah(dot)" <mallah(at)trade-india(dot)com>
To: <pgsql-admin(at)postgresql(dot)org>, Bojan Belovic <bbelovic(at)usa(dot)net>
Cc: pgsql-sql(at)postgresql(dot)org, bikky(at)hotmail(dot)com
Subject: data loss due to improper handling of postmaster ....
Date: 2002-05-08 06:15:37
Message-ID: 200205081145.37254.mallah@trade-india.com
Lists: pgsql-admin pgsql-sql

Hi folks,

This is a long email.
I too recently lost 11 hrs of data.
I am running the most recent PostgreSQL, 7.2.1, on RedHat 6.2.

But my case was a bit different, and I feel my wrong handling
of the situation was also responsible for it.

I would be grateful if someone could tell me what should have
been done *instead* to prevent the data loss.

As far as I remember, the following is the post mortem:

The load average of my database server had reached 5.15 and my website had
become sluggish, so I decided to stop the postmaster and start it again.
(I don't know if it was the right thing to do, but it was intuitive to me.)

So I did:

# su - postgres
# pg_ctl stop <-- did not work out
It said the postmaster could not be stopped.

# pg_ctl stop -m immediate
It said the postmaster was stopped,
but that was wrong: ps auxwww still showed some processes running.

# pg_ctl -l /var/log/pgsql start
It said it had started successfully (but in reality it had not).

At this point the postmaster was neither dead nor running, and essentially
my live website was down, so under pressure I decided to reboot the system
and told my ISP to do so.
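
One possible shape for a more cautious restart, sketched under assumptions:
it relies only on pg_ctl's documented shutdown modes (smart, fast) and the
log path used above. The function names and the PG_CTL variable are
hypothetical. The sketch deliberately stops short of "immediate" mode and
never starts a new postmaster unless the old one reported a clean stop.

```shell
# Hypothetical helpers; PG_CTL can be overridden to point at another binary.
PG_CTL=${PG_CTL:-pg_ctl}

# Try a polite shutdown first, then a fast one. "smart" waits for clients
# to disconnect; "fast" rolls back open transactions and disconnects them.
stop_postmaster() {
    for mode in smart fast; do
        if "$PG_CTL" stop -m "$mode"; then
            echo "stopped (mode: $mode)"
            return 0
        fi
    done
    echo "postmaster still running; not escalating automatically" >&2
    return 1
}

# Start only after a successful stop, so a new postmaster is never
# launched while the old one is half-alive.
restart_postmaster() {
    stop_postmaster && "$PG_CTL" start -l /var/log/pgsql
}
```

Run as the postgres user, restart_postmaster would then either restart
cleanly or leave the running postmaster alone for manual inspection
(for example with ps auxwww) before anything more drastic is tried.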

But even the reboot was not smooth. The Unix admin at my ISP said
some process would not let the system reboot (and it was the postmaster),
so he had to power cycle the machine, and the machine ran fsck
on startup.

As a result I too got messages similar to the ones Bojan has given below,
and my website could not connect to the database.
It kept saying "database in recovery mode.... "

Then I did "pg_ctl stop" and then a start, but nothing worked.

Since it was my production database, I had to restore it
in minimum time, so I used my old backup, which was 11 hrs old;
hence a major data loss.

I strongly believe PostgreSQL is the best open source database
around and is *safe* unless fiddled with in the wrong manner.

But there are problems in using it.

Due to the current lack of built-in failover and replication solutions in
PostgreSQL, people like me tend to become desperate, because one cannot
keep a webserver down for long, and as a result we take wrong steps.

For mere mortals like me there should be a set of guidelines for safe
handling of the server (DOs and DON'Ts type) to prevent
DATA LOSS.

Also, I would like suggestions on how to live with PostgreSQL,
given its current limitations in replication (or failover solutions),
without data loss.

What I currently do is back up my database with pg_dump, but there are
problems with it.

Because of the large size of my database, pg_dump takes
20-30 mins and the server load increases. This means
I cannot run it very frequently on my production server,
so in the worst case I still lose between 1 and 24 hrs of data,
depending on the frequency of pg_dump.
And for many of us even 1 hour of data is *quite* a loss.
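
As a rough illustration of the pg_dump routine described above, here is a
minimal sketch that writes a timestamped, compressed dump at low CPU
priority. The BACKUP_DIR path, the PG_DUMP variable, and the function name
are assumptions, not part of my actual setup. Note that nice only lowers
CPU priority; it does not reduce the disk I/O that a dump causes.

```shell
# Hypothetical settings; adjust for the local installation.
PG_DUMP=${PG_DUMP:-pg_dump}
BACKUP_DIR=${BACKUP_DIR:-/var/backups/pgsql}

# Dump one database into $BACKUP_DIR/<db>-<timestamp>.sql.gz
backup_database() {
    db=$1
    out="$BACKUP_DIR/$db-$(date +%Y%m%d-%H%M).sql"
    # Dump to a temporary name first so a failed dump never looks complete.
    if nice -n 19 "$PG_DUMP" "$db" > "$out.tmp"; then
        mv "$out.tmp" "$out" && gzip -f "$out"
    else
        rm -f "$out.tmp"
        return 1
    fi
}
```

A call such as "backup_database mydb" (mydb being a placeholder name) could
be scheduled from cron at whatever interval the server load allows.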

I would also like comments on the usability of the USOGRES / RSERV
replication systems with PostgreSQL 7.2.1.

Hoping to get some tips from the intellectuals out here.

regds
mallah.

On Tuesday 07 May 2002 07:52 pm, Bojan Belovic wrote:
> My database apparently crashed - don't know how or why. It happened in the
> middle of the night so I wasn't around to troubleshoot it at the time. It
> looks like it died during the scheduled vacuum.
>
> Here's the log that gets generated when I attempt to bring it back up:
>
> postmaster successfully started
> DEBUG: database system shutdown was interrupted at 2002-05-07 09:35:35 EDT
> DEBUG: CheckPoint record at (10, 1531023244)
> DEBUG: Redo record at (10, 1531023244); Undo record at (10, 1531022908);
> Shutdown FALSE
> DEBUG: NextTransactionId: 29939385; NextOid: 9729307
> DEBUG: database system was not properly shut down; automatic recovery in
> progress...
> DEBUG: redo starts at (10, 1531023308)
> DEBUG: ReadRecord: record with zero len at (10, 1575756128)
> DEBUG: redo done at (10, 1575756064)
> FATAL 2: write(logfile 10 seg 93 off 15474688) failed: Success
> /usr/bin/postmaster: Startup proc 1339 exited with status 512 - abort
>
> Any suggestions? What are my options, other than doing a complete restore
> of the DB from a dump (which is not really an option as the backup is not
> as recent as it should be).
>
> Thanks!
>
> Bojan
