Re: Server unreliability

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: "Marc G(dot) Fournier" <scrappy(at)postgresql(dot)org>
Cc: PostgreSQL www <pgsql-www(at)postgresql(dot)org>, PostgreSQL advocacy <pgsql-advocacy(at)postgresql(dot)org>
Subject: Re: Server unreliability
Date: 2004-09-29 20:41:11
Message-ID: 200409292041.i8TKfBH20753@candle.pha.pa.us
Lists: pgsql-advocacy pgsql-www

Marc G. Fournier wrote:
> On Wed, 29 Sep 2004, Bruce Momjian wrote:
>
> > It is my opinion that we have to make major changes in the way we
> > provide hosting for our servers. There are several problems:
> >
> > o Location of servers
> >
> > The location of our servers in Panama is a problem. They are too far
> > for any PostgreSQL maintainers to access. Changing hardware or
> > diagnosing problems has been too hard. I have had like 2 days of
> > downtime on my home machine in the past 12 years. We have had more than
> > 2 days of downtime in the past 6 months. My wife would not accept such
> > a reliability level.
>
> This is currently being worked on ... we are looking at various remote
> management solutions so that we don't have to deal with waiting for a
> technician to get 'on the scene' ...

OK.

> > o FreeBSD
> >
> > The use of FreeBSD jails can cause servers to take +8 hours to fsck on a
> > server crash or power failure. Again, I would never accept such
> > problems on my home server so it is hard to fathom how a project with
> > thousands of users can accept that. Either we need to find a fix, stop
> > using jails, or get another operating system, but continuing to use a
> > setup with a known problem is just asking for trouble.
>
> Actually, again, this one is being addressed ... there is a solution in
> the pipeline to fix the cause of the 8+ hour fsck, but, since it is a fix
> to fsck itself, it hasn't been put into the mainstream code yet, due to
> *obvious* testing reasons ...

Even if it is experimental, I think we should try it. But the larger
issue is why we are having OS crashes in the first place. Not all the
fsck downtime was because of power failure. What are the causes of the
other downtime, and if we don't know or can't fix it, I think we need to
look at more dramatic changes to increase reliability.

> We've also added in hot failover as an option ... I've posted to -www
> asking about putting www.postgresql.org onto it, but so far, the only
> responses back have been along the lines of 'how are you doing it?' ...
>
> The "risk" is that it's not real-time replication between the live and
> failover server ... on our high performance servers, the 'delay' is about
> 5 minutes ...
>
> ... now, knowing that, if you feel comfortable with me putting this onto
> the mailing lists/cvsroot as well, knowing that there is a possibility of
> something being written 'in the gap' before failover, I'll do that VM also
> ...
>
> note that although the replication has a gap, the heartbeat process runs
> every minute ... as soon as it can't ping anymore, it fails over ...

Right, failover is nice, but we need to find the cause of why we need
the failover so often. From my perspective, if your servers can't be as
reliable as my home machine, there is something fundamentally wrong,
either in hardware purchase, hosting provider, operating system,
software infrastructure, something. I don't know what the problem is,
but I know a problem when I see it.

> > o Web site
> >
> > We have been talking about a new web page layout for years at this
> > point. I almost don't care if they just put a dancing bear up on the
> > web site. Let's do something!
>
> What's wrong with the existing one? Have you designed the dancing bear
> you'd like us to put up in place of what we have now?

Looking around now. Perhaps a dancing elephant. WARNING: This will
make you ill:

http://janetskiles.com/ART/greeting/greet-ani/dancing-elephant.jpg

:-)

> > The archives situation is a continual problem. Again, maybe a dancing
> > bear can help. :-)
>
> What is wrong with it now? I'm cleaning up the code itself, but that is
> due to it being a mess right now, not due to any problems reported to me,
> removed one of the banner ads so that loading is a bit faster, and John
> has done, I think, a fantastic job on the search engine itself, including
> sending me changes for the archives themselves so that the 'time searches'
> should now work properly ...
>
> So, do you have something specific you'd like to point out to us that
> we've overlooked and haven't fixed yet?

"It not working" seems to be a continual problem. I don't know the
details and maybe it is already being fixed.

> > Basically, with no money and no one offering servers
>
> So far, I've had one person donate $10 ... in order to put a dedicated
> server onto the network, I'd need a lot more of those ... that would pretty
> much eliminate your second point about the fsck's, since it's only our
> *loaded* servers, that we have that problem with ... but, as I said, the
> fsck issue is being addressed as well ...

Yep, that is pretty pathetic.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
