Re: Infrastructure monitoring

From: "Magnus Hagander" <mha(at)sollentuna(dot)net>
To: "Marc G(dot) Fournier" <scrappy(at)postgresql(dot)org>, "Josh Berkus" <josh(at)agliodbs(dot)com>
Cc: "John Hansen" <john(at)geeknet(dot)com(dot)au>, <pgsql-www(at)postgresql(dot)org>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Subject: Re: Infrastructure monitoring
Date: 2006-01-14 11:16:25
Message-ID: 6BCB9D8A16AC4241919521715F4D8BCE92E9BA@algol.sollentuna.se
Lists: pgsql-www

> >> Search has been down for at least 2 days now, and this certainly
> >> isn't the first time it's happened. There have also been cases of
> >> archives getting stuck, and probably other outages besides those
> >> that went on until someone emailed about it.
> >>
> >> Would it be difficult to setup something to monitor these various
> >> services? I know there's at least one OSS tool to do it, though I
> >> have no idea how hard it would be to tie that into the current
> >> infrastructure.
> >
> > We have an open offer of Hyperic licenses, and they support
> > FreeBSD now.
>
> Not to discount the offer ... but what exactly would that
> provide us? We already monitor the *servers*; it's what is
> inside of the servers that needs better monitoring ...
> Knowing nothing about Hyperic, does it provide something for that?

I assume you're talking about the nagios monitoring? Or are there
perhaps multiple sets of monitoring by now? (Dave has a nagios
installation up, at least.)

We could easily extend that to monitor in much more detail. It's just
that someone has to define what we need to monitor. And in either case,
I see no reason we should require commercial software to do it - that
still needs a definition of what has to be monitored. Let's stick to
open source when we can...
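
To give an idea of how little is needed, a nagios check is just a small
script that prints a status line and exits with the right code (0 OK,
1 warning, 2 critical). A minimal sketch, assuming a hypothetical search
URL and a made-up marker string - the shape of it, not the real setup:

  #!/usr/bin/env python
  # Hypothetical nagios plugin: run a search query and make sure we get
  # an actual results page back, not just a listening port.
  # The URL and the marker string are assumptions.
  import socket, sys, urllib2

  socket.setdefaulttimeout(30)
  URL = "http://search.postgresql.org/search?q=vacuum"  # hypothetical

  try:
      page = urllib2.urlopen(URL).read()
  except Exception, e:
      print "CRITICAL: search did not answer: %s" % e
      sys.exit(2)
  if "results" not in page.lower():
      print "CRITICAL: search answered but returned no results page"
      sys.exit(2)
  print "OK: search is answering queries"
  sys.exit(0)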

BTW, we already do content monitoring on the actual website mirrors. If
a mirror does not answer, *or* does not update properly, it is
automatically removed from the DNS record, and thus drops out of
"public view" within 10-30 minutes.

> In the case of the archives, for instance, the problem was a
> perl process that for some unknown reason got stuck randomly
> ... removed that in favor of an awk script, and it hasn't
> done it since ... I also redirected cron's email to
> scrappy(at)postgresql(dot)org, so that any errors show up in my
> mailbox instead of root's, so I get an hourly reminder that
> things are running well ...

Right. An easy way to enhance this would be to have the update script
touch a timestamp file somewhere on the system when it's done, and then
monitor that file using existing tools (the file should be accessible
through http://archives.postgresql.org/, the same way it is for the
general website). Then you can just define a "can get <nn> minutes out
of sync before we scream" threshold..
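
A rough sketch of what that check could look like (the URL, file name
and thresholds are all made up here; <nn> would come from the service
definition):

  #!/usr/bin/env python
  # Hypothetical freshness check: fetch a timestamp file over HTTP and
  # scream when the archives update script hasn't touched it lately.
  # The URL and the thresholds are assumptions, not the real setup.
  import socket, sys, time, urllib2

  socket.setdefaulttimeout(30)
  URL = "http://archives.postgresql.org/last-update.txt"  # hypothetical
  WARN, CRIT = 60 * 60, 3 * 60 * 60  # seconds out of sync before we scream

  try:
      stamp = int(urllib2.urlopen(URL).read().strip())  # epoch seconds
  except Exception, e:
      print "CRITICAL: could not fetch %s: %s" % (URL, e)
      sys.exit(2)

  age = int(time.time() - stamp)
  if age > CRIT:
      print "CRITICAL: archives %d seconds out of date" % age
      sys.exit(2)
  elif age > WARN:
      print "WARNING: archives %d seconds out of date" % age
      sys.exit(1)
  print "OK: archives updated %d seconds ago" % age
  sys.exit(0)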

> In the case of search ... John would be better at answering
> that, but when he and I talked this past week, he mentioned
> that he was moving it all over to two new servers, which I
> changed the DNS for on Wednesday ...

What I think would be good in cases like this is just information -
AFAIK nobody on the web team knew the servers were being moved. (I may
be wrong here - I know I didn't know, and I also spoke to Dave about it,
but those are the only ones I polled. Anyway, -www should know.)

That would also make it possible to do the standard fiddling with DNS
TTLs to make the problem much smaller - lower the TTL from hours to a
few minutes ahead of the move, so clients pick up the new address almost
immediately once the records are switched, and raise it again afterwards.

//Magnus
