Re: Streaming replication status

From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication status
Date: 2010-01-12 22:54:35
Message-ID: 4B4CFDAB.1070901@kaltenbrunner.cc
Lists: pgsql-hackers

Greg Smith wrote:
> Bruce Momjian wrote:
>> Right, so what is the risk of shipping without any fancy monitoring?
>>
>
> You can monitor the code right now by watching the output shown in the
> ps display and by trolling the database logs. If I had to I could build
> a whole monitoring system out of those components, it would just be very
> fragile. I'd rather see one or two very basic bits of internals exposed
> beyond those to reduce that effort. I think it's a stretch to say that
> request represents a design change; a couple of UDFs to expose some
> internals is all I think it would take to dramatically drop the amount
> of process/log scraping required here to support a SR system.

So is there an actual concrete proposal for _what_ internals to expose?

>
> I guess the slightly more ambitious performance monitoring bits that
> Simon was suggesting may cross the line as being too late to implement
> now though (depends on how productive the people actually coding on this
> are I guess), and certainly the ideas thrown out for implementing any
> smart behavior or alerting when replication goes bad like Josh's
> "archiving_lag_action" seem past the deadline to get addressed
> now--even though I agree with the basic idea.

I'm not convinced that embedding actual alerting functionality in the
database is a good idea. Any reasonable production deployment is
probably using a dedicated monitoring and alerting system that
aggregates and qualifies all monitoring results (and handles proper
rate limiting and the like), and that just needs a way to read in basic
data. At first glance, something like archiving_lag_action sounds like
an invitation to do a send_mail_to_admin() thingy, which is really the
wrong way to approach monitoring in large-scale environments.
The database needs to provide very basic information like "we are 10min
behind in replication" or "3 WAL files behind"; the decision whether any
of that is an actual issue should be left to the monitoring system.
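
To make that split concrete, here is a rough sketch of the monitoring
side, assuming the database only hands over two raw numbers. The
ReplicationLag fields, the classify() helper and all threshold values
are made up for illustration; none of them exist in PostgreSQL or in
any monitoring product.

```python
from dataclasses import dataclass


@dataclass
class ReplicationLag:
    """Raw numbers the database would expose (hypothetical fields)."""
    seconds_behind: float
    wal_files_behind: int


def classify(lag, warn_seconds=300, crit_seconds=900, crit_wal_files=10):
    """Monitoring-side policy: the thresholds live here, not in the database.

    All threshold defaults are arbitrary example values.
    """
    if lag.seconds_behind >= crit_seconds or lag.wal_files_behind >= crit_wal_files:
        return "CRITICAL"
    if lag.seconds_behind >= warn_seconds:
        return "WARNING"
    return "OK"


# "10min behind in replication" and "3 WAL files behind"
status = classify(ReplicationLag(seconds_behind=600, wal_files_behind=3))
```

With these example thresholds the same raw numbers could be a warning
at one site and perfectly fine at another, which is why the policy
belongs outside the database.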

Stefan
