Re: Function to track shmem reinit time

From: Grigory Smolkin <g(dot)smolkin(at)postgrespro(dot)ru>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Function to track shmem reinit time
Date: 2018-03-12 18:51:38
Message-ID: 6f955229-f46e-c4a1-beb9-535236b67ca6@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On 03/03/2018 09:00 PM, Peter Eisentraut wrote:
> I find this premise a bit dubious. Why have a log file if it's too big
> to find anything in it? Server crashes aren't the only thing people are
> interested in. So we'll need a function for "last $anything".

Thank you for your interest in this topic.
Well, on heavy loaded machine log file will be big, because you usually
want to log every query for later analysis, and, because postgres is
dumping everything in a single file, searching for some specific error
will be slow. Log file is certainly needed for digging the details of a
crash, but pg_shmem_init_time is more about online monitoring.

On 03/03/2018 09:43 PM, Justin Pryzby wrote:

> I think one can tell if it's crashed recently by comparing start time of parent
> postmaster and its main children (I'm going to go put this in place for myself
> now).
>
> ts=# SELECT backend_type, backend_start, pg_postmaster_start_time(), backend_start-pg_postmaster_start_time() FROM pg_stat_activity ORDER BY backend_start LIMIT 1;
> backend_type | backend_start | pg_postmaster_start_time | ?column?
> ---------------------+-------------------------------+-------------------------------+-----------------
> autovacuum launcher | 2018-03-02 00:21:11.604468-03 | 2018-03-02 00:12:46.757642-03 | 00:08:24.846826

Thank you, though we must take into account the fact that autovacuum may
not be enabled so it`s better to use checkpointer or bgwriter. It`s
working solution, yes, but it looks more like workaround and requires
from user an understanding of postgres internals.

On 03/04/2018 06:56 PM, Tomas Vondra wrote:

> Can you please explain why pg_postmaster_start_time can't be used for
> this purpose? It seems like a pretty good match, considering it's meant
> to show server start time.

Because a backend crash do not reset pg_postmaster_start_time.

On 03/04/2018 07:02 PM, Tomas Vondra wrote:
> On 02/28/2018 01:11 PM, Anastasia Lubennikova wrote:
>> Attached patch introduces a new function pg_shmem_init_time(),
>> which returns the time shared memory was last (re)initialized.
>> It is created for use by monitoring tools to track backend crashes.
>>
>> Currently, if the 'restart_after_crash' option is on, postgres will
>> just restart. And the only way to know that it happened is to
>> regularly parse logfile or monitor it, catching restart messages.
>> This approach is really inconvenient for users, who have gigabytes of
>> logs.
>>
>> This new function can be periodiacally called by a monitoring agent,
>> and, if /shmem_init_time/ doesn't match /pg_postmaster_start_time,/
>> we know that server crashed-restarted, and also know the exact time,
>> when.
>>
> I don't think it really solves the problem, though. For example if the
> whole VM reboots (which can be a matter of seconds), this check will say
> "shmem_init_time == pg_postmaster_start_time" and you've not detected
> anything.
>
> IMHO pg_postmaster_start_time is the right way to monitor uptime, and
> the right way to detect spurious restarts is to remember the last value
> you've seen and compare it to the current one.

Yes, for the described case with VM restart pg_postmaster_start_time
works fine.
pg_postmaster_start_time is essential for almost every case and
pg_shmem_init_time only expand his usefulness, not diminish.

On 03/04/2018 07:09 PM, Tom Lane wrote:
> It evidently depends on how you want to define "server uptime". If you
> get backend crashes often enough, you might feel a need to define it
> as "time since last crash". Although I would think that if that's
> happening regularly in a production environment, you have a problem
> you need to fix, not just measure.

Absolutely. And for that you need to know ASAP that backend crash
happened in the first place.

On 03/04/2018 07:09 PM, Tom Lane wrote:
> My own thought about that is that if you are trying to measure
> backend crashes, just knowing the time of the latest one is little help.
> You want to know how often they're happening. So this gets back to the
> question of why the postmaster log isn't a useful source of that
> information.

For that purpose pg_shmem_init_time also can be used.
For example we can build a chart based on values from "select (now() -
pg_shmem_init_time());" taken, for example, every 10 seconds.
Values around 0 =< x =<10 will signal about memory reinitialization
which is usually a byproduct of a backend crash.

On 03/04/2018 07:09 PM, Tom Lane wrote:
> I think that if we're to do anything in this area,
> improving the usefulness of the log would be more important than
> providing the proposed function.

The separation of maintenance messages and query messaged into different
log files is sorely needed.
This way server errors can be identified fast and in convenient way.
But as I mentioned earlier, pg_shmem_init_time() is about online monitoring.

On 03/04/2018 07:02 PM, Tomas Vondra wrote:
> Actually, after looking at the code a bit, I think that test would not
> really work anyway, because those two timestamps come from two separate
> GetCurrentTimestamp calls, so you get this:
>
> test=# select pg_shmem_init_time(), pg_postmaster_start_time();
> pg_shmem_init_time | pg_postmaster_start_time
> -------------------------------+-------------------------------
> 2018-03-04 17:10:59.017058+01 | 2018-03-04 17:10:59.022071+01
> (1 row)
>
> test=# select pg_shmem_init_time() = pg_postmaster_start_time();
> ?column?
> ----------
> f
> (1 row)
>
> So there would have to be some sort of "fuzz" parameter when comparing
> the values, etc.

The difference measured in milliseconds is expected. After all, shared
memory is initialized after postmaster start.

--
Grigory Smolkin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Euler Taveira 2018-03-12 18:52:47 Re: Additional Statistics Hooks
Previous Message David Steele 2018-03-12 18:33:18 Re: PATCH: Unlogged tables re-initialization tests