Re: Add session statistics to pg_stat_database

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
Cc: Ahsan Hadi <ahsan(dot)hadi(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Soumyadeep Chakraborty <soumyadeep2007(at)gmail(dot)com>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Add session statistics to pg_stat_database
Date: 2020-12-01 16:32:21
Message-ID: CABUevEzx9bFWkarv7rc+CCZemG-77y0wHOn8fg3e-OePVS=D-Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Nov 20, 2020 at 3:41 PM Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
wrote:

> On Tue, 2020-11-17 at 17:33 +0100, Magnus Hagander wrote:
> > I've taken a look as well, and here are a few short notes:
>
> Much appreciated!
>

Sorry about the delay in getting back to you on this one. FYI, while the
patch has been bumped to the next CF by now, I do intend to continue
working on it before that starts.

> * It talks about "number of connections" but "number of aborted
> sessions". We should probably
> > be consistent about talking either about connections or sessions? In
> particular, connections
> > seems wrong in this case, because it only starts counting after
> authentication is complete
> > (since otherwise we send no stats)? (This goes for both docs and
> actual function names)
>
> Yes, that is true. I have changed "connections" to "sessions" and renamed
> the new
> column "connections" to "session_count".
>
> I think that most people will understand a session as started after a
> successful
> connection.
>

Yeah, I agree, and as long as it's consistent we don't need more
explanations than that.

Further int he views, it's a bit strange to have session_count and
aborted_session, but I'm not sure what to suggest. "aborted_session_count"
seems too long. Maybe just "sessions" instead of "session_count" -- no
other counters actually have the "_count" suffix.

> * Is this actually a fix that's independent of the new stats? It seems in
> general to be
> > changing the behaviour of "force", which is more generic?
> > - !have_function_stats)
> > + !have_function_stats && !force)
>
> The comment right above that reads:
> /* Don't expend a clock check if nothing to do */
> So it is just a quick exit if there is nothing to do.
>
> But with that patch we have something to do if "force" (see below) is true:
> Report the remaining session duration and if the session was closed
> normally.
>
> Thus the additional check.
>

Ah yeah, makes sense. It becomes more clear with the rename.

> * in pgstat_send_connstat() you pass the parameter "force" in as
> "disconnect".
> > That behaviour at least requires a comment saying why, I think. My
> understanding is
> > it relies on that "force" means this is a "backend is shutting down",
> but that is not
> > actually documented anywhere. Maybe the "force" parameter should
> actually be renamed
> > to indicate this is really what it means, to avoid a future mistake in
> the area?
> > But even with that, how does that turn into disconnect?
>
> "pgstat_report_stat(true)" is only called from "pgstat_beshutdown_hook()",
> so
> it is currently only called when the backend is about to exit.
>
> According the the comments the flag means that "caller wants to force
> stats out".
> I guess that the author thought that there may arise other reasons to
> force sending
> statistics in the future (commit 641912b4d from 2007).
>
> However, since that has not happened, I have renamed the flag to
> "disconnect" and
> adapted the documentation. This doesn't change the current behavior, but
> establishes
> a new rule.
>

That makes it a lot more clear. And I agree, if nobody came up with a
reason since 2007, then we are free to repurpose it :)

> * Maybe rename pgStatSessionDisconnected to
> pgStatSessionNormalDisconnected?
> > To avoid having to go back to the setting point and look it up in a
> comment.
>
> Long, descriptive names are a good thing.
> I have decided to use "pgStatSessionDisconnectedNormally", since that is
> even longer
> and seems to fit the "yes or no" category better.
>

WFM.

> I wonder if there would also be a way to count "sessions that crashed" as
> well.
> > That is,the ones that failed in a way that caused the postmaster to
> restart the system.
> > But that's information we'd have to send from the postmaster, but I'm
> actually unsure
> > if we're "allowed" to send things to the stats collector from the
> postmaster.
> > But I think it could be quite useful information to have. Maybe we can
> find some way
> > to piggyback on the fact that we're restarting the stats collector as a
> result?
>
> Sure, a crash count would be useful. I don't know if it is easy for the
> stats collector
> to tell the difference between a start after a backend crash and - say -
> starting from
> a base backup.
>
> Patch v6 attached.
>
> I think that that would be material for another patch, and I don't think
> it should go
> to "pg_stat_database", because a) it might be hard to tell to which
> database the crashed
> backend was attached, b) it might be a background process that doesn't
> belong to a database
> and c) if the crash were caused by - say - corruption in a shared catalog,
> it would be
> misleading

I'm not sure it is outside the scope of this patch, because I think it
might be easier to do than I (and I think you) first thought. We don't need
to track which database crashed -- if we track all *other* ways a database
exits, then crashes are all that remains.

So in fact, we *almost* have all the data we need already. We have the
number of sessions started. We have the number of sessions "aborted". if we
also had the number of sessions that were closed normally, then whatever is
"left" would be the number of sessions crashed. And we do already, in your
patch, send the message in the case of both aborted and non-aborted
sessions. So we just need to keep track of both in the statsfile (which we
don't now), and we'd more or less have it, wouldn't we?

However, some thinking around that also leads me to another question which
is very much in scope for this patch regardless, which is what about
shutdown and admin termination. Right now, when you do a "pg_ctl stop" on
the database, all sessions count as aborted. Same thing for a
pg_terminate_backend(). I wonder if this is also a case that would be
useful to track as a separate thing? One could argue that the docs in your
patch say aborted means "terminated by something else than a regular client
disconnection". But that's true for a "shutdown", but not for a crash, so
whichever way we go with crashes it's slightly incorrect.

But thinking from a usability perspective, wouldn't what we want more be
something like <closed by correct disconnect>, <closed by abnormal
disconnect>, <closed by admin>, <crash>?

What do you think of adapting it to that?

Basically, that would change pgStatSessionDisconnectedNormally into instead
being an enum of reasons, which could be normal disconnect, abnormal
disconnect and admin. And we'd track all those three as separate numbers in
the stats file, meaning we could then calculate the crash by subtracting
all three from the total number of sessions?

(Let me know if you think the idea could work and would prefer it if I
worked up a complete suggestion based on it rather than just spitting ideas)

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Chapman Flack 2020-12-01 16:34:19 Re: Confusing behavior of psql's \e
Previous Message Laurenz Albe 2020-12-01 16:21:11 Re: Confusing behavior of psql's \e