Re: Resetting spilled txn statistics in pg_stat_replication

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Resetting spilled txn statistics in pg_stat_replication
Date: 2020-09-07 06:24:23
Message-ID: CAA4eK1JBqQh9cBKjO-nKOOE=7f6ONDCZp0TJZfn4VsQqRZ+uYA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jul 23, 2020 at 11:46 AM Masahiko Sawada
<masahiko(dot)sawada(at)2ndquadrant(dot)com> wrote:
>
> I've updated the patch so that the stats collector ignores the
> 'update' message if the slot stats array is already full.
>

This patch needs a rebase. I don't see this patch in the CF app. I
hope you are still interested in working on this.

Review comments:
===============
1.
+CREATE VIEW pg_stat_replication_slots AS
+ SELECT
+ s.name,
+ s.spill_txns,
+ s.spill_count,
+ s.spill_bytes
+ FROM pg_stat_get_replication_slots() AS s;

The view pg_stat_replication_slots should have a column 'stats_reset'
(datatype: timestamp with time zone) as we provide a facitily to reset
the slots. A similar column exists in pg_stat_slru as well, so is
there a reason of not providing it here?

2.
+ </para>
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-wal-receiver-view">
<title><structname>pg_stat_wal_receiver</structname></title>

It is better to keep one empty line between </para> and </sect2> to
keep it consistent with the documentation of other views.

3.
<primary>pg_stat_reset_replication_slot</primary>
+ </indexterm>
+ <function>pg_stat_reset_replication_slot</function> (
<type>text</type> )
+ <returnvalue>void</returnvalue>
+ </para>
+ <para>
+ Resets statistics to zero for a single replication slot, or for all
+ replication slots in the cluster. If the argument is NULL,
all counters
+ shown in the
<structname>pg_stat_replication_slots</structname> view for
+ all replication slots are reset.
+ </para>

I think the information about the parameter for this function is not
completely clear. It seems to me that it should be the name of the
slot for which we want to reset the stats, if so, let's try to be
clear.

4.
+pgstat_reset_replslot_counter(const char *name)
+{
+ PgStat_MsgResetreplslotcounter msg;
+
+ if (pgStatSock == PGINVALID_SOCKET)
+ return;
+
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETREPLSLOTCOUNTER);
+ if (name)
+ {
+ memcpy(&msg.m_slotname, name, NAMEDATALEN);
+ msg.clearall = false;
+ }

Don't we want to verify here or in the caller of this function whether
the passed slot_name is a valid slot or not? For ex. see
pgstat_reset_shared_counters where we return an error if the target is
not valid.

5.
+static void
+pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg,
+ int len)
+{
+ int i;
+ int idx = -1;
+ TimestampTz ts;
+
+ if (!msg->clearall)
+ {
+ /* Get the index of replication slot statistics to reset */
+ idx = pgstat_replslot_index(msg->m_slotname, false);
+
+ if (idx < 0)
+ return; /* not found */

Can we add a comment to describe when we don't expect to find the slot
here unless there is no way that can happen?

6.
+pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg,
+ int len)
{
..
+ for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
..
}

I think here we need to traverse till nReplSlotStats, not SLRU_NUM_ELEMENTS.

7. Don't we need to change PGSTAT_FILE_FORMAT_ID for this patch? We
can probably do at the end but better to change it now so that it
doesn't slip from our mind.

8.
@@ -5350,6 +5474,23 @@ pgstat_read_statsfiles(Oid onlydb, bool
permanent, bool deep)

break;

+ /*
+ * 'R' A PgStat_ReplSlotStats struct describing a replication slot
+ * follows.
+ */
+ case 'R':
+ if (fread(&replSlotStats[nReplSlotStats], 1,
sizeof(PgStat_ReplSlotStats), fpin)
+ != sizeof(PgStat_ReplSlotStats))
+ {
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"",
+ statfile)));
+ memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
+ goto done;
+ }
+ nReplSlotStats++;
+ break;

Both here and in pgstat_read_db_statsfile_timestamp(), the patch
handles 'R' message after 'D' whereas while writing the 'R' is written
before 'D'. So, I think it is better if we keep the order during read
the same as during write.

9. While reviewing this patch, I noticed that in
pgstat_read_db_statsfile_timestamp(), if we fail to read ArchiverStats
or SLRUStats, we return 'false' from this function but OTOH, if we
fail to read 'D' or 'R' message, we will return 'true'. I feel the
handling of 'D' and 'R' message is fine because once we read
GlobalStats, we can return the stats_timestamp. So the other two
stands corrected. I understand that this is not directly related to
this patch but if you agree we can do this as a separate patch.

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2020-09-07 06:29:02 Re: A micro-optimisation for walkdir()
Previous Message Michael Paquier 2020-09-07 06:00:20 Re: clarify "rewritten" in pg_checksums docs