Re: Resetting spilled txn statistics in pg_stat_replication

From: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Resetting spilled txn statistics in pg_stat_replication
Date: 2020-09-08 02:23:10
Message-ID: CA+fd4k7hOvdihZ-mwkzOURY_VGqA4jzMCX5d9HNSG93gQ8f0+Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, 7 Sep 2020 at 15:24, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Thu, Jul 23, 2020 at 11:46 AM Masahiko Sawada
> <masahiko(dot)sawada(at)2ndquadrant(dot)com> wrote:
> >
> > I've updated the patch so that the stats collector ignores the
> > 'update' message if the slot stats array is already full.
> >
>
> This patch needs a rebase. I don't see this patch in the CF app. I
> hope you are still interested in working on this.

Thank you for reviewing this patch!

I'm still going to work on this patch although I might be slow
response this month.

>
> Review comments:
> ===============
> 1.
> +CREATE VIEW pg_stat_replication_slots AS
> + SELECT
> + s.name,
> + s.spill_txns,
> + s.spill_count,
> + s.spill_bytes
> + FROM pg_stat_get_replication_slots() AS s;
>
> The view pg_stat_replication_slots should have a column 'stats_reset'
> (datatype: timestamp with time zone) as we provide a facitily to reset
> the slots. A similar column exists in pg_stat_slru as well, so is
> there a reason of not providing it here?

I had missed adding the column. Fixed.

>
> 2.
> + </para>
> + </sect2>
> +
> + <sect2 id="monitoring-pg-stat-wal-receiver-view">
> <title><structname>pg_stat_wal_receiver</structname></title>
>
> It is better to keep one empty line between </para> and </sect2> to
> keep it consistent with the documentation of other views.

Fixed.

>
> 3.
> <primary>pg_stat_reset_replication_slot</primary>
> + </indexterm>
> + <function>pg_stat_reset_replication_slot</function> (
> <type>text</type> )
> + <returnvalue>void</returnvalue>
> + </para>
> + <para>
> + Resets statistics to zero for a single replication slot, or for all
> + replication slots in the cluster. If the argument is NULL,
> all counters
> + shown in the
> <structname>pg_stat_replication_slots</structname> view for
> + all replication slots are reset.
> + </para>
>
> I think the information about the parameter for this function is not
> completely clear. It seems to me that it should be the name of the
> slot for which we want to reset the stats, if so, let's try to be
> clear.

Fixed.

>
> 4.
> +pgstat_reset_replslot_counter(const char *name)
> +{
> + PgStat_MsgResetreplslotcounter msg;
> +
> + if (pgStatSock == PGINVALID_SOCKET)
> + return;
> +
> + pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETREPLSLOTCOUNTER);
> + if (name)
> + {
> + memcpy(&msg.m_slotname, name, NAMEDATALEN);
> + msg.clearall = false;
> + }
>
> Don't we want to verify here or in the caller of this function whether
> the passed slot_name is a valid slot or not? For ex. see
> pgstat_reset_shared_counters where we return an error if the target is
> not valid.

Agreed. Fixed.

>
> 5.
> +static void
> +pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg,
> + int len)
> +{
> + int i;
> + int idx = -1;
> + TimestampTz ts;
> +
> + if (!msg->clearall)
> + {
> + /* Get the index of replication slot statistics to reset */
> + idx = pgstat_replslot_index(msg->m_slotname, false);
> +
> + if (idx < 0)
> + return; /* not found */
>
> Can we add a comment to describe when we don't expect to find the slot
> here unless there is no way that can happen?

Added.

>
> 6.
> +pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg,
> + int len)
> {
> ..
> + for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
> ..
> }
>
> I think here we need to traverse till nReplSlotStats, not SLRU_NUM_ELEMENTS.

Fixed.

>
> 7. Don't we need to change PGSTAT_FILE_FORMAT_ID for this patch? We
> can probably do at the end but better to change it now so that it
> doesn't slip from our mind.

Yes, changed.

>
> 8.
> @@ -5350,6 +5474,23 @@ pgstat_read_statsfiles(Oid onlydb, bool
> permanent, bool deep)
>
> break;
>
> + /*
> + * 'R' A PgStat_ReplSlotStats struct describing a replication slot
> + * follows.
> + */
> + case 'R':
> + if (fread(&replSlotStats[nReplSlotStats], 1,
> sizeof(PgStat_ReplSlotStats), fpin)
> + != sizeof(PgStat_ReplSlotStats))
> + {
> + ereport(pgStatRunningInCollector ? LOG : WARNING,
> + (errmsg("corrupted statistics file \"%s\"",
> + statfile)));
> + memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
> + goto done;
> + }
> + nReplSlotStats++;
> + break;
>
> Both here and in pgstat_read_db_statsfile_timestamp(), the patch
> handles 'R' message after 'D' whereas while writing the 'R' is written
> before 'D'. So, I think it is better if we keep the order during read
> the same as during write.

Changed the code so that it writes 'R' after 'D'.

>
> 9. While reviewing this patch, I noticed that in
> pgstat_read_db_statsfile_timestamp(), if we fail to read ArchiverStats
> or SLRUStats, we return 'false' from this function but OTOH, if we
> fail to read 'D' or 'R' message, we will return 'true'. I feel the
> handling of 'D' and 'R' message is fine because once we read
> GlobalStats, we can return the stats_timestamp. So the other two
> stands corrected. I understand that this is not directly related to
> this patch but if you agree we can do this as a separate patch.

It seems to make sense to me. We can set *ts and then read both
ArchiverStats and SLRUStats so we can return a valid timestamp even if
we fail to read.

I've attached both patches: 0001 patch fixes the issue you reported.
0002 patch is the patch that incorporated all review comments.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
0002-Track-statistics-for-spilling-of-changes-from-Reorde.patch application/octet-stream 35.6 KB
0001-Return-true-when-failing-to-read-SLRU-or-Archiver-st.patch application/octet-stream 1.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Craig Ringer 2020-09-08 02:27:11 Re: [PATCH] Detect escape of ErrorContextCallback stack pointers (and from PG_TRY() )
Previous Message Junfeng Yang 2020-09-08 01:53:26 回复: Is it possible to set end-of-data marker for COPY statement.