Re: Resetting spilled txn statistics in pg_stat_replication

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Resetting spilled txn statistics in pg_stat_replication
Date: 2020-09-08 13:32:47
Message-ID: CAA4eK1LFkeSA=axwdJ1iv8seZVNddP331znjgxN1vCvG8YjoYA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Sep 8, 2020 at 7:53 AM Masahiko Sawada
<masahiko(dot)sawada(at)2ndquadrant(dot)com> wrote:
>
> On Mon, 7 Sep 2020 at 15:24, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > This patch needs a rebase. I don't see this patch in the CF app. I
> > hope you are still interested in working on this.
>
> Thank you for reviewing this patch!
>
> I'm still going to work on this patch, although I might be slow to
> respond this month.
>

Comments on the latest patch:
=============================
1.
+CREATE VIEW pg_stat_replication_slots AS
+ SELECT
+ s.name,
+ s.spill_txns,
+ s.spill_count,
+ s.spill_bytes,
+ s.stats_reset
+ FROM pg_stat_get_replication_slots() AS s;

You forgot to update the docs for the new parameter.

2.
@@ -5187,6 +5305,12 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
slruStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;

+ /*
+ * Set the same reset timestamp for all replication slots too.
+ */
+ for (i = 0; i < max_replication_slots; i++)
+ replSlotStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
+

I don't understand why you have removed the above code in the new
version of the patch.

3.
pgstat_recv_resetreplslotcounter()
{
..
+ ts = GetCurrentTimestamp();
+ for (i = 0; i < nReplSlotStats; i++)
+ {
+ /* reset entry with the given index, or all entries */
+ if (msg->clearall || idx == i)
+ {
+ /* reset only counters. Don't clear slot name */
+ replSlotStats[i].spill_txns = 0;
+ replSlotStats[i].spill_count = 0;
+ replSlotStats[i].spill_bytes = 0;
+ replSlotStats[i].stat_reset_timestamp = ts;
+ }
+ }
..

I don't like this coding pattern, as in the worst case we need to
traverse all the slots to reset a particular slot. That could be okay
for a fixed number of elements, as we have for SLRU, but here it
appears quite inefficient. We can move the stats-reset part into a
separate function and then invoke it both from the place where we need
to reset a particular slot and from the above place.
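
A minimal sketch of such a helper, assuming the replSlotStats array and
the field names from the patch (the function name here is only a
placeholder):

static void
pgstat_reset_replslot_entry(int i, TimestampTz ts)
{
    /* Reset only the counters; don't clear the slot name. */
    replSlotStats[i].spill_txns = 0;
    replSlotStats[i].spill_count = 0;
    replSlotStats[i].spill_bytes = 0;
    replSlotStats[i].stat_reset_timestamp = ts;
}

Then pgstat_recv_resetreplslotcounter() can call it directly with the
given index for a single slot and loop only in the clearall case.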

4.
+pgstat_replslot_index(const char *name, bool create_it)
{
..
+ replSlotStats[nReplSlotStats].stat_reset_timestamp = GetCurrentTimestamp();
..
}

Why do we need to set the reset timestamp on the creation of slot entry?

5.
@@ -3170,6 +3175,13 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
spilled++;
}

+ /* update the statistics */
+ rb->spillCount += 1;
+ rb->spillBytes += size;
+
+ /* Don't consider already serialized transactions. */
+ rb->spillTxns += rbtxn_is_serialized(txn) ? 0 : 1;

We can't increment spillTxns in the above way because now we sometimes
serialize before streaming, and in that case we clear the serialized
flag after streaming (see ReorderBufferTruncateTXN), so the count can
go wrong. Another problem is that currently the patch calls
UpdateSpillStats only from begin_cb_wrapper, which means it won't
consider streaming transactions (streaming transactions that might
have spilled). If we consider the streamed case along with it, we can
probably keep this counter up-to-date, because at the above place we
can check whether the txn has already been serialized or streamed and,
if so, skip incrementing the counter. I think we need to merge Ajin's
patch for streaming stats [1] and fix the issue. I have not checked
his patch, so it might need a rebase and/or some changes.
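
A rough sketch of that check, assuming the rbtxn_is_streamed() test
from the streaming infrastructure is usable at this point in
ReorderBufferSerializeTXN():

    /* update the statistics */
    rb->spillCount += 1;
    rb->spillBytes += size;

    /*
     * Count the transaction only the first time it is spilled; a txn
     * that has already been serialized or streamed must not be counted
     * again.
     */
    if (!rbtxn_is_serialized(txn) && !rbtxn_is_streamed(txn))
        rb->spillTxns += 1;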

6.
@@ -322,6 +321,9 @@ ReplicationSlotCreate(const char *name, bool db_specific,

/* Let everybody know we've modified this slot */
ConditionVariableBroadcast(&slot->active_cv);
+
+ /* Create statistics entry for the new slot */
+ pgstat_report_replslot(NameStr(slot->data.name), 0, 0, 0);
}
..
..
@@ -683,6 +685,18 @@ ReplicationSlotDropPtr(ReplicationSlot *slot)
ereport(WARNING,
(errmsg("could not remove directory \"%s\"", tmppath)));

+ /*
+ * Report to drop the replication slot to stats collector. Since there
+ * is no guarantee the order of message arrival on an UDP connection,
+ * it's possible that a message for creating a new slot arrives before a
+ * message for removing the old slot. We send the drop message while
+ * holding ReplicationSlotAllocationLock to reduce that possibility.
+ * If the messages arrived in reverse, we would lose one statistics update
+ * message. But the next update message will create the statistics for
+ * the replication slot.
+ */
+ pgstat_report_replslot_drop(NameStr(slot->data.name));
+

Similar to the drop message, why don't we send the create message
while holding ReplicationSlotAllocationLock?
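
To illustrate, something like the sketch below in
ReplicationSlotCreate(), i.e., report the new slot before releasing
the allocation lock (the surrounding lines are paraphrased, not the
exact code):

    /* Create statistics entry for the new slot */
    pgstat_report_replslot(NameStr(slot->data.name), 0, 0, 0);

    /* It's now safe to let others try to allocate a slot. */
    LWLockRelease(ReplicationSlotAllocationLock);

    /* Let everybody know we've modified this slot */
    ConditionVariableBroadcast(&slot->active_cv);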

[1] - https://www.postgresql.org/message-id/CAFPTHDZ8RnOovefzB%2BOMoRxLSD404WRLqWBUHe6bWqM5ew1bNA%40mail.gmail.com

--
With Regards,
Amit Kapila.
