Re: Resetting spilled txn statistics in pg_stat_replication

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Resetting spilled txn statistics in pg_stat_replication
Date: 2020-07-06 11:45:41
Message-ID: CAA4eK1+fHJXaJdSP5fCT_JGiH96U__gzsuWocNJxAAXZ2MDMaw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jul 2, 2020 at 9:01 AM Masahiko Sawada
<masahiko(dot)sawada(at)2ndquadrant(dot)com> wrote:
>
> I've attached PoC patch that implements a simple approach. I'd like to
> discuss how we collect the replication slot statistics in the stats
> collector before I bring the patch to completion.
>
> In this PoC patch, we have the array of PgStat_ReplSlotStats struct
> which has max_replication_slots entries. The backend and wal sender
> send the slot statistics to the stats collector when decoding a commit
> WAL record.
>
> typedef struct PgStat_ReplSlotStats
> {
> char slotname[NAMEDATALEN];
> PgStat_Counter spill_txns;
> PgStat_Counter spill_count;
> PgStat_Counter spill_bytes;
> } PgStat_ReplSlotStats;
>
> What I'd like to discuss are:
>
> Since the unique identifier of replication slots is the name, the
> process sends slot statistics along with slot name to stats collector.
> I'm concerned about the amount of data sent to the stats collector and
> the cost of searching the statistics within the statistics array (it’s
> O(N) where N is max_replication_slots). Since the maximum length of
> slot name is NAMEDATALEN (64 bytes default) and max_replication_slots
> is unlikely a large number, I might be too worrying but it seems like
> it’s better to avoid that if we can do that easily. An idea I came up
> with is to use the index of slots (i.g., the index of
> ReplicationSlotCtl->replication_slots[]) as the index of statistics of
> slot in the stats collector. But since the index of slots could change
> after the restart we need to synchronize the index of slots on both
> array somehow. So I thought that we can determine the index of the
> statistics of slots at ReplicationSlotAcquire() or
> ReplicationSlotCreate(), but it will in turn need to read stats file
> while holding ReplicationSlotControlLock to prevent the same index of
> the statistics being used by the concurrent process who creating a
> slot. I might be missing something though.
>

I don't think we should be bothered about the large values of
max_replication_slots. The default value is 10 and I am not sure if
users will be able to select values large enough that we should bother
about searching them by name. I think if it could turn out to be a
problem then we can try to invent something to mitigate it.

> Second, as long as the unique identifier is the slot name there is no
> convenient way to distinguish between the same name old and new
> replication slots, so the backend process or wal sender process sends
> a message to the stats collector to drop the replication slot at
> ReplicationSlotDropPtr(). This strategy differs from what we do for
> table, index, and function statistics. It might not be a problem but
> I’m thinking a better way.
>

Can we rely on message ordering in the transmission mechanism (UDP)
for stats? The wiki suggests [1] we can't. If so, then this might
not work.

> The new view name is also an open question. I prefer
> pg_stat_replication_slots and to add stats of physical replication
> slots to the same view in the future, rather than a separate view.
>

This sounds okay to me.

> Aside from the above, this patch will change the most of the changes
> introduced by commit 9290ad198b1 and introduce new code much. I’m
> concerned whether such changes are acceptable at the time of beta 2.
>

I think it depends on the final patch. My initial thought was that we
should do this for PG14 but if you are suggesting removing the changes
done by commit 9290ad198b1 then we need to think over it. I could
think of below options:
a. Revert 9290ad198b1 and introduce stats for spilling in PG14. We
were anyway having spilling without any work in PG13 but didn’t have
stats.
b. Try to get your patch in PG13 if we can, otherwise, revert the
feature 9290ad198b1.
c. Get whatever we have in commit 9290ad198b1 for PG13 and
additionally have what we are discussing here for PG14. This means
that spilled stats at slot level will be available in PG14 via
pg_stat_replication_slots and for individual WAL senders it will be
available via pg_stat_replication both in PG13 and PG14. Even if we
can get your patch in PG13, we can still keep those in
pg_stat_replication.
d. Get whatever we have in commit 9290ad198b1 for PG13 and change it
for PG14. I don't think this will be a popular approach.

[1] - https://en.wikipedia.org/wiki/User_Datagram_Protocol

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2020-07-06 11:49:43 Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
Previous Message Anastasia Lubennikova 2020-07-06 10:45:52 Proposal: Automatic partition creation