Re: Report bytes and transactions actually sent downtream

From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Report bytes and transactions actually sent downtream
Date: 2025-07-01 14:05:18
Message-ID: CAExHW5ups+Hyb9jPwmyAUt=WcuzbZpx-3PgLEQeF4tF8gtXWsQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 1, 2025 at 4:23 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Jun 30, 2025 at 3:24 PM Ashutosh Bapat
> <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
> >
> > Hi All,
> > In a recent logical replication issue, there were multiple replication
> > slots involved, each using a different publication. Thus the amount of
> > data that was replicated through each slot was expected to be
> > different. However, total_bytes and total_txns were reported the same
> > for all the replication slots as expected. One of the slots started
> > lagging and we were trying to figure out whether it's the WAL sender
> > slowing down or the consumer (in this case Debezium). The lagging
> > slot then showed total_txns and total_bytes lower than other slots
> > giving an impression that the WAL sender is processing the data
> > slowly. Had pg_stat_replication_slots reported the amount of data
> > actually sent downstream, it would have been easier to compare it with
> > the amount of data received by the consumer and thus pinpoint the
> > bottleneck.
> >
> > Here's a patch to do the same. It adds two columns
> > - sent_txns: The total number of transactions sent downstream.
> > - sent_bytes: The total number of bytes sent downstream in data messages
> > to pg_stat_replication_slots. sent_bytes includes only the bytes sent
> > as part of 'd' messages and does not include keep alive messages or
> > CopyDone messages for example. But those are very few and can be
> > ignored. If others feel that those are important to be included, we
> > can make that change.
> >
> > Plugins may choose not to send an empty transaction downstream. It's
> > better to increment sent_txns counter in the plugin code when it
> > actually sends a BEGIN message, for example in pgoutput_send_begin()
> > and pg_output_begin(). This means that every plugin will need to be
> > modified to increment the counter for it to be reported correctly.
> >
>
> What if some plugin doesn't implement it or does it incorrectly?
> Users will then complain that the PG view is showing an incorrect value.

That is right.

To fix the problem of plugins not implementing the counter increment
logic, we could use logic similar to how we track whether
OutputPluginPrepareWrite() has been called or not. In
ReorderBufferTXN, we add a new member sent_status, an enum with three
values: UNKNOWN, SENT and NOT_SENT. Initially sent_status = UNKNOWN.
We provide a function
plugin_sent_txn(ReorderBufferTXN *txn, bool sent) which sets
sent_status = SENT when sent = true and sent_status = NOT_SENT when
sent = false. In all the end-of-transaction callback wrappers like
commit_cb_wrapper(), prepare_cb_wrapper(), stream_abort_cb_wrapper(),
stream_commit_cb_wrapper() and stream_prepare_cb_wrapper(), if
sent_status = UNKNOWN, we throw an error. If sent_status = SENT, we
increment sent_txns. That will catch any plugin which does not call
plugin_sent_txn(). The plugin may still call plugin_sent_txn() with
sent = true when it should have called it with sent = false or vice
versa, but that's hard to monitor and control.

Additionally, we should highlight in the documentation that sent_txns
is as reported by the output plugin, so that users know where to look
in case they see a wrong or dubious value. I see this as similar to
what we do with pg_stat_replication::reply_time, which may be wrong if
a non-PG standby reports the wrong value. The documentation says "Send
time of last reply message received from standby server", so users
know where to look in case they spot an error.

Does that look good?

I am open to other suggestions.

> Shouldn't the plugin specific stats be shown differently, for example,
> one may be interested in how much data the plugin has filtered because
> it was not published or because something like row_filter caused it
> to skip sending such data?
>

That looks useful. We could track the ReorderBufferChanges that were
not sent downstream and add their sizes to another counter,
ReorderBuffer::filtered_bytes, and report it in
pg_stat_replication_slots. I think we will need to devise a mechanism
similar to the one above by which the plugin tells core whether a
change has been filtered or not. However, that will not be a
replacement for sent_bytes, since neither filtered_bytes nor
total_bytes - filtered_bytes will tell us how much data was sent
downstream, which is crucial to the purpose stated in my earlier email.
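
A small standalone sketch of why the counters are not interchangeable.
It assumes, as an illustration only, that filtered_bytes is accounted
using the decoded size of a change while sent_bytes is accounted using
the size of the message the plugin writes. SketchCounters and
account_change() are hypothetical names and the sizes used with them
are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-slot counters; only the names come from this thread. */
typedef struct SketchCounters
{
	int64_t		total_bytes;	/* decoded size of all changes */
	int64_t		filtered_bytes; /* decoded size of changes not sent */
	int64_t		sent_bytes;		/* bytes written to the output stream */
} SketchCounters;

/*
 * Account for one change. decoded_size is its size in the reorder
 * buffer; wire_size is the size of the message the plugin produced
 * for it and is ignored when the change is filtered.
 */
void
account_change(SketchCounters *c, size_t decoded_size, bool filtered,
			   size_t wire_size)
{
	assert(c != NULL);

	c->total_bytes += (int64_t) decoded_size;
	if (filtered)
		c->filtered_bytes += (int64_t) decoded_size;
	else
		c->sent_bytes += (int64_t) wire_size;
}
```

With one change of decoded size 120 sent as a 45-byte message and one
filtered change of decoded size 200, total_bytes - filtered_bytes is
120 while sent_bytes is 45, so the difference cannot stand in for
sent_bytes.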

--
Best Wishes,
Ashutosh Bapat
