RE: Failed transaction statistics to measure the logical replication progress

From: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>
To: 'Masahiko Sawada' <sawada(dot)mshk(at)gmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Failed transaction statistics to measure the logical replication progress
Date: 2021-08-02 05:52:27
Message-ID: OSBPR01MB4888D9048D3D7E7C3D035D54EDEF9@OSBPR01MB4888.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thursday, July 29, 2021 10:50 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> On Thu, Jul 8, 2021 at 3:55 PM osumi(dot)takamichi(at)fujitsu(dot)com
> <osumi(dot)takamichi(at)fujitsu(dot)com> wrote:
> > When the current HEAD fails during logical decoding, the failure
> > increments txns count in pg_stat_replication_slots - [1] and adds
> > the transaction size to the sum of bytes in the same repeatedly on
> > the publisher, until the problem is solved.
> > One of the good examples is duplication error on the subscriber side
> > and this applies to both streaming and spill cases as well.
> >
> > This update prevents users from grasping the exact number and size
> > of successful and unsuccessful transactions. Accordingly, we need to
> > have new columns of failed transactions that will work to
> > differentiate both of them for all types, which means spill,
> > streaming and normal transactions. This will help users to measure
> > the exact status of logical replication.
>
> Could you please elaborate on use cases of the proposed statistics?
> For example, the current statistics on pg_replication_slots can be
> used for tuning logical_decoding_work_mem as well as inferring the
> total amount of bytes passed to the output plugin. How will the user use those statistics?
>
> Also, if we want the stats of successful transactions why don't we
> show the stats of successful transactions in the view instead of ones
> of failed transactions?
It works to show the ratio of successful and unsuccessful transactions,
which should be helpful in terms of administrator perspective.
FYI, the POC patch added the columns where I prefixed 'failed' to those names.
But, substantially, it meant the ratio when user compared normal columns and
newly introduced columns by this POC in the pg_stat_replication_slots.

> > Attached file is the POC patch for this.
> > Current design is to save failed stats data in the ReplicationSlot struct.
> > This is because after the error, I'm not able to access the
> > ReorderBuffer
> object.
> > Thus, I chose the object where I can interact with at the
> ReplicationSlotRelease timing.
>
> When discussing the pg_stat_replication_slots view, there was an idea
> to store the slot statistics on ReplicationSlot struct. But the idea
> was rejected mainly because the struct is on the shared buffer[1]. If
> we store those counts on ReplicationSlot struct it increases the usage
> of shared memory. And those statistics are used only by logical slots
> and don’t necessarily need to be shared among the server processes.
Yes, I was aware of this.
I was not sure if this design will be expected or not for the enhancement,
I thought of changing the design accordingly once the idea gets accepted by the community.

> Moreover, if we want to add more statistics on the view in the future,
> it further increases the usage of shared memory. If we want to track
> the stats of successful transactions, I think it's easier to track
> them on the subscriber side rather than the publisher side. We can
> increase counters when applying [stream]commit/abort logical changes on the subscriber.
It's true that to track the stats of successful and unsuccessful transactions on the *sub*
is easier than on the pub. After some survey, it turned out that I cannot distinguish
the protocol messages between the cases of any failure (e.g. duplication error on the sub)
from user intentional and successful operations(e.g. DROP SUBSCRIPTION and ALTER SUBSCRIPTION DISABLE) on the pub.

If we truly want to achieve this change on the publisher side,
protocol change requires in order to make above cases distinguishable,
now I feel that it is better to do this in the subscriber side.

Accordingly, I'm thinking to have unsuccessful and successful stats on the sub side.
Sawada-san is now implementing a new view in [1].
Do you think that I should write a patch to introduce a new separate view
or write a patch to add more columns to the new view "pg_stat_subscription_errors" that is added at [1] ?

[1] - https://www.postgresql.org/message-id/CAD21AoDeScrsHhLyEPYqN3sydg6PxAPVBboK%3D30xJfUVihNZDA%40mail.gmail.com

Best Regards,
Takamichi Osumi

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Greg Nancarrow 2021-08-02 06:04:18 Re: Parallel Inserts (WAS: [bug?] Missed parallel safety checks..)
Previous Message Thomas Munro 2021-08-02 05:36:23 Re: Background writer and checkpointer in crash recovery