Re: Failed transaction statistics to measure the logical replication progress

From: Greg Nancarrow <gregn4422(at)gmail(dot)com>
To: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Failed transaction statistics to measure the logical replication progress
Date: 2021-11-04 00:53:47
Message-ID: CAJcOf-dqYomG3h0HP0zkecjNCMFS85eYaubLiPPScibyrP_7Qg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Nov 2, 2021 at 12:18 AM osumi(dot)takamichi(at)fujitsu(dot)com
<osumi(dot)takamichi(at)fujitsu(dot)com> wrote:
>
> On Thursday, October 28, 2021 11:19 PM I wrote:
> > I've created a new patch that extends pg_stat_subscription_workers to include
> > other transaction statistics.
> >
> > Note that this patch depends on v18 patch-set in [1]...
> Rebased based on the v19 in [1].
> Also, fixed documentation a little and made tests tidy.
> FYI, the newly included TAP test(027_worker_xact_stats.pl) is stable
> because I checked that 100 times of its execution in a tight loop all passed.
>

I have done some basic testing of the patch and have some initial
review comments:

(1) I think this patch needs to be in "git format-patch" format, with
a proper commit message that describes the purpose of the patch and
the functionality it adds, and any high-level design points (something
like the overview given in the initial post, and accounting for the
subsequent discussion points and updated functionality).

(2) doc/src/sgml/monitoring.sgml
There are some grammatical issues in the current description. I
suggest changing it to something like:

BEFORE:
+ <entry>At least one row per subscription, showing about
transaction statistics and error summary that
AFTER:
+ <entry>At least one row per subscription, showing transaction
statistics and information about errors that

(2) doc/src/sgml/monitoring.sgml
The current description seems a little confusing.
Per subscription, it shows the transaction statistics and any last
error info from tablesync/apply workers? If this is the case, I'd
suggest the following change:

BEFORE:
+ one row per subscription for transaction statistics and summary of the last
+ error reported by workers applying logical replication changes and workers
+ handling the initial data copy of the subscribed tables.
AFTER:
+ one row per subscription, showing corresponding transaction statistics and
+ information about the last error reported by workers applying
logical replication
+ changes or by workers handling the initial data copy of the
subscribed tables.

(3) xact_commit
I think that the "xact_commit" column should be named
"xact_commit_count" or "xact_commits".
Similarly, I think "xact_error" should be named "xact_error_count" or
"xact_errors", and "xact_aborts" should be named "xact_abort_count" or
"xact_aborts".

(4) xact_commit_bytes

+ Amount of transactions data successfully applied in this subscription.
+ Consumed memory for xact_commit is displayed.

I find this description a bit confusing. "Consumed memory for
xact_commit" seems different to "transactions data".
Could the description be something like: Amount of data (in bytes)
successfully applied in this subscription, across "xact_commit_count"
transactions.

(5)
I'd suggest some minor rewording for the following:

BEFORE:
+ Number of transactions failed to be applied and caught by table
+ sync worker or main apply worker in this subscription.
AFTER:
+ Number of transactions that failed to be applied by the table
+ sync worker or main apply worker in this subscription.

(6) xact_error_bytes
Again, it's a little confusing referring to "consumed memory" here.
How about rewording this, something like:

BEFORE:
+ Amount of transactions data unsuccessfully applied in this subscription.
+ Consumed memory that past failed transaction used is displayed.
AFTER:
+ Amount of data (in bytes) unsuccessfully applied in this
subscription by the last failed transaction.

(7)
The additional information provided for "xact_abort_bytes" needs some
rewording, something like:

BEFORE:
+ Increase <literal>logical_decoding_work_mem</literal> on the publisher
+ so that it exceeds the size of whole streamed transaction
+ to suppress unnecessary consumed network bandwidth in addition to change
+ in memory of the subscriber, if unexpected amount of streamed
transactions
+ are aborted.
AFTER:
+ In order to suppress unnecessary consumed network bandwidth, increase
+ <literal>logical_decoding_work_mem</literal> on the publisher so that it
+ exceeds the size of the whole streamed transaction, and
additionally increase
+ the available subscriber memory, if an unexpected amount of
streamed transactions
+ are aborted.

(8)
Suggested update:

BEFORE:
+ * Tell the collector that worker transaction has finished without problem.
AFTER:
+ * Tell the collector that the worker transaction has successfully completed.

(9) src/backend/postmaster/pgstat.c
I think that the GID copying is unnecessarily copying the whole GID
buffer or using an additional strlen().
It should be changed to use strlcpy() to match other code:

BEFORE:
+ /* get the gid for this two phase operation */
+ if (command == LOGICAL_REP_MSG_PREPARE ||
+ command == LOGICAL_REP_MSG_STREAM_PREPARE)
+ memcpy(msg.m_gid, prepare_data->gid, GIDSIZE);
+ else if (command == LOGICAL_REP_MSG_COMMIT_PREPARED)
+ memcpy(msg.m_gid, commit_data->gid, GIDSIZE);
+ else /* rollback prepared */
+ memcpy(msg.m_gid, rollback_data->gid, GIDSIZE);
AFTER:
+ /* get the gid for this two phase operation */
+ if (command == LOGICAL_REP_MSG_PREPARE ||
+ command == LOGICAL_REP_MSG_STREAM_PREPARE)
+ strlcpy(msg.m_gid, prepare_data->gid, sizeof(msg.m_gid));
+ else if (command == LOGICAL_REP_MSG_COMMIT_PREPARED)
+ strlcpy(msg.m_gid, commit_data->gid, sizeof(msg.m_gid));
+ else /* rollback prepared */
+ strlcpy(msg.m_gid, rollback_data->gid, sizeof(msg.m_gid));

BEFORE:
+ strlcpy(prepared_txn->gid, msg->m_gid, strlen(msg->m_gid) + 1);
AFTER:
+ strlcpy(prepared_txn->gid, msg->m_gid, sizeof(prepared_txn->gid));

BEFORE:
+ memcpy(key.gid, msg->m_gid, strlen(msg->m_gid));
AFTER:
+ strlcpy(key.gid, msg->m_gid, sizeof(key.gid));

BEFORE:
+ memcpy(key.gid, gid, strlen(gid));
AFTER:
+ strlcpy(key.gid, gid, sizeof(key.gid));

(10) src/backend/replication/logical/worker.c
Some suggested rewording:

BEFORE:
+ * size of streaming transaction resources because it have used the
AFTER:
+ * size of streaming transaction resources because it has used the

BEFORE:
+ * tradeoff should not be good. Also, add multiple values
+ * at once in order to reduce the number of this function call.
AFTER:
+ * tradeoff would not be good. Also, add multiple values
+ * at once in order to reduce the number of calls to this function.

(11) update_apply_change_size()
Shouldn't this function be declared static?

(12) stream_write_change()

+ streamed_entry->xact_size = streamed_entry->xact_size + total_len;
/* update */

could be simply written as:

+ streamed_entry->xact_size += total_len; /* update */

Regards,
Greg Nancarrow
Fujitsu Australia

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message houzj.fnst@fujitsu.com 2021-11-04 00:55:42 RE: Added schema level support for publication.
Previous Message Yugo NAGATA 2021-11-04 00:31:52 Re: pgbench bug candidate: negative "initial connection time"