RE: Failed transaction statistics to measure the logical replication progress

From: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>
To: 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>
Cc: "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "vignesh21(at)gmail(dot)com" <vignesh21(at)gmail(dot)com>, "sawada(dot)mshk(at)gmail(dot)com" <sawada(dot)mshk(at)gmail(dot)com>, "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Failed transaction statistics to measure the logical replication progress
Date: 2022-02-18 08:34:01
Message-ID: TYCPR01MB837343C96B6045E97F311393ED379@TYCPR01MB8373.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thursday, February 17, 2022 6:45 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Thu, Feb 17, 2022 at 3:13 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> >
> > On Tue, Jan 4, 2022 at 5:22 PM osumi(dot)takamichi(at)fujitsu(dot)com
> > <osumi(dot)takamichi(at)fujitsu(dot)com> wrote:
> > >
> > > On Friday, December 31, 2021 10:12 AM Tang, Haiying/唐 海英
> <tanghy(dot)fnst(at)fujitsu(dot)com> wrote:
> > > > 4)
> > > > +void
> > > > +pgstat_send_subworker_xact_stats(LogicalRepWorker *repWorker,
> > > > +bool
> > > > +force) {
> > > > + static TimestampTz last_report = 0;
> > > > + PgStat_MsgSubWorkerXactEnd msg;
> > > > +
> > > > + if (!force)
> > > > + {
> > > > ...
> > > > + if (!TimestampDifferenceExceeds(last_report, now,
> > > > PGSTAT_STAT_INTERVAL))
> > > > + return;
> > > > + last_report = now;
> > > > + }
> > > > +
> > > > ...
> > > > + if (repWorker->commit_count == 0 && repWorker->abort_count
> > > > + ==
> > > > 0)
> > > > + return;
> > > > ...
> > > >
> > > > I think it's better to check commit_count and abort_count first,
> > > > then check if reach PGSTAT_STAT_INTERVAL.
> > > > Otherwise if commit_count and abort_count are 0, it is possible
> > > > that the value of last_report has been updated but it didn't send
> > > > stats in fact. In this case, last_report is not the real time that send last
> message.
> > > Yeah, agreed. This fix is right in terms of the variable name aspect.
> > >
> >
> > Can't we use pgstat_report_stat() here? Basically, you can update xact
> > completetion counters during apply, and then from
> > pgstat_report_stat(), you can invoke a logical replication worker
> > stats-related function.
> >
>
> If we can do this then we can save the logic this patch is trying to introduce for
> PGSTAT_STAT_INTERVAL.
Hi, I've encounter a couple of questions during my modification, following your advice.

In the pgstat_report_stat, we refer to the return value of
GetCurrentTransactionStopTimestamp to compare the time different from the last time.
(In my previous patch, I used GetCurrentTimestamp)

This time is updated in apply_handle_commit_internal's CommitTransactionCommand for the apply worker.
Then, if I update the subscription worker stats(commit_count/abort_count) immediately after
this CommitTransactionCommand and immediately call pgstat_report_stat in the apply_handle_commit_internal,
the time difference becomes too small (falls short of PGSTAT_STAT_INTERVAL).
Also, the time of GetCurrentTransactionStopTimestamp is not updated
even when I keep calling pgstat_report_stat repeatedly.
Then, IIUC the next possible timing that message of commit_count or abort_count
is sent to the stats collector would become the time when we execute another transaction
by the apply worker and update the time for GetCurrentTransactionStopTimestamp
and rerun pgstat_report_stat again.

So, if we keep GetCurrentTransactionStopTimestamp without change,
an update of stats depends on another new subsequent transaction if we simply merge those.
(this leads to users cannot see the latest stats information ?)
At least, I got a test failure because of this function for streaming commit case
because it uses poll_query_until to wait for stats update.

On the other hand, replacing GetCurrentTransactionStopTimestamp with
GetCurrentTimestamp in case of apply worker looks have another negative impact.
If we do so, it becomes possible that we go into the code to scan TabStatusArray with
PgStat_TableStatus's trans with non-null values, because of the timing change.

I might be able to avoid this kind of assert failure if I write the message send
function of this patch before other existing functions to send various type of messages
and return if the process is apply worker in pgstat_report_stat.
But, I can't be convinced that this way of modification is OK.

What did you think about those issues ?

Best Regards,
Takamichi Osumi

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2022-02-18 08:59:46 Re: [Proposal] Add foreign-server health checks infrastructure
Previous Message Masahiko Sawada 2022-02-18 08:26:04 Re: Design of pg_stat_subscription_workers vs pgstats