Re: Failed transaction statistics to measure the logical replication progress

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>
Cc: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>, "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "vignesh21(at)gmail(dot)com" <vignesh21(at)gmail(dot)com>, "sawada(dot)mshk(at)gmail(dot)com" <sawada(dot)mshk(at)gmail(dot)com>, "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Greg Nancarrow <gregn4422(at)gmail(dot)com>
Subject: Re: Failed transaction statistics to measure the logical replication progress
Date: 2022-02-23 06:30:26
Message-ID: CAA4eK1LWYc15=ASj1tMTEFsXtxu=02aGoMwq9YanUVr9-QMhdQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Feb 22, 2022 at 6:45 AM tanghy(dot)fnst(at)fujitsu(dot)com
<tanghy(dot)fnst(at)fujitsu(dot)com> wrote:
>
> I found a problem when using it. When a replication workers exits, the
> transaction stats should be sent to stats collector if they were not sent before
> because it didn't reach PGSTAT_STAT_INTERVAL. But I saw that the stats weren't
> updated as expected.
>
> I looked into it and found that the replication worker would send the
> transaction stats (if any) before it exits. But it got invalid subid in
> pgstat_send_subworker_xact_stats(), which led to the following result:
>
> postgres=# select pg_stat_get_subscription_worker(0, null);
> pg_stat_get_subscription_worker
> ---------------------------------
> (0,,2,0,0,,,,0,"",)
> (1 row)
>
> I think that's because subid has already been cleaned when trying to send the
> stats. I printed the value of before_shmem_exit_list, the functions in this list
> would be called in shmem_exit() when the worker exits.
> logicalrep_worker_onexit() would clean up the worker info (including subid), and
> pgstat_shutdown_hook() would send stats if any. logicalrep_worker_onexit() was
> called before calling pgstat_shutdown_hook().
>

Yeah, I think that is a problem and maybe we can think of solving it
by sending the stats via logicalrep_worker_onexit before subid is
cleared but not sure that is a good idea. I feel we need to go back to
the idea of v21 for sending stats instead of using pgstat_report_stat.
I think the one thing which we could improve is to avoid trying to
send it each time before receiving each message by walrcv_receive and
rather try to send it before we try to wait (WaitLatchOrSocket).
Trying after each message doesn't seem to be required and could lead
to some overhead as well. What do you think?

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Etsuro Fujita 2022-02-23 06:30:37 Re: postgres_fdw: commit remote (sub)transactions in parallel during pre-commit
Previous Message vignesh C 2022-02-23 06:29:02 Handle infinite recursion in logical replication setup