Re: Skipping logical replication transactions on subscriber side

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>, Alexey Lesovsky <lesovsky(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Skipping logical replication transactions on subscriber side
Date: 2022-01-26 04:10:00
Message-ID: CAD21AoA8_ArnMhnKTAH+VN0mEOQWaCYYL9zVWGdSx+u5mAJ2uA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jan 26, 2022 at 7:05 AM David G. Johnston
<david(dot)g(dot)johnston(at)gmail(dot)com> wrote:
>
> On Tue, Jan 25, 2022 at 8:33 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>>
>> Given that we cannot use rely on the pg_stat_subscription_workers view
>> for this purpose, we would need either a new sub-system that tracks
>> each logical replication status so the system can set the error XID to
>> subskipxid, or to wait for shared-memory based stats collector.
>
>
> I'm reading over the monitoring-stats page to try and get my head around all of this. First of all, it defines two kinds of views:
>
> 1. PostgreSQL's statistics collector is a subsystem that supports collection and reporting of information about server activity.
> 2. PostgreSQL also supports reporting dynamic information ... This facility is independent of the collector process.
>
> In then has two tables:
>
> 28.1 Dynamic Statistics Views (describing #2 above)
> 28.2 Collected Statistics Views (describing #1 above)
>
> Apparently the "collector process" is UDP-like, not reliable. The documentation fails to mention this fact. I'd argue that this is a documentation bug.
>
> I do see that the pg_stat_subscription_workers view is correctly placed in Table 28.2
>
> Reviewing the other views listed in that table only pg_stat_archiver abuses the statistics collector in a similar fashion. All of the others are actually metric oriented.
>
> I don't care for the specification: "will contain one row per subscription worker on which errors have occurred, for workers applying logical replication changes and workers handling the initial data copy of the subscribed tables."
>
> I would much rather have this behave similar to pg_stat_activity (which, of course, is a Dynamic Statistics View...) in that it shows only and all workers that are presently working.

I have no objection against having a dynamic statistics view showing
the status of each running worker but I think it should be implemented
in a separate view and not be something that replaces the
pg_stat_subscription_workers. I think pg_stat_subscription would be
the right place for it.

> The tablesync workers should go away when they have finished synchronizing. I should not have to manually intervene to get rid of unreliable expired data. The log file feels like a superior solution to this monitoring view.
>
> Alternatively, if the tablesync workers are done but we've been accumulating real statistics for them, then by all means keep them included in the view - but regardless of whether they encountered an error. But maybe the view can right join in pg_stat_subscription as show a column for "(pid is not null) AS is_active".
>
> Maybe we need to add a track_finished_tablesync_workers GUC so the DBA can decide whether to devote storage and processing resources to that historical information.
>
> If you had kept the original view name, "pg_stat_subscription_error", this whole issue goes away. But you decided to make it more generic and call it "pg_stat_subscription_workers" - which means you need to get rid of the error-specific condition in the WHERE clause for the view. Show all workers - I can filter on is_active. Showing only active workers is also acceptable. You won't get to change your mind so decide whether this wants to show only current and running state or whether historical statistics for now defunct tablesync workers are desired. Personally, I would just show active workers and if someone wants to add the feature they can add a track_tablesync_worker_stats GUC and a matching view.

We plan to clear/remove table sync entries who finished synchronization.

It’s better not to merge dynamic statistics such as pid and is_active
and accumulative statistics into one view. I think we can have both
views: pg_stat_subscription_workers view with some changes based on
the review comments (e.g., removing defunct tablesync entry), and
another view showing dynamic statistics such as the worker status.

> From that, every apply worker should be sending a statistics message to the collector periodically. If error info is not present and the state is "all is well", clear out any existing error info from the view. The attempt to include an actual statistic field here doesn't seem useful nor redeeming. I would add a "state" field in its place (well, after subrelid). And I would still rename the columns to current_error_* and note that these should be null unless the status field shows error (there may be some additional complexity here). Just get rid of last_error_count.
>

I don't think that using the stats collector to show the current
status of each worker is a good idea because of 500ms lag, UDP
connection etc. Even if error info is not present and the state is
good according to the view, it might be out-of-date or simply not
true. If we want to do that, it’s much better to prepare something on
shmem so each worker can store its status (running or error, error
xid, etc.) and have pg_stat_subscription (or another view) show the
information. One thing we need to consider is that it needs to leave
the status even after exiting apply/tablesync worker but we don't know
how many statuses for workers we need to allocate on the shmem at
startup time.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2022-01-26 04:15:51 Re: Skipping logical replication transactions on subscriber side
Previous Message Masahiko Sawada 2022-01-26 04:05:56 Re: Skipping logical replication transactions on subscriber side