Re: Skipping logical replication transactions on subscriber side

From: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>, Alexey Lesovsky <lesovsky(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Skipping logical replication transactions on subscriber side
Date: 2022-01-25 22:05:22
Message-ID: CAKFQuwZbFuPSV1WLiNFuODst1sUZon2Qwbj8d9tT=38hMhJfvw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 25, 2022 at 8:33 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
wrote:

> Given that we cannot use rely on the pg_stat_subscription_workers view
> for this purpose, we would need either a new sub-system that tracks
> each logical replication status so the system can set the error XID to
> subskipxid, or to wait for shared-memory based stats collector.
>

I'm reading over the monitoring-stats page to try and get my head around
all of this. First of all, it defines two kinds of views:

1. PostgreSQL's statistics collector is a subsystem that supports
collection and reporting of information about server activity.
2. PostgreSQL also supports reporting dynamic information ... This facility
is independent of the collector process.

In then has two tables:

28.1 Dynamic Statistics Views (describing #2 above)
28.2 Collected Statistics Views (describing #1 above)

Apparently the "collector process" is UDP-like, not reliable. The
documentation fails to mention this fact. I'd argue that this is a
documentation bug.

I do see that the pg_stat_subscription_workers view is correctly placed in
Table 28.2

Reviewing the other views listed in that table only pg_stat_archiver abuses
the statistics collector in a similar fashion. All of the others are
actually metric oriented.

I don't care for the specification: "will contain one row per subscription
worker on which errors have occurred, for workers applying logical
replication changes and workers handling the initial data copy of the
subscribed tables."

I would much rather have this behave similar to pg_stat_activity (which, of
course, is a Dynamic Statistics View...) in that it shows only and all
workers that are presently working. The tablesync workers should go away
when they have finished synchronizing. I should not have to manually
intervene to get rid of unreliable expired data. The log file feels like a
superior solution to this monitoring view.

Alternatively, if the tablesync workers are done but we've been
accumulating real statistics for them, then by all means keep them included
in the view - but regardless of whether they encountered an error. But
maybe the view can right join in pg_stat_subscription as show a column for
"(pid is not null) AS is_active".

Maybe we need to add a track_finished_tablesync_workers GUC so the DBA can
decide whether to devote storage and processing resources to that
historical information.

If you had kept the original view name, "pg_stat_subscription_error", this
whole issue goes away. But you decided to make it more generic and call it
"pg_stat_subscription_workers" - which means you need to get rid of the
error-specific condition in the WHERE clause for the view. Show all
workers - I can filter on is_active. Showing only active workers is also
acceptable. You won't get to change your mind so decide whether this wants
to show only current and running state or whether historical statistics for
now defunct tablesync workers are desired. Personally, I would just show
active workers and if someone wants to add the feature they can add a
track_tablesync_worker_stats GUC and a matching view.

From that, every apply worker should be sending a statistics message to the
collector periodically. If error info is not present and the state is "all
is well", clear out any existing error info from the view. The attempt to
include an actual statistic field here doesn't seem useful nor redeeming.
I would add a "state" field in its place (well, after subrelid). And I
would still rename the columns to current_error_* and note that these
should be null unless the status field shows error (there may be some
additional complexity here). Just get rid of last_error_count.

David J.

P.S. I saw the discussion regarding pg_dump'ing the subskipid field. I
didn't notice any discussion around creating and restoring a basebackup.
It seems like during server startup subskipid should just be cleared out.
Then it doesn't matter what one does during backup.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jacob Champion 2022-01-25 22:26:31 Re: Support for NSS as a libpq TLS backend
Previous Message Pavel Stehule 2022-01-25 21:53:00 Re: Schema variables - new implementation for Postgres 15