Re: Design of pg_stat_subscription_workers vs pgstats

From: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Design of pg_stat_subscription_workers vs pgstats
Date: 2022-01-27 20:18:51
Message-ID: CAKFQuwaGo9zOhr_E3eYm+Pevvzi=MCGYo5nkzvkuX35wiyysYw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jan 27, 2022 at 5:08 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> On Thu, Jan 27, 2022 at 11:16 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > On 2022-01-25 20:27:07 +0900, Masahiko Sawada wrote:
> >
> > > There will be some challenges in a case where updating
> pg_subscription_rel
> > > also failed too (what to report to the user, etc.). And moreover, we
> don't
> > > want to consume space for temporary information in the system catalog.
> >
> > You're consuming resources in a *WAY* worse way right now. The stats
> file gets
> > constantly written out, and quite often read back by backends. In
> contrast to
> > parts of pg_subscription_rel or such that data can't be removed from
> > shared_buffers under pressure.
> >
>
> I don't think pg_subscription_rel is the right place to store error
> info as the error can happen say while processing some message type
> like BEGIN where we can't map it to pg_subscription_rel entry. There
> could be other cases as well where we won't be able to map it to
> pg_subscription_rel like some error related to some other table while
> processing trigger function.
>
> In general, there doesn't appear to be much advantage in storing all
> the error info in system catalogs as we don't want it to be persistent
> (crash-safe). Also, this information is not about any system object
> that future operations can use, so won't map from that angle as well.
>

Repeating myself here to try and keep complaints regarding
pg_stat_subscription_worker in one place.

This is my specific email with respect to the pg_stat_scription_workers
design.

https://www.postgresql.org/message-id/CAKFQuwZbFuPSV1WLiNFuODst1sUZon2Qwbj8d9tT%3D38hMhJfvw%40mail.gmail.com

Specifically,

pg_stat_subscription_workers is defined as showing:
"will contain one row per subscription
worker on which errors have occurred, for workers applying logical
replication changes and workers handling the initial data copy of the
subscribed tables."

The fact that these errors remain (last_error_*) even after they are no
longer relevant is my main gripe regarding this feature. The information
itself is generally useful though last_error_count is not. These fields
should auto-clear and be named current_error_* as they exist for purposes
of describing the current state of any error-encountering logical
replication worker so that ALTER SUBSCRIPTION SKIP, or someone other manual
intervention, can be done with that knowledge without having to scan the
subscriber's server logs.

This is my email trying to understand reality better in order to figure out
what exactly is causing the limitations that are negatively impacting the
design of this feature.

https://www.postgresql.org/message-id/CAKFQuwYJ7dsW%2BStsw5%2BZVoY3nwQ9j6pPt-7oYjGddH-h7uVb%2Bg%40mail.gmail.com

In short, it was convenient to use the statistics collector here even if
doing so resulted in a non-user friendly (IMO) design. Given all of the
limitations to the statistics collection infrastructure, and the fact that
this data is not statistical in the usual usage of the term, I find that to
be less than satisfying. To the point that I'd be inclined to revert this
feature and hold up the ALTER SUBSCRIPTION SET patch until a more
user-friendly design can be done using proper IPC techniques. (I also noted
in the first email that pg_stat_archiver, solely by observing the column
names it exposes, shares this same abuse of the statistics collector for
non-statistical data).

In my second email I did some tracing and ended up at the PG_CATCH() block
in src/backend/replication/logical/worker.c:L3629. When mentioning trying
to get rid of the PG_RE_THROW() there apparently doing so completely is
unwarranted due to fatal/panic errors. I am curious that the addition of
the statistic reporting logic doesn't seem to care about the same. And in
any case, while maybe PG_RE_THROW() cannot go away it could maybe be done
conditionally, and the worker still allowed to exit should that be more
desirable than making the routine safe for looping after an error.

Andres, I do not know how to be more precise than your comment "But: You
don't need to. Just abort the current transaction, start a new one, and
update the state.". When I suggested that idea it didn't seem to resonate
with anyone on the other thread. Starting at the main PG_TRY() loop in
worker.c noted above, could you maybe please explain in a bit more detail
whether, and how hard, it would be to go from "just PG_RE_THROW();" to
"abort and start a new transaction"?

David J.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2022-01-27 20:20:18 Why is INSERT-driven autovacuuming based on pg_class.reltuples?
Previous Message Nathan Bossart 2022-01-27 20:18:16 Re: Two noncritical bugs of pg_waldump