Re: Design of pg_stat_subscription_workers vs pgstats

From: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Design of pg_stat_subscription_workers vs pgstats
Date: 2022-02-02 04:11:40
Message-ID: CAKFQuwb8yaWxxH-gSt4NG9HhVnmKK_GnCEotVtjG1JQohOb0Qw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 1, 2022 at 8:07 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> On Tue, Feb 1, 2022 at 11:47 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
> wrote:
>
> >
> > I see that it's better to use a better IPC for ALTER SUBSCRIPTION SKIP
> > feature to pass error-XID or error-LSN information to the worker
> > whereas I'm also not sure of the advantages in storing all error
> > information in a system catalog. Since what we need to do for this
> > purpose is only error-XID/LSN, we can store only error-XID/LSN in the
> > catalog? That is, the worker stores error-XID/LSN in the catalog on an
> > error, and ALTER SUBSCRIPTION SKIP command enables the worker to skip
> > the transaction in question. The worker clears the error-XID/LSN after
> > successfully applying or skipping the first non-empty transaction.
> >
>
> Where do you propose to store this information?

pg_subscription_worker

The error message and context are very important. Just make sure they are
only non-null when the worker state is "syncing failed" (or whatever term we
use).

Records are removed upon server restart (the launcher can handle this).
Consider recording a last activity timestamp (some protection/visibility
against bugs or, say, a worker ending without reporting that fact).
Records stay around even when the worker goes away (the user can filter the
state field to omit inactive rows). I'd consider just removing them when
done and/or having a reset function that the DBA could run (it should never
be wrong to clear the table).
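
A minimal sketch, in C, of the record lifecycle described above. It is not
taken from any patch: the struct, the state names, and every function name
here are hypothetical, and a fixed in-memory record stands in for a catalog
row. It only illustrates the rule that the error fields are set exactly when
the worker is in the failed state, that each report refreshes a last-activity
timestamp, and that clearing the table is always safe.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical worker states; the names are illustrative only. */
typedef enum { WORKER_RUNNING, WORKER_FAILED, WORKER_INACTIVE } WorkerState;

/* Hypothetical stand-in for one pg_subscription_worker row. */
typedef struct {
    unsigned int subid;          /* key: subscription OID */
    unsigned int relid;          /* key: relation OID, 0 for the apply worker */
    WorkerState  state;
    time_t       last_activity;  /* refreshed on every report */
    bool         has_error;      /* models NULL vs. non-NULL error columns */
    char         error_message[256];
} WorkerRecord;

/* Worker made progress: refresh the timestamp and clear any old error. */
void worker_report_activity(WorkerRecord *rec, time_t now)
{
    rec->state = WORKER_RUNNING;
    rec->last_activity = now;
    rec->has_error = false;
    rec->error_message[0] = '\0';
}

/* Worker hit an error: only now do the error fields become non-null. */
void worker_report_failure(WorkerRecord *rec, const char *msg, time_t now)
{
    rec->state = WORKER_FAILED;
    rec->last_activity = now;
    rec->has_error = true;
    snprintf(rec->error_message, sizeof rec->error_message, "%s", msg);
}

/* Worker went away cleanly: keep the row, but with no error attached. */
void worker_mark_inactive(WorkerRecord *rec, time_t now)
{
    rec->state = WORKER_INACTIVE;
    rec->last_activity = now;
    rec->has_error = false;
    rec->error_message[0] = '\0';
}

/* The invariant: error info is present if and only if the worker failed. */
bool worker_record_is_consistent(const WorkerRecord *rec)
{
    return rec->has_error == (rec->state == WORKER_FAILED);
}

/*
 * DBA-style reset: drop every row and return the new row count (always 0).
 * In this sketch a live worker would simply re-create its row on its next
 * report, which is why clearing is never wrong.
 */
size_t worker_records_reset(WorkerRecord *recs, size_t nrecs)
{
    memset(recs, 0, nrecs * sizeof *recs);
    return 0;
}
```

With this shape, filtering on the state field is enough to hide inactive
rows, and a stale last_activity on a row that still claims to be running
points at a worker that ended without reporting that fact.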

Re: XID and/or LSN, I don't know enough yet to really judge this...

> The other possibility
> could be to invent a new catalog for this info but I guess it will
> then have to have some duplicate info from pg_subscription/_rel.

> The other point is after this, do we want an interface where the user
> can also be allowed to specify error_lsn or error_xid?

...but whatever is decided, tell me, the user, what my options are, what the
limitations are, and what info to copy from this catalog into the command(s)
that I issue to the server to make the errors go away. This is generic: it
is not specific to the skip-a-commit command or the skip-to-lsn functions,
and it also covers performing DML on the relevant table(s) to avoid the
error.

I don't think the fields would be duplicated. While some of the fields
seem similar, aside from the key fields the data we would show would be
state info for a given worker. None of the v14 fields do this at the
worker scope.

That all makes the new catalog a generally useful monitoring source and a
standalone patch. I'd personally start a new thread, with a functioning
patch as the first message, and a recap of what and why this rework is
being done. In order for Andres to make progress on the shared memory
statistics patch I would suggest reverting this and building the new patch
as if this statistics collector approach never happened.

I'd still like to get some clarity regarding the observation that our
error-die-restart process seems problematic. Since that process needs to
talk to the new catalog anyway I'd rather commit the changes to the process
(if any, but I hope we can either all agree on the status quo or get
something better in for v15), and the new catalog that provides insight
into that process, as part of this first commit. That includes a probable
user function to restart a halted worker instead of doing so continually
(even with the suggested back-off protocol).

Then the SKIP commit can go in, leveraging the state information exposed in
the catalog. That discussion and work should be restarted on a new thread
with an intro recap message. The existing patch should be adapted to
leverage the new pg_subscription_worker catalog before starting the new
thread.

David J.
