Re: pg_upgrade and logical replication

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Subject: Re: pg_upgrade and logical replication
Date: 2023-10-30 09:35:09
Message-ID: CALDaNm0jWM1N6U_=P75kxgWWVaLak8oxOu1XDQQhbe=1h2dy7w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, 27 Oct 2023 at 12:09, vignesh C <vignesh21(at)gmail(dot)com> wrote:
>
> On Thu, 21 Sept 2023 at 11:27, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> >
> > On Fri, Sep 15, 2023 at 03:08:21PM +0530, vignesh C wrote:
> > > On Tue, 12 Sept 2023 at 14:25, Hayato Kuroda (Fujitsu)
> > > <kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
> > >> Is there a possibility that apply worker on old cluster connects to the
> > >> publisher during the upgrade? Regarding the pg_upgrade on publisher, the we
> > >> refuse TCP/IP connections from remotes and port number is also changed, so we can
> > >> assume that subscriber does not connect to. But IIUC such settings may not affect
> > >> to the connection source, so that the apply worker may try to connect to the
> > >> publisher. Also, is there any hazards if it happens?
> > >
> > > Yes, there is a possibility that the apply worker gets started and new
> > > transaction data is being synced from the publisher. I have made a fix
> > > not to start the launcher process in binary ugprade mode as we don't
> > > want the launcher to start apply worker during upgrade.
> >
> > Hmm. I was wondering if 0001 is the right way to handle this case,
> > but at the end I'm OK to paint one extra isBinaryUpgrade in the code
> > path where apply launchers are registered. I don't think that the
> > patch is complete, though. A comment should be added in pg_upgrade's
> > server.c, exactly start_postmaster(), to tell that -b also stops apply
> > workers. I am attaching a version updated as of the attached, that
> > I'd be OK to apply.
>
> I have added comments
>
> > I don't really think that we need to worry about a subscriber
> > connecting back to a publisher in this case, though? I mean, each
> > postmaster instance started by pg_upgrade restricts the access to the
> > instance with unix_socket_directories set to a custom path and
> > permissions at 0700, and a subscription's connection string does not
> > know the unix path used by pg_upgrade. I certainly agree that
> > stopping these processes could lead to inconsistencies in the data the
> > subscribers have been holding though, if we are not careful, so
> > preventing them from running is a good practice anyway.
>
> I have made the fix similar to how upgrade publisher has done to keep
> it consistent.
>
> > I have also reviewed 0002. As a whole, I think that I'm OK with the
> > main approach of the patch in pg_dump to use a new type of dumpable
> > object for subscription relations that are dumped with their upgrade
> > functions after. This still needs more work, and more documentation.
>
> Added documentation
>
> > Also, perhaps we should really have an option to control if this part
> > of the copy happens or not. With a --no-subscription-relations for
> > pg_dump at least?
>
> Currently this is done by default in binary upgrade mode, I will add a
> separate patch to skip dump of subscription relations from upgrade and
> dump a little later.
>
> >
> > +{ oid => '4551', descr => 'add a relation with the specified relation state to pg_subscription_rel table',
> >
> > During a development cycle, any new function added needs to use an OID
> > in range 8000-9999. Running unused_oids will suggest new random OIDs.
>
> Modified
>
> > FWIW, I am not convinced that there is a need for two functions to add
> > an entry to pg_subscription_rel, with sole difference between both the
> > handling of a valid or invalid LSN. We should have only one function
> > that's able to handle NULL for the LSN. So let's remove rel_state_a
> > and rel_state_b, and have a single rel_state(). The description of
> > the SQL functions is inconsistent with the other binary upgrade ones,
> > I would suggest for the two functions
> > "for use by pg_upgrade (relation for pg_subscription_rel)"
> > "for use by pg_upgrade (remote_lsn for origin)"
>
> Removed rel_state_a and rel_state_b and updated the description accordingly
>
> > + i_srsublsn = PQfnumber(res, "srsublsn");
> > [...]
> > + subrinfo[cur_rel].srsublsn = pg_strdup(PQgetvalue(res, i, i_srsublsn));
> >
> > In getSubscriptionTables(), this should check for PQgetisnull()
> > because we would have a NULL value for InvalidXLogRecPtr in the
> > catalog. Using a char* for srsublsn is OK, but just assign NULL to
> > it, then just pass a hardcoded NULL value to the function as we do in
> > other places. So I don't quite get why this is not the same handling
> > as suboriginremotelsn.
>
> Modified
>
> >
> > getSubscriptionTables() is entirely skipped if we don't want any
> > subscriptions, if we deal with a server of 9.6 or older or if we don't
> > do binary upgrades, which is OK.
> >
> > +/*
> > + * getSubscriptionTables
> > + * get information about subscription membership for dumpable tables.
> > + */
> > This commit is slightly misleading and should mention that this is an
> > upgrade-only path?
>
> Modified
>
> >
> > The code for dumpSubscriptionTable() is a copy-paste of
> > dumpPublicationTable(), but a lot of what you are doing here is
> > actually pointless if we are not in binary mode? Why should this code
> > path not taken only under dataOnly? I mean, this is a code path we
> > should never take except if we are in binary mode. This should have
> > at least a cross-check to make sure that we never have a
> > DO_SUBSCRIPTION_REL in this code path if we are in non-binary mode.
>
> I have added an assert in this case, as it is not expected to come
> here in non binary mode
>
> > + if (dopt->binary_upgrade && subinfo->suboriginremotelsn)
> > + {
> > + appendPQExpBufferStr(query,
> > + "SELECT pg_catalog.binary_upgrade_replorigin_advance(");
> > + appendStringLiteralAH(query, subinfo->dobj.name, fout);
> > + appendPQExpBuffer(query, ", '%s');\n", subinfo->suboriginremotelsn);
> > + }
> >
> > Hmm.. Could it be actually useful even for debugging to still have
> > this query if suboriginremotelsn is an InvalidXLogRecPtr? I think
> > that this should have a comment of the kind "\n-- For binary upgrade,
> > blah". At least it would not be a bad thing to enforce a correct
> > state from the start, removing the NULL check for the second argument
> > in binary_upgrade_replorigin_advance().
>
> Modified
>
> > + /* We need to check for pg_replication_origin_status only once. */
> > Perhaps it would be better to explain why?
>
> This remote_lsn code change is actually not required, I have removed this now.
>
> >
> > + "WHERE coalesce(remote_lsn, '0/0') = '0/0'"
> > Why a COALESCE here? Cannot this stuff just use NULL?
>
> This remote_lsn code change is actually not required, I have removed this now.
>
> > + fprintf(script, "database:%s subscription:%s relation:%s in non-ready state\n",
> > Could it be possible to include the schema of the relation in this log?
>
> Modified
>
> > +static void check_for_subscription_state(ClusterInfo *cluster);
> > I'd be tempted to move that into a patch on its own, actually, for a
> > cleaner history.
>
> As of now I have kept it together, I will change it later based on
> more feedback from others
>
> > +# Copyright (c) 2022-2023, PostgreSQL Global Development Group
> > New as of 2023.
>
> Modified
>
> > +# Check that after upgradation of the subscriber server, the incremental
> > +# changes added to the publisher are replicated.
> > [..]
> > + For upgradation of the subscriptions, all the subscriptions on the old
> > + cluster must have a valid <varname>remote_lsn</varname>, and all the
> >
> > Upgradation? I think that this should be reworded:
> > "All the subscriptions of an old cluster require a valid remote_lsn
> > during an upgrade."
>
> This remote_lsn code change is actually not required, I have removed this now.
>
> >
> > A CI run is reporting the following compilation warnings:
> > [04:21:15.290] pg_dump.c: In function ‘getSubscriptionTables’:
> > [04:21:15.290] pg_dump.c:4655:29: error: ‘subinfo’ may be used
> > uninitialized in this function [-Werror=maybe-uninitialized]
> > [04:21:15.290] 4655 | subrinfo[cur_rel].subinfo = subinfo;
>
> I have initialized and checked with [-Werror=maybe-uninitialized],
> let me check in the next cfbot run
>
>
> > +ok(-d $new_sub->data_dir . "/pg_upgrade_output.d",
> > + "pg_upgrade_output.d/ not removed after pg_upgrade failure");
> > Not sure that there's a need for this check. Okay, that's cheap.
>
> Modified
>
> > And, err. We are going to need an option to control if the slot data
> > is copied, and a bit more documentation in pg_upgrade to explain how
> > things happen when the copy happens.
> Added documentation for this, we will copy the slot data by default,
> we will add a separate patch to skip dump of subscription
> relations/replication slot from upgrade and dump a little later.
>
> The attached v9 version patch has the changes for the same.
>
> Apart from this I'm still checking that the old cluster's subscription
> relations states are READY state still, but there is a possibility
> that SYNCDONE or FINISHEDCOPY could work, this needs more thought
> before concluding which is the correct state to check. Let' handle
> this in the upcoming version.

The patch was not applying because of recent commits. Here is a
rebased version of the patches.

Regards,
Vignesh

Attachment Content-Type Size
v9_20231030-0001-Prevent-startup-of-logical-replication-launcher-i.patch text/x-patch 2.0 KB
v9_20231030-0002-Preserve-the-full-subscription-s-state-during-pg_.patch text/x-patch 36.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2023-10-30 09:45:52 Explicitly skip TAP tests under Meson if disabled
Previous Message Dean Rasheed 2023-10-30 09:33:53 Re: Supporting MERGE on updatable views