Re: Clarification in pg10's pgupgrade.html step 10 (upgrading standby servers)

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Andreas Joseph Krogh <andreas(at)visena(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Clarification in pg10's pgupgrade.html step 10 (upgrading standby servers)
Date: 2017-09-12 23:38:40
Message-ID: 20170912233840.GS4628@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Bruce, all,

* Bruce Momjian (bruce(at)momjian(dot)us) wrote:
> On Wed, Sep 13, 2017 at 12:40:32AM +0200, Andreas Joseph Krogh wrote:
> > På tirsdag 12. september 2017 kl. 23:52:02, skrev Bruce Momjian <
> > bruce(at)momjian(dot)us>:
> >
> > On Tue, Sep 12, 2017 at 08:59:05PM +0200, Andreas Joseph Krogh wrote:
> > >     Improvements?
> > >
> > > Thanks, that certainly improves things.
> > > But; I still find the rsync-command in f) confusing;
> > > 1. Why --size-only? From rsync manual: "skip files that match in size",
> > is this
> > > safe??
> >
> >
> > > 2. Why is old_pgdata in the rsync-command, why is it needed to sync it?
> >
> > If the file exists under the same name, it doesn't need to be checked at
> > all --- it is the same.  We don't want to check the file modification
> > time because it will probably be different because of replay delay or
> > clock drift.  We could use checksums, but there is no need since there is
> > no way the file contents could be different.
> >
> >  
> >  
> > So you're saying that if the file exists (has the same name) on the standby (in
> > old_pgdata), and has the same size, then you're safe that it contains the same
> > data, hence --size-only?
> > Does this apply when not using --link mode for pg_upgrade?
>
> Well, it is really true in every case. For link mode, we have to use an
> rsync command that lists both the old and new clusters on the command
> line (since we need rsync to see those hard links to reproduce them). If
> we don't use --size-only, we are going to checksum check the _old_ data
> cluster. The new cluster will be empty so we will copy all of that (no
> need for a checksum there since there are no files). I think you need
> size-only even without link since that old cluster is going to be listed
> for rsync.

The above is correct- the old and new are required to get rsync to build
the same hard-link tree on the replica as exists on the primary, post
pg_upgrade. Also, if --link isn't used with pg_upgrade then you'd want
--size-only with the existing command or you'd end up probably copying
both the old and new clusters and that'd be a lot of additional work.

Other points of clarification here:

Rsync, by default, does *not* use checksums.

The data files on the replica and the data files on the primary do *not*
match bit-for-bit, --checksum will never work (or, rather, it'll always
end up copying everything except in extremely rare circumstances that
would be pure luck). What matters, however, is that the differences
aren't interesting to PG, any more than they are when it comes to doing
WAL replay.

If --link is *not* used with pg_upgrade, then there's not much point in
using this rsync as it shouldn't be particularly different from just
doing the typical:

rsync --archive new_pgdata remote_dir

post pg_upgrade, though of course that would incur a large amount of
data transfer across the network.

I wouldn't suggest trying to copy the old data dir on the remote to the
new data dir and then doing an rsync- that way lies madness as you would
be copying over catalog files from the old data dir and those could end
up having the same size as the same catalog files post-upgrade on the
primary and then you end up with some odd mix between the two. That's
bad. You'd have to identify the catalog files independently and be sure
to exclude them from the copy and that isn't something I would encourage
anyone to try and do. The rsync --hard-link method with pg_upgrade
--link will get this correct, to be clear.

> Now, what you could do, if you are _not_ using link mode, is to rsync
> only the new cluster, but the instructions we give work the same for
> link and non-link mode and produce the same results in the same time
> even if we had a non-link-mode example, so it seems we might as well
> just give one set of instructions.

For my 2c, at least, I would have specifically pointed out that this
method is really only for when you're using --link mode with pg_upgrade.
If you're not using --link then there's other ways to do this which
would be more efficient than an rsync and which could be done after the
primary is back online (such as doing a backup/restore to rebuild the
replica, or similar).

> > > There are many ways to do/configure things it seems, resulting in many
> > ifs and
> > > buts which makes section 10 rather confusing. I really think a complete
> > > example, with absolute paths, would be clarifying.
> >
> > You mean a full rsync command, e.g.:
> >
> >   rsync --archive --delete --hard-links --size-only \
> >       /opt/PostgreSQL/9.5 /opt/PostgreSQL/9.6 standby:/opt/PostgreSQL
> >
> > Does that help?
> >
> >  
> >  
> > It seems some non-obvious assumptions (to me at least) are made here.
> > This example seems only valid when using pg_upgrade --link, correct? If so it
> > would be clearer to the reader if explicitly stated.
>
> Well, as I stated above, --hard-links is only going to recreate hard
> links on the standby that exist on the primary, and if you didn't use
> pg_upgrade's --link mode, there will be none, so it is harmless if
> pg_upgrade --link mode was not used.

The rsync will recreate the hard links *and* copy the new catalog data
files over from the upgraded primary. It will specifically *not* copy
over into the new cluster anything from the old data dir, and that's
important.

I agree that --hard-links should be harmless if you're not using --link,
but as I say above, this approach doesn't really make sense if you're
not using --link and it can clearly be confusing to people to not have
this method caveated in that way.

> > 1. Why do you have to rsync both /opt/PostgreSQL/9.5 AND /opt/PostgreSQL/9.6,
> > wouldn't /opt/PostgreSQL/9.6 suffice? Or does this assume "pg_upgrade --link"
> > AND "rsync --hard-links" and therefore it somewhat needs to transfer less data?
>
> As I stated above, rsync has to see _both_ hard links on the primary to
> recreate them on the standby. I thought the doc patch was clear on
> that, but obviously not. :-( Suggestions? (Yes, I admit that using
> rsync in this way is super-crafty, and I would _love_ to take credit for
> the idea, but I think the award goes to Stephen Frost.)

Indeed, this is a method I've used previously, with good success, to
speed up getting a replica back online following a pg_upgrade. There's
some additional caveats on it that we haven't even discussed yet here:

Unlogged tables will end up getting copied by this rsync. That's not
the end of the world and won't harm anything, afaik, but having all the
unlogged data copied to the replicas ends up using space on the replicas
unnecessairly and will make the transfer of data take longer as well.

> > 2. What would the rsync command look like if pg_upgrade wasn't issued with
> > --link?
>
> It would look like:
>
> rsync --archive /opt/PostgreSQL/9.6 standby:/opt/PostgreSQL/9.6

Right.

> but effectively there isn't anything _in_ standby:/opt/PostgreSQL/9.6,
> so you are really just using rsync as cp, and frankly I have found 'cp'
> is faster than rsync when nothing exists on the other side so it really
> becomes "just copy the cluster when the server is down", but I don't
> think people even need instructions for that.

Well, the above rsync would go over the network whereas a traditional
'cp' won't.

I tend to agree that we don't really need a lot of documentation around
"copy the resulting cluster to the replica while the server is down",
but that then goes against your argument above that this approach is
good for both --link and without --link.

> Maybe we should recommend rsync only for pg_upgrade --link mode?

Yes, I think we should.

Further, really, I think we should provide a utility to do all of the
above instead of using rsync- and that utility should do some additional
things, such as:

- Check that the control file on the primary and replica show that they
reached the same point prior to the pg_upgrade. If they didn't, then
things could go badly as there's unplayed WAL that the primary got
through and the replica didn't.

- Not copy over unlogged data, or any other information that shouldn't
be copied across.

- Allow the directory structures to be more different between the
primary and the replica than rsync allows (wouldn't have to have a
common subdirectory on the replica).

- Perhaps other validation checks or similar.

Unfortunately, this is a bit annoying as it necessairly involves running
things on both the primary and the replica from the same tool, without
access to PG, meaning we'd have to work through something else (such as
SSH, like rsync does, but then what would we do for Windows...?).

> > 3. What if the directory-layout isn't the same on primary and standby, ie.
> > tablespaces are located differently?
>
> The way we reconfigured the location of tablespaces in PG 9.0 is that
> each major version of Postgres places its tablespace in a subdirectory
> of the tablespace directory, so there is tbldir/9.5 and tbldir/9.6. If
> your tbldir is different on the primary and standby, rsync will still
> work. Everything _under_ the standby dir must be laid out the same, but
> the directories above it can be different.

That's correct, the directory to use for the tablespace actually *is*
the tablespace directory (unlike the base directories, it doesn't need
to be a directory above the tablespace directory, the documentation
could probably be clearer on this point).

As for all of the people raising concerns about if this process is
correct or valid- I contend that the method used above, if done
properly, isn't materially different from what pg_upgrade itself does.
If we can't consider this safe then I'm not sure how we consider
pg_upgrade safe. (yes, I know there are some who don't, and that's a
fair position to take also, but I consider the process above, when
implemented correctly, is essentially the same).

All that said, I honestly hadn't expected this method to end up in the
documentation. Not because I don't trust it or because I wanted to
hoard the process, but because it takes a great deal of care and there's
really additional validation that should be done (as discussed above)
and those are things that I feel reasonable confident I'd remember to do
when using such a procedure but which I wouldn't expect someone new to
PG to realize they should do.

Thanks!

Stephen

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2017-09-12 23:40:24 Re: psql - add special variable to reflect the last query status
Previous Message Andreas Joseph Krogh 2017-09-12 23:35:17 Re: Clarification in pg10's pgupgrade.html step 10 (upgrading standby servers)