Re: pg_upgrade project: high-level design proposal of in-place

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: "Serguei A(dot) Mokhov" <mokhov(at)cs(dot)concordia(dot)ca>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg_upgrade project: high-level design proposal of in-place
Date: 2005-06-04 23:33:40
Message-ID: 200506042333.j54NXeo21780@candle.pha.pa.us
Lists: pgsql-hackers


I have added a link to this thread on the TODO list under TODO.detail.

---------------------------------------------------------------------------

Serguei A. Mokhov wrote:
> Hello dear all,
>
> [Please CC your replies to me, as I am in digest mode]
>
> Here is, finally, a very high-level design proposal for the pg_upgrade
> feature I was handwaving about a couple of weeks ago. Since I am almost
> done with my move, I can allocate some time for this for 8.1/8.2.
>
> If this topic is of interest to you, please read on to the very end
> before flaming or bashing the ideas. I designed this a while ago and have
> kept updating the design more or less regularly, also reflecting some
> issues from the nearby threads [1] and [2].
>
> This design is very high-level at the moment and not very detailed. I
> will need to figure out more as I go and work out some aspects in finer
> detail. I started poking around asking about initdb-forcing code paths
> in [3], but have had no response so far. If the general ideas are
> accepted, I will push for more information more aggressively :) whenever
> I can't figure something out for myself.
>
> [1] http://archives.postgresql.org/pgsql-hackers/2004-09/msg00000.php
> [2] http://archives.postgresql.org/pgsql-hackers/2004-09/msg00382.php
> [3] http://archives.postgresql.org/pgsql-hackers/2004-08/msg01594.php
>
> Comments are very welcome, especially _*CONSTRUCTIVE*_...
>
> Thank you, and now sit back and read...
>
> CONTENTS:
> =========
>
> 1. The Need
> 2. Utilities and User's View of the pg_upgrade Feature
> 3. Storage Management
>    - Storage Managers and the smgr API
> 4. Source Code Maintenance Aspects
> 5. The Upgrade Sequence
> 6. Proposed Implementation Plan
>    - initdb() API
>    - upgrade API
>
>
> 1. The Need
> -----------
>
> It has been a long-standing problem that PostgreSQL lacks a painless
> upgrade procedure: every new major release requires a dump/restore
> sequence, which can take a long time for a production database and keeps
> it offline. The newer replication-based solutions, such as Slony-I,
> pgpool, and others, can remedy the problem somewhat, but they roughly
> double the storage requirements of a given database while replicating
> from the older server to a newer one.
>
> The proposed implementation of an in-server pg_upgrade facility attempts
> to address both issues at once -- the ability to keep the server running
> and upgrade lazily without doubling the storage requirements (some extra
> disk space will be used, but far from double the size). The in-process
> upgrade will require little downtime and nowhere near the memory, disk,
> and network resources that replication solutions do.
>
>
> Prerequisites
> -------------
>
> Ideally, the (maybe not so ambitious anymore) goal is simply to be able
> to "drop in" the binaries of the new server and run it on the older
> version's data files. I think this is far more feasible now than before,
> since we have several pieces available that should ease the
> implementation:
>
> - bgwriter
> - pg_autovacuum (the one to be integrated into the backend in 8.1)
> - the smgr API for pluggable storage managers
> - initdb in C
> - ...
>
> Together, initdb in C, bgwriter, pg_autovacuum, and the pluggable storage
> manager API make an Upgrade Subsystem for PostgreSQL reasonably complete,
> feasible, and sane to attempt.
>
>
> Utilities and the User's (DBA) View of the Feature
> --------------------------------------------------
>
> Two utilities are proposed:
>
> pg_upgrade (in C)
>
> A standalone utility to upgrade the binary on-disk format from one
> version to another while the database is offline.
> We should always have this as an option.
> pg_upgrade will accept a sub/superset of the pg_dump(all)/pg_restore
> options that do not require a connection. I haven't
> thought this through in detail yet.
>
> pg_autoupgrade
>
> A postgres subprocess, modeled after the bgwriter and pg_autovacuum
> daemons. It runs while the database system operates on an old data
> directory and lazily converts relations to the new format.
>
> In addition to the lazy upgrade process, the pg_autoupgrade daemon can
> be triggered by the following event:
>
> "SQL" level: UPGRADE <ALL | relation_name [, relation_name]> [NOW | time]
>
> While the database is running over the older database files rather than
> being offline, SELECT/read-only queries would be allowed through the
> older storage managers*. Any write operation on old data will follow a
> write-invalidate approach: the upgrade of the affected relations to the
> new format is scheduled right after the relation currently in progress,
> as sketched below.
>
> (* See the "Storage Management" section.)
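>
> To make the write-invalidate idea concrete, here is a rough,
> self-contained sketch of the kind of hook I have in mind. All of the
> names in it (relation_format_version, upgrade_enqueue, and so on) are
> hypothetical stand-ins for this illustration, not existing backend
> functions:
>
> #include <stdio.h>
>
> typedef unsigned int Oid;
>
> #define NEW_FORMAT_VERSION 81   /* illustrative version tag */
>
> /* Stub: the real thing would consult pg_upgrade_catalog. */
> static int
> relation_format_version(Oid relid)
> {
>     return (relid % 2) ? 74 : NEW_FORMAT_VERSION; /* pretend odd OIDs are old */
> }
>
> /* Stub: the real thing would put the relation at the head of the
>  * pg_autoupgrade work queue and make the writer wait. */
> static void
> upgrade_enqueue(Oid relid)
> {
>     printf("scheduling relation %u for upgrade before the write proceeds\n",
>            relid);
> }
>
> /* Hook called before a write touches a relation's pages. */
> static void
> upgrade_write_hook(Oid relid)
> {
>     if (relation_format_version(relid) < NEW_FORMAT_VERSION)
>         upgrade_enqueue(relid);
> }
>
> int
> main(void)
> {
>     upgrade_write_hook(16385);  /* old format: gets scheduled */
>     upgrade_write_hook(16386);  /* new format: nothing to do  */
>     return 0;
> }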
>
> The availability of a relation while its upgrade is in progress is likely
> to be the same as during VACUUM FULL, i.e. the entire relation is locked
> until the upgrade is complete. Maybe we could optimize that by locking
> only particular pages of the relation; I have to figure that out.
>
> The upgrade of indexes can be done using REINDEX, which seems far less
> complicated than trying to convert their on-disk representation. This has
> to be done after the relation itself is converted. Alternatively, the
> index upgrade can simply be done with CREATE INDEX after the relations
> have been upgraded.
>
> The relations to be upgraded are ordered according to some priority, e.g.
> system relations first, then user relations. The upgrade of system
> relations is forced at postmaster startup, and user relations are then
> processed lazily.
>
> So, in a sense, pg_autoupgrade will act like a proxy, choosing the
> appropriate storage manager (like a driver) between the new server and
> the old data files and upgrading them on demand. For that purpose we
> might need to add a pg_upgradeproxy process to intercept backend requests
> and use the appropriate storage manager; there would be one proxy process
> per backend.
>
>
> Storage Management
> ==================
>
> Somebody made it possible to plug a different storage manager into
> postgres, and we even had two of them at some point -- one for the
> magnetic disk and one for main memory. The main-memory one is gone, but
> the smgr API is still there. Some were dubious that we would ever need
> another, third-party storage manager, but here I propose to "plug in"
> storage managers from older Postgres versions themselves! This is where
> the pluggable storage manager API would be handy once fully resurrected:
> instead of hosting third-party storage managers, it would primarily be
> used by the storage managers of different versions of Postgres.
>
> We can take the storage manager code from past maintenance releases,
> namely 6.5.3, 7.0.3, 7.1.3, 7.2.5, 7.3.7, 7.4.5, and 8.0, arrange it in
> an appropriate fashion, and have each implement the API properly. Anyone
> can contribute a storage manager as they see fit; there's no need to get
> them all at once. As a trial implementation I will try to do the last
> three or four, maybe; a rough sketch of the idea follows.
>
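> To illustrate (this is NOT the real smgr API -- the actual f_smgr table
> in the backend has more entry points and different signatures), the
> versioned storage managers could sit side by side in a simple function
> table, and pg_autoupgrade/pg_upgradeproxy would pick the one matching
> the on-disk version recorded for a relation:
>
> #include <stdbool.h>
> #include <stdio.h>
> #include <string.h>
>
> typedef unsigned int BlockNumber;
>
> typedef struct versioned_smgr
> {
>     const char *pg_version;                                    /* "7.4", "8.0", ... */
>     bool (*smgr_read)  (int fd, BlockNumber blkno, char *buf); /* read a page in this version's layout  */
>     bool (*smgr_write) (int fd, BlockNumber blkno, char *buf); /* write a page in this version's layout */
> } versioned_smgr;
>
> /* Hypothetical routines understanding the 7.4 on-disk page layout. */
> static bool md74_read (int fd, BlockNumber blkno, char *buf) { return true; }
> static bool md74_write(int fd, BlockNumber blkno, char *buf) { return true; }
>
> /* Hypothetical routines for the current (new) layout. */
> static bool mdnew_read (int fd, BlockNumber blkno, char *buf) { return true; }
> static bool mdnew_write(int fd, BlockNumber blkno, char *buf) { return true; }
>
> static const versioned_smgr smgr_table[] = {
>     { "7.4", md74_read,  md74_write  },
>     { "new", mdnew_read, mdnew_write },
> };
>
> static const versioned_smgr *
> smgr_for_version(const char *version)
> {
>     for (size_t i = 0; i < sizeof(smgr_table) / sizeof(smgr_table[0]); i++)
>         if (strcmp(smgr_table[i].pg_version, version) == 0)
>             return &smgr_table[i];
>     return NULL;
> }
>
> int
> main(void)
> {
>     const versioned_smgr *old = smgr_for_version("7.4");
>     printf("picked storage manager: %s\n", old ? old->pg_version : "none");
>     return 0;
> }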
>
> Where to put relations being upgraded?
> --------------------------------------
>
> At the beginning of the upgrade process, if postgres detects an old
> version of the data files, it moves them under $PGDATA/<ver> and keeps
> the old relations there until they are upgraded. The relations still to
> be upgraded will be tracked in the pg_upgrade_catalog. Once all relations
> are upgraded, the <ver> directory is removed, the auto-upgrade and proxy
> processes are shut down, and the contents of the pg_upgrade_catalog are
> emptied. The only remaining issue is how to deal with tablespaces (or
> LOCATION in 7.* releases) located elsewhere -- this can probably be
> addressed in a similar fashion, by having a /my/tablespace/<ver>
> directory, as laid out below.
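>
> For illustration only (assuming the whole old data directory moves under
> the version subdirectory, as described above), an upgrade from 7.4 might
> leave the disk laid out roughly like this:
>
>     $PGDATA/
>         7.4/            <- old-format cluster, moved here at startup
>             base/
>             ...
>         base/           <- relations already converted to the new format
>         global/
>         ...
>
>     /my/tablespace/
>         7.4/            <- old-format relations of this tablespace
>         ...             <- converted relations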
>
> Source Code Maintenance
> =======================
>
> Now, some of you may be scared by the amount of similar code we would
> have to maintain across all those storage managers, but in reality they
> would require only as much maintenance as the corresponding releases get
> back-patched in that code area, and some of those releases have not been
> maintained for quite some time already. Plus, I should be around to
> maintain it, should this become a reality.
>
> Release-time Maintenance
> ------------------------
>
> To maintain pg_upgrade itself, one will have to fork a new storage
> manager from the previous stable release and "register" it within the
> system. Alternatively, the new storage manager can be forked when the new
> release cycle begins. Additionally, a pg_upgrade driver version has to be
> added, implementing the steps outlined in the Upgrade API section.
>
>
> Implementation Steps
> ====================
>
> To materialize the above idea, I'd proceed as follows:
>
> *) Provide the initdb() API (quick)
>
> *) Resurrect the pluggable storage manager API to be usable for this
> purpose.
>
> *) Document it
>
> *) Implement pg_upgrade API for 8.0 and 7.4.5.
>
> *) Extract 8.0 and 7.4.5 storage managers and have them implement the API
> as a proof of concept. Massage the API as needed.
>
> *) Document the process of adding new storage managers and pg_upgrade
> drivers.
>
> *) Extract the storage managers of other versions.
>
>
> pg_upgrade sequence
> -------------------
>
> The pg_upgrade API captures the steps below, to be updated for each new
> release.
>
> What to do with WAL? Maybe the upgrade can simply be done using WAL
> replay with the old WAL manager? Not fully, because not everything is in
> WAL, but some WAL recovery may be needed in case the server was not shut
> down cleanly before the upgrade.
>
> pg_upgrade will proceed as follows:
>
> - move PGDATA to PGDATA/<major pg version>
> - move tablespaces likewise
> - optional recovery from WAL, in case the old server was not shut down properly
> -? Shall I upgrade the PITR logs of 8.x, so that one can recover to a
>    point in time in the upgraded database?
> - CLUSTER all old data
> - ANALYZE all old data
> - initdb() new system catalogs
> - Merge in modifications from old system catalogs
> - upgrade schemas/users
> -- variations
> - upgrade user relations
>
> Upgrade API:
> ------------
>
> A first draft, to be refined multiple times, but it conveys the idea:
>
> moveData()
>     movePGData()
>     moveTablespaces()    -- 8.0+
>     moveDbLocation()     -- before 8.0
>
> preliminaryRecovery()
>     - WAL??
>     - PITR 8.0+??
>
> preliminaryCleanup()
>     CLUSTER -- recover some dead space
>     ANALYZE -- gives us stats
>
> upgradeSystemInfo()
>     initdb()
>     mergeOldCatalogs()
>     mergeOldTemplates()
>
> upgradeUsers()
>
> upgradeSchemas()
>     - only for versions > 7.2, else NULL
>
> upgradeUserRelations()
>     upgradeIndices()
>         DROP/CREATE
>
> upgradeInit()
> {
> }
>
> The main body in pseudocode:
>
> upgradeLoop()
> {
>     moveData();
>     preliminaryRecovery();
>     preliminaryCleanup();
>     upgradeSystemInfo();
>     upgradeUsers();
>     upgradeSchemas();
>     upgradeUserRelations();
> }
>
> The API would be something along these lines:
>
> typedef struct t_upgrade
> {
>     bool (*moveData) (void);
>     bool (*preliminaryRecovery) (void);    /* may be NULL */
>     bool (*preliminaryCleanup) (void);     /* may be NULL */
>     bool (*upgradeSystemInfo) (void);      /* may be NULL */
>     bool (*upgradeUsers) (void);           /* may be NULL */
>     bool (*upgradeSchemas) (void);         /* may be NULL */
>     bool (*upgradeUserRelations) (void);   /* may be NULL */
> } t_upgrade;
>
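> For illustration, a driver loop over such a struct could skip the
> optional steps like this (just a sketch refining the pseudocode above;
> it assumes the t_upgrade typedef shown above):
>
> #include <stdbool.h>
> #include <stddef.h>
>
> /* Run one step, treating a NULL entry as "nothing to do for this version". */
> static bool
> run_step(bool (*step) (void))
> {
>     return (step == NULL) ? true : step();
> }
>
> /* Stop at the first failing step so an interrupted upgrade can be retried. */
> bool
> upgradeLoop(const t_upgrade *drv)
> {
>     return run_step(drv->moveData)
>         && run_step(drv->preliminaryRecovery)
>         && run_step(drv->preliminaryCleanup)
>         && run_step(drv->upgradeSystemInfo)
>         && run_step(drv->upgradeUsers)
>         && run_step(drv->upgradeSchemas)
>         && run_step(drv->upgradeUserRelations);
> }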
>
> The above sequence is executed uninterrupted by the pg_upgrade utility,
> or by the pg_autoupgrade daemon. In the former case the upgrade priority
> is simply by OID; in the latter it is too, but it can be overridden by
> the user with the UPGRADE command to schedule relation upgrades, and
> write operations can also change the schedule, with the user's explicit
> choices coming first. The more write requests a relation receives while
> in the upgrade queue, the higher its priority becomes, so the relation
> with the most hits is at the top. In case of a tie, the OID decides, as
> sketched below.
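>
> A sketch of that queue ordering (field and function names are
> illustrative, and I am assuming the lower OID wins a tie): user-requested
> relations first, then the most write hits, then OID. The queue would be
> kept ordered with qsort() or an equivalent heap.
>
> #include <stdlib.h>   /* for qsort() */
>
> typedef unsigned int Oid;
>
> typedef struct upgrade_queue_entry
> {
>     Oid relid;
>     int write_hits;      /* write requests received while queued   */
>     int user_requested;  /* 1 if explicitly scheduled via UPGRADE  */
> } upgrade_queue_entry;
>
> /* qsort comparator: user-requested first, then more write hits, then OID. */
> int
> upgrade_priority_cmp(const void *a, const void *b)
> {
>     const upgrade_queue_entry *x = a;
>     const upgrade_queue_entry *y = b;
>
>     if (x->user_requested != y->user_requested)
>         return y->user_requested - x->user_requested;
>     if (x->write_hits != y->write_hits)
>         return y->write_hits - x->write_hits;
>     return (x->relid < y->relid) ? -1 : (x->relid > y->relid);
> }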
>
> Some issues to look into:
>
> - catalog merger
> - a crash in the middle of upgrade
> - PITR logs for 8.x+
> - ...
>
> Flames and Handwaving
> ---------------------
>
> Okay, flame on; but before you flame, mind you, this is a very early
> version of the design. Some of the ideas may seem far-fetched and the
> contents may seem messy, but I believe this is now more doable than ever,
> and I am willing to put effort into it for the next release or two and
> then maintain it afterwards. It may not get done in one shot, but rather
> incrementally, using input, feedback, and hints from you guys.
>
> Thank you for reading this far :-) I'd like to hear from you if any of
> this made sense to you.
>
> Truly yours,
>
> --
> Serguei A. Mokhov | /~\ The ASCII
> Computer Science Department | \ / Ribbon Campaign
> Concordia University | X Against HTML
> Montreal, Quebec, Canada | / \ Email!
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
