Forget Looking Back – Clone and Update Forward

Updates are inevitable. With good planning and preparation, updates go smoothly. Without due care, they can be fraught with challenges. Some install without consequence; others uncover long-undetected defects in applications and underlying software layers.

The existing active system disk is commonly updated in place. Yet storage availability and costs have changed dramatically. Cloning the existing system disk and updating the clone is a far better option: if the update goes awry, or problems are detected later, the original system volume remains a fallback.

Planning for updates requires preparing for the worst and hoping for the best. If the update proceeds without incident, all the better. Preparation and planning are never wasted. We should always consider whether techniques from earlier technological eras should be re-evaluated in the context of present and projected technologies.

Clone and update is not without precedent. Though not described in as many words, it is effectively the technique used for updating a shared system volume in an OpenVMScluster while other OpenVMScluster nodes continue normal operations.[1] The Installation and Upgrade manual refers to this approach as a rolling upgrade: a copy of the system volume is updated standalone on one system and then used to reboot the other OpenVMScluster members in sequence.

Classic OpenVMS upgrades begin with a system disk backup, followed by installation of the update. The system is rebooted after the update installs successfully. If the update was installed on a shared system volume in an OpenVMScluster, a rolling reboot must be performed immediately.[2] If the update installation does not complete successfully, or if other problems appear, restore from the backup and restart the system. There is no quick fallback.

The rolling reboot is mandatory when updating a shared system volume in an OpenVMScluster. Files shared by multiple systems have been updated, and a mix of pre- and post-update files adversely affects system stability.[3] The time pressure of a rolling reboot can itself lead to problems.

We should reconsider the reboot question from a different perspective. Since we are updating a clone of the original system volume, we can repurpose the approach used for an OpenVMS upgrade in the update context. Rather than the image of a rolling wheel, visualize changing a tread on a tracked vehicle, hence a tread reboot: rebooting one node at a time from the updated system volume over a longer interval. During a tread reboot, BOTH the original and updated system volumes are always usable. There is no danger that an external event, e.g., a power interruption or hardware glitch, will result in an uncontrolled adoption of an update. Tread reboots remove the time constraints imposed by a rolling reboot using an updated-in-place volume.

There is never an interruption in operations if an update fails. No restore is necessary; the original system volume remains available for use on a moment’s notice. A server can revert to the previous system by a simple reboot.

The instructions for the VMSINSTAL.COM command procedure and the PRODUCT INSTALL command describe essentially this classic backup, update, and reboot process.

After the update has been installed, experience with the updated system may uncover additional problems.

Qualifying an updated system for production use is often an extended process. Qualification does not guarantee that problems will not emerge later. Some applications and underlying packages may operate correctly; others may not. The probability that a problem will be discovered declines over time. The probability of problems is not the same for all workload components.

Mixed-version and mixed-architecture OpenVMSclusters are common. Mixed-version OpenVMSclusters offer flexibility and leverage. Clone and update leverages mixed-version OpenVMSclusters to reduce risk and downtime.

If your OpenVMS system is not part of an OpenVMScluster, or is the only Alpha (or, in the future, x86-64) processor in the OpenVMScluster, an emulator or virtual machine can be used to do the work of the update, with the updated system volume then transferred to the storage array for production use.

The sequence of backup, update, and possible restore has serious drawbacks. Restoring after a failed update takes time and is not without hazard. Overwriting the partially or fully updated system image destroys the updated disk and the evidence of the problem, and condemns the system manager to redo the same work at a future time.

The update procedures were written long ago, in a far different computing and storage landscape. We should reconsider our update practices in the present landscape, which makes clone and update a highly attractive and cost-effective alternative.

Software environments are not limited to operating systems. Most sites have a portfolio, sometimes vast, of layered products. Each product may need to be requalified after an update. This reality means that any qualification is inherently conditional and incremental.

We start with the upgrade sequence itself and examine how clone and update reduces costs, delays, and risks. Our goal is an update process with less risk, together with a faster and safer regression strategy. These concerns apply whether problems occur during the update itself or are detected later.

Traditionally, the primary precaution against problems is a backup of the system volume taken before installing the update. When mass storage was expensive, e.g., the 1980s, there was little choice. Disk drives were expensive, extra drives an unimaginable luxury.

Interruptions in system availability caused by a restore operation have serious consequences when systems operate 24x7x366.

The backup-update-restore sequence dates from that earlier technological era.

Multi-terabyte (TB) disk drives now weigh less than a pound, fit easily in a hand, and often cost well under US$1,000 each, a cost-per-bit reduction of approximately 175,000x. It is now unusual to allocate an entire physical disk volume as a system volume. Rather, multi-terabyte physical volumes are connected to storage arrays or controllers which carve physical space pools into far smaller logical volumes. A space pool may be a single physical disk drive, or multiple physical drives combined in a striped (RAID0), mirrored (RAID1), n+1 redundant (RAID5), or other configuration.

OpenVMS system volumes (VAX, Alpha, IA64, or x86-64) comfortably fit on a 3GB logical volume.

The contemporary mass storage environment and the OpenVMS logical name facility are symbiotic. Cloning a 5GB logical volume takes mere minutes at present mass storage speeds. Cloning the system volume and then updating the clone produces an updated system volume. The original system volume is unchanged and stands ready as a fallback.

The SYS$SYSDEVICE-rooted system logical names disconnect the system volume location from its role as a system volume. A good guideline is to never use any name other than the SYS$SYSDEVICE-rooted names when referencing the system volume. The SYS$SYSDEVICE and related names are defined in VMS$INITIAL-050_VMS.COM based upon the bootstrap device.[4]

Following that rule, there is no operating difference between $1$DGA1, $1$DGA786, DKB500, or other volume. System volumes are all fungible.
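
As a brief illustration of that guideline (a sketch only; output omitted), the role-based name can be examined and used in place of any physical device name:

$ SHOW LOGICAL SYS$SYSDEVICE
$ ! Reference the system volume by its role, never by physical device name:
$ DIRECTORY SYS$SYSDEVICE:[VMS$COMMON.SYSMGR]

Because every reference goes through SYS$SYSDEVICE, the same command procedures work unchanged whether the node booted from $1$DGA1, $1$DGA786, DKB500, or any other volume.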

Reboot one server from the clone, following the instructions in the relevant Release Notes. One may use a conversational bootstrap with STARTUP_P1 set to "MIN".[5] Install the update on the cloned system disk; the original system volume remains pristine. If running an OpenVMScluster, normal operations can continue uninterrupted on the other nodes in the OpenVMScluster. If running a single OpenVMS system, retreating to the pre-update environment is a quick shutdown and a reboot, a matter of minutes. Modify SYS$MANAGER:SYLOGICALS.COM on the cloned system disk to point at the original cluster-wide system data files, e.g., SYSUAF.DAT, RIGHTSLIST.DAT.[6] Note that at this point a single server is running using the updated system volume with no access to cluster-accessible mass storage.
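
A hedged sketch of the SYLOGICALS.COM adjustment follows, assuming the original cluster-wide files reside on a shared volume referenced here by the illustrative logical name CLUSTER_COMMON; adjust the device and directory to the local layout:

$ ! Point the node booted from the clone at the original cluster-wide files.
$ ! CLUSTER_COMMON is an illustrative name, not a predefined OpenVMS logical.
$ DEFINE/SYSTEM/EXEC SYSUAF     CLUSTER_COMMON:[SYSEXE]SYSUAF.DAT
$ DEFINE/SYSTEM/EXEC RIGHTSLIST CLUSTER_COMMON:[SYSEXE]RIGHTSLIST.DAT

Actual file locations, and the full set of cluster-wide logical names that need redefinition, vary from site to site.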

After the update has been installed, verify that the updated system volume is undamaged.
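
One hedged way to perform that check is a read-only pass with ANALYZE/DISK_STRUCTURE; since the node is now booted from the clone, the SYS$SYSDEVICE name refers to the updated volume:

$ ! Report-only structural verification; no repairs are attempted.
$ ANALYZE/DISK_STRUCTURE SYS$SYSDEVICE: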

In an OpenVMScluster, the OpenVMScluster is now operating as a mixed-version or mixed-update OpenVMScluster, with most of the nodes running the pre-update OpenVMS from the original system volume and a single, possibly virtualized, node running the updated version of OpenVMS from the updated cloned system volume.

The updated OpenVMS system volume can be validated at leisure. As confidence in the update increases, additional servers can be bootstrapped from the updated system volume. When the updated system volume is fully qualified, all remaining members of the OpenVMScluster transition to the updated system volume a small number at a time. The original system volume is still used as a quorum disk and the repository of cluster-wide shared files, as well as a fallback should it be needed.

The cautionary note in the OpenVMS Update Release Notes that one should cycle through a rolling reboot does not apply.[7] The cautionary note relates to updating an active shared system volume. With clone and update, the pre-update system volume remains unchanged. Only one node was using the clone while it was being updated, and that node was rebooted following the update.

This is the same changeover as a rolling upgrade, without the time pressure. Unlike a big bang cutover, if there is a question, time can be taken to clarify the issue before proceeding. In larger 24x7x366 environments, the less pressing time scale enables significant flexibility and reduces risks.

If a problem does occur, reverting to the previous version is simplicity itself. Shut down the affected system(s) and bootstrap from the un-updated system volume.
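
As a sketch of that fallback (console syntax varies by platform and by how the boot device is specified; the device name here is purely illustrative):

$ ! Shut down the node running from the updated volume ...
$ @SYS$SYSTEM:SHUTDOWN
$ ! ... then, at the console, bootstrap the original, pre-update system
$ ! volume, e.g., on Alpha SRM:  >>> BOOT DKB500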

Reviewing an actual example illustrates the benefits.

Consider a mid-size OpenVMScluster with a modest storage array of less than 10 TB.

The storage array provisions logical volumes, among them $1$DGA1, the current system volume, and $1$DGA2, a spare volume that will receive the clone.

The first step is to clone $1$DGA1 onto $1$DGA2 using BACKUP/IMAGE. In this context, one need not worry about log files that are open for write.
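
A hedged sketch of that cloning step follows; qualifiers may need adjustment to local policy, and /IGNORE=INTERLOCK simply lets BACKUP copy files that are open for write, such as log files:

$ ! Mount the target logical volume foreign, image-copy the live system
$ ! volume onto it, then release it for relabeling.
$ MOUNT/FOREIGN $1$DGA2:
$ BACKUP/IMAGE/IGNORE=INTERLOCK $1$DGA1: $1$DGA2:
$ DISMOUNT $1$DGA2: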

Mount $1$DGA2 privately and reset the volume label.

$ MOUNT $1$DGA2: I64VMS842L3
$ SET VOLUME $1$DGA2:/NAME=I64VMS842L3B
Since the system volume was cloned, the quorum disk and the OpenVMScluster group number/password are already correctly set.

All node-specific system root directories, e.g., [SYS0], [SYS1], …, [SYSn], will be present on both $1$DGA1 and $1$DGA2, and the OpenVMS system parameter file(s) are already properly set.

All references to the system device should use one of the SYS$SYSDEVICE-based logical names, so it is a simple matter to bootstrap a single server from the clone, e.g., $1$DGA2 in the example, using STARTUP_P1 set to "MIN", as sketched below.[8] Following the MIN bootstrap, the server will be running as a member of the OpenVMScluster. The update can now be installed.
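
The sketch assumes a conversational bootstrap; the console command that reaches the SYSBOOT> prompt is platform-specific, but the SYSBOOT dialogue itself is the same:

SYSBOOT> SET STARTUP_P1 "MIN"
SYSBOOT> CONTINUE

Setting STARTUP_P1 to "MIN" requests a minimum startup; a later standard bootstrap restores the full startup sequence.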

Once the update has been installed, examine MODPARAMS.DAT. In some cases, OpenVMS updates/upgrades modify MODPARAMS.DAT, necessitating re-editing. After verifying MODPARAMS.DAT, run AUTOGEN and reboot from the updated/upgraded system volume, initially using STARTUP_P1 "MIN" and later with the standard bootstrap. Verify that VAXCLUSTER is set correctly.
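
A hedged sketch of that pass follows; the AUTOGEN phases and the FEEDBACK choice depend on local practice:

$ ! Regenerate system parameters from MODPARAMS.DAT and reboot.
$ @SYS$UPDATE:AUTOGEN GETDATA REBOOT FEEDBACK
$ ! After the reboot, confirm the cluster parameter:
$ MCR SYSGEN SHOW VAXCLUSTER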

When the new system volume has been validated, a tread reboot switches the other members of the OpenVMScluster to the new system volume. The benefit is that a tread reboot eliminates the time pressure of a rolling reboot.

Done properly, this procedure limits downtime to a matter of minutes per server: the time required to shut down and reboot each node. If nodes are set up appropriately, with at least two members providing each service, near-100% application uptime is achievable.

Working with standalone OpenVMS systems incurs more downtime, but clone and update carries less downtime risk than the classic backup-update-restore sequence. If no second processor is available, downtime can be significantly reduced by employing an Alpha emulator (or, in the future, an x86-64 virtual machine) to perform the actual update, either directly using a disk on the SAN or by transferring the disk to the emulated/virtual machine environment and back. If a fallback to the original system volume becomes necessary, the system can always be restarted from the pre-update system volume. A tread reboot is a gradually inclined ramp; fast rolling reboots are effectively a cliff to be scaled, and if a problem is detected, normal operations can be severely impacted.

Controlling instantaneous risk by eliminating cold cutovers reduces the potential for significant problems.

Notes

[1] VMS Software (2017, June). VSI OpenVMS Alpha Version 8.4-2L2 Installation and Upgrade Manual, Section 5.5.2, Rolling Upgrade.
[2] VMS Software (2018). VMS842L1_UPDATE-V0100 ECO Kit Release Notes, Section 2.2, Reboot Requirement.
[3] Ibid.
[4] Gezelter, R. (2020, December 3). OpenVMS STARTUP: Underappreciated Flexibility; [VMS$COMMON.SYS$STARTUP]VMS$INITIAL-050_VMS.COM.
[5] VMS Software (2020, June). VSI OpenVMS System Manager's Manual, Volume 1: Essentials, Section 4.5.3, Booting with Minimum Startup.
[6] [SYSMGR]SYLOGICALS.COM
[7] VMS Software (2018). VMS842L1_UPDATE-V0100 ECO Kit Release Notes, Section 2.2, Reboot Requirement.
[8] [VMS$COMMON.SYS$STARTUP]VMS$INITIAL-050_VMS.COM
