Orderly Shutdown Should Be First Priority

Full application shutdowns do not happen as frequently as in the past. However, the increasing tempo of revisions means that shutdowns and restarts of individual components in real-time and online systems are inevitable. Controlled shutdowns and restarts differ dramatically from uncontrolled shutdowns resulting from hardware and power failures. Despite this undeniable reality, many software components lack a pre-planned capability for controlled quiescence and shutdown.

Mechanical devices, by contrast, have detailed shutdown procedures. Failure to follow the prescribed procedure often leads to physical damage and shortened service life. Examples abound in real life. Shutting down a device is often no longer simply throwing a switch. The “Off” switch starts a sequence of events, lasting anywhere from seconds to tens of minutes, that is required to safely deactivate the apparatus. Consider what happens when shutting off an internal combustion engine (ICE) automobile. Often the fuel feed is immediately cut off. However, depending on the powerplant, it might be unwise to immediately shut off lubrication or cooling. In many cases, one can hear the cooling fan running even minutes after the engine is shut off. Accumulated operating heat does not instantaneously dissipate. Orderly cooling prevents damage to internal components.

Video projectors often contain high-intensity light sources that produce significant heat. It is common to see a warning to avoid unplugging the power until several minutes after the light has been extinguished. Pressing the “Off” switch extinguishes the high-intensity light; however, the cooling fans continue to operate until a timer expires or the internal temperature has decreased to an acceptable level.

While applications and other software components generally lack such physical state, they do have a logical state consisting of open files, network connections, and database connections. An application may also control a physical device that must be placed in a safe state before shutdown.
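One way to make that logical state explicit is to register each resource at the moment it is acquired, so that a single exit path releases everything in reverse order. The sketch below uses Python's standard `contextlib.ExitStack`; the `Resource` class and the three resource names are hypothetical stand-ins for real files and connections.

```python
from contextlib import ExitStack

released = []  # records the release order so the behaviour is visible


class Resource:
    """Hypothetical stand-in for a file, socket, or database connection."""

    def __init__(self, name):
        self.name = name

    def close(self):
        released.append(self.name)


def run(work):
    """Acquire resources, do the work, and always release in reverse order."""
    with ExitStack() as stack:
        for name in ("database", "network", "log file"):
            resource = Resource(name)
            stack.callback(resource.close)  # registered at acquisition time
        work()
    # Leaving the `with` block runs the callbacks last-in, first-out,
    # whether work() returned normally or raised an exception.


run(lambda: None)
print(released)  # ['log file', 'network', 'database']
```

Releasing in reverse order of acquisition matters when later resources depend on earlier ones, for instance a log file that records the closing of the database connection.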

My decades-long professional life has often involved real-time and online systems. I remain amazed at how often systems and sub-components lack mechanisms for orderly shutdown.

Orderly shutdowns most commonly occur during development and testing. Shutdowns and restarts may also occur due to operational requirements, inevitable changes in parameters or configuration, or operational anomalies. An orderly shutdown may also be necessary when an underlying facility or package, e.g., a database, requires update or maintenance.

As an example, updating a database or library package often requires that the database or library be quiescent. This can result in a transposition of the old nursery rhyme:

For want of a nail, the horse's shoe was lost.

For want of a shoe, the horse was lost.

For want of a horse, the rider was lost.

For want of a rider, the battle was lost.

And for want of a battle, the kingdom was lost.

And all for the want of a horseshoe nail.

– George Herbert For want of a nail ^[1]

Therein lies a frequent problem. No component is an island. Components, such as databases, are individuals in a complex interconnected software ecosystem. One seemingly inconsequential component’s failure to execute an orderly shutdown can prevent the entire software assemblage from shutting down in an orderly fashion. A classic example: a single open database connection or file can prevent a database from properly shutting down or a volume from dismounting. Every client must cooperate fully.

More than thirty years ago, while working on a system in New York’s Financial District, I encountered just such a system under development. One of the challenges was the need to reboot the host server for every debugging session. Digging deeper, the underlying cause was straightforward: the primary application made use of a shared common region, and the facility to do an orderly reset of the shared data had not been implemented. Forcing termination of the application resulted in uncompleted database operations and a corrupted common region. Each such termination forced a database recovery, severely impacting debugging turnaround.

Correcting the problem was straightforward: implement an orderly shutdown mechanism. An orderly shutdown allowed the application instances to de-access the common area and the central database. No more reboots; no more database recoveries in normal debugging. Time between successive iterations was reduced from tens of minutes to a handful of minutes or less.
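A minimal sketch of such a mechanism, assuming a POSIX-style process that is asked to stop with SIGTERM or SIGINT. The handler only sets a flag; the main loop finishes the item in hand and then detaches. The callables `process_one_item` and `detach_shared_state` are hypothetical placeholders and do not represent the code of the system described above.

```python
import signal
import threading

shutdown_requested = threading.Event()


def request_shutdown(signum=None, frame=None):
    """Signal handler: record the request and return; no cleanup happens here."""
    shutdown_requested.set()


def install_handlers():
    """Route SIGTERM and SIGINT to the shutdown flag (call from the main thread)."""
    signal.signal(signal.SIGTERM, request_shutdown)
    signal.signal(signal.SIGINT, request_shutdown)


def serve(process_one_item, detach_shared_state, poll_seconds=0.1):
    """Work until shutdown is requested, then detach from shared state."""
    try:
        while not shutdown_requested.is_set():
            process_one_item()                     # an item in progress always completes
            shutdown_requested.wait(poll_seconds)  # idle, but wake early on shutdown
    finally:
        # Hypothetical cleanup: commit or roll back, close connections,
        # release the shared region.
        detach_shared_state()
```

A program would call `install_handlers()` once at startup and then `serve(...)`. Keeping the handler trivial and doing the cleanup on the normal control path avoids running complex code in signal context.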

When implementing a system, it is safe to say that failures and restarts are both inevitable and frequent. The controlled shutdown mechanics should be the first task in system implementation. Orderly shutdowns are at their highest frequency during development and testing.

Examples abound. The Apache HTTP Server is one well-known example. Apache includes a series of commands to shut down and restart HTTP services while the underlying system remains in operation.^[2] Externally, a web server restart or reinitialization is seen as a momentary interruption in HTTP/HTTPS availability, with as much impact as a glitch in the intermediate network.

The Apache documentation provides a good taxonomy of the possible shutdown and restart scenarios. These are a good starting point for implementing controlled shutdowns and restarts:

Graceful restart
Graceful stop
Terminate immediately
Restart now

The “graceful” scenarios allow presently active requests to complete before the requested shutdown or restart. The immediate scenarios take effect at once, terminating any requests still in progress.
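The four scenarios are two independent choices: whether active requests are allowed to finish, and whether the service starts again afterwards. The toy model below makes that explicit. It is an illustration only, not how Apache implements these operations; `Scenario` and `Service` are hypothetical names.

```python
from dataclasses import dataclass, field
from enum import Enum


class Scenario(Enum):
    # value = (let active requests finish, start again afterwards)
    GRACEFUL_RESTART = (True, True)
    GRACEFUL_STOP = (True, False)
    TERMINATE_IMMEDIATELY = (False, False)
    RESTART_NOW = (False, True)

    @property
    def drains(self):
        return self.value[0]

    @property
    def restarts(self):
        return self.value[1]


@dataclass
class Service:
    running: bool = True
    active: list = field(default_factory=list)     # requests in progress
    completed: list = field(default_factory=list)
    abandoned: list = field(default_factory=list)

    def accept(self, request):
        """Admit a request only while the service is running."""
        if not self.running:
            return False
        self.active.append(request)
        return True

    def apply(self, scenario):
        """Carry out one of the four shutdown or restart scenarios."""
        self.running = False                       # first, stop accepting new work
        outcome = self.completed if scenario.drains else self.abandoned
        outcome.extend(self.active)                # finish or drop what was in flight
        self.active.clear()
        self.running = scenario.restarts           # come back up only for restarts


service = Service()
service.accept("request-1")
service.apply(Scenario.GRACEFUL_RESTART)
print(service.completed, service.running)  # ['request-1'] True
```

In every scenario the first step is the same: stop admitting new requests. The scenarios differ only in what happens to work already in progress and in whether the service comes back.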

If the subject services are available through a load balancing device, clearly one should remove the subject end-points from the available pool before initiating the restart. Once a restart has been effected the restarted service can rejoin the active provider pool.
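The sequence can be sketched as follows, assuming the pool of active end-points is a simple set and that the caller supplies the restart and health-check operations. All names here are hypothetical; a real load balancer exposes its own interface for removing and re-adding members.

```python
def restart_behind_balancer(pool, endpoint, restart, is_healthy):
    """Take one end-point out of rotation, restart it, and rejoin it if healthy.

    `pool` is the set of end-points currently receiving traffic; `restart`
    and `is_healthy` are caller-supplied callables taking the end-point.
    Returns True if the end-point is back in the pool.
    """
    pool.discard(endpoint)        # 1. no new requests are routed here
    restart(endpoint)             # 2. e.g. a graceful restart of the service
    if is_healthy(endpoint):      # 3. rejoin only after a successful check
        pool.add(endpoint)
        return True
    return False                  # left out of rotation for investigation


pool = {"app-1", "app-2"}
for endpoint in sorted(pool):     # one at a time, so capacity is never zero
    restart_behind_balancer(pool, endpoint, lambda ep: None, lambda ep: True)
```

Restarting one end-point at a time keeps the remaining members serving traffic throughout.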

In an ideal world, the dependency network from external services to core foundation is straightforward. Since this is not always the case, it is sound practice to deal gracefully with an underlying service being restarted without notice. The task is easier if the underlying service is transactional and carries no context from one transaction to the next, since a client then has nothing to rebuild after reconnecting.
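One common way for a client to tolerate an unannounced restart is to retry the failed call after a short, growing delay. The sketch below assumes the failure surfaces as Python's built-in `ConnectionError`; the function name and its parameters are illustrative choices rather than part of any particular library.

```python
import time


def call_with_retry(operation, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call `operation`, retrying on ConnectionError with exponential backoff.

    Only safe when the operation is stateless or idempotent, so that repeating
    it after a restart of the underlying service cannot apply it twice.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise                           # give up and surface the failure
            sleep(base_delay * 2 ** attempt)    # 0.1 s, 0.2 s, 0.4 s, ...
```

The `sleep` parameter exists so that tests can substitute a function that records the delays without waiting.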

Shutdowns and restarts are inevitable. Providing controlled, orderly mechanisms for shutdowns and restarts reduces the hazards inherent in uncontrolled shutdowns.

Notes

^[1]	George Herbert (1640) “For want of a nail”
^[2]	Apache Software Foundation (2023) “Stopping and Restarting Apache HTTP Server”

References

George Herbert (1640) “For want of a nail” Outlandish Proverbs, no. 499
Apache Software Foundation (2023) “Stopping and Restarting Apache HTTP Server” Retrieved from https://httpd.apache.org/docs/2.4/stopping.html on May 10, 2023