Multics Technical Bulletin MTB-635
Disk Volumes

To:       Distribution
From:     Benson I. Margulies
Date:     10/19/83
Subject:  Management of Large Physical Disk Drives -- an Overview

1 ABSTRACT

This MTB describes a new design for the management of physical and logical volumes, intended to support larger devices than can be supported today. This effort is necessary to offer a better administrative interface on 3380 class devices, and to support the next generation of disks after that at all. Readers of this MTB should be familiar with the existing disk DIM, physical volume management, and logical volume management subsystems, at least in broad outline.

Comments should be sent to the author:

via Multics Mail:
   Margulies.Multics on either MIT Multics or System M.

via telephone:
   (HVN) 261-9333, or (617) 492-9333

or via the >udd>m>meetings>Disk_Support.forum (disks) forum meeting on System M.

_________________________________________________________________

Multics project internal working documentation. Not to be reproduced or distributed outside the Multics project without the consent of the author or the author's management.

2 INTRODUCTION

There are two problem areas in storage system disk volume management. The first is administrative. When storage system volume management was designed for NSS (the new storage system), disk drives were small. The goal was to relieve administrators of the need to divide up their storage systems into pools of quota which would fit on a DSU190. The solution was to construct logical volumes out of multiple physical volumes.

Today, the situation is reversed. Typical disk drives are so big that they cannot be assigned to a single functional pool of quota. The administrative problem is further complicated by FAMIS, which plans to offer a disk access method that does not use storage system segments.
To dedicate an entire 3380 device to a database, or even a series of databases, is an unreasonable restriction.

Further, all of the drives available when volume management was designed had removable packs. The changes made to support the 5xx devices were the minimum needed to handle non-removable packs. Now, when more and more sites have only non-removable packs, the assumption of removability makes for an unreasonably clumsy administrative and operational interface.

The second problem is one of implementation. Within the supervisor, storage system records are addressed by record number within the physical volume. There are currently 17 bits of record number available in several crucial data structures. This is not enough to represent all the records on the next generation of disk drive after the 3380. A design is needed that not only finds enough bits for this new generation of disk drives, but does not require us to repeat the exercise every few years.

This MTB offers a description of the issues involved in redesigning disk support, and an initial overview of a suitable design. Once these general features of the design are agreed upon, future MTBs can address the details.

Tom Oke (of the U. of Calgary) has made an extensive study of the design of the disk DIM, proposing (and implementing) an alternative strategy for queueing and seek optimization. The redesign of disk support proposed here is the appropriate framework for an implementation of Oke's design. His paper on the disk DIM is attached to this MTB as an appendix.

3 ISSUES TO BE ADDRESSED

The preceding section listed the problems with disk support that compel a new design. This section shows those issues in a little more detail, and also describes some other, less urgent, problems that can be addressed at the same time.

3.1 The Disk DIM

The disk DIM is the lowest layer of disk support.
Its contract is to take a Physical Volume index (PVTX) and a Multics record number or sector number and do the necessary I/O to read or write it. It is the only program with knowledge of the translation of a PVTX to a device number and of a record number to a sector number. It is responsible for seek optimization and error recovery. The following subsections describe changes that should be made in this area:

3.1.1 MORE SECTORS

The disk DIM's queue entries currently are completely packed, and have 22 bits available to record a sector number. They should be restructured to have enough sector bits to cover the next few generations (each doubling in capacity) of disk drives. Note that the disk queue is only used by the disk DIM, so that an occasional change in format to get more sector bits is not unreasonable.

3.1.2 BIGGER SECTORS

All Multics storage system disk I/O is currently done in 64 word sectors. This causes a number of problems. First, it inflates the number of bits needed to describe the desired sector. Second, on writes it requires the controller hardware to do a read-alter-rewrite sequence. It is not clear that we can depend on this feature being available in the indefinite future. Note, though, that to use bigger sectors we will have to reorganize the VTOC. Thus while support for 512 word sectors should be added to the disk DIM, support for 64 word sectors may not be removable just yet.

3.1.3 BETTER QUEUEING AND LOCKING STRATEGY

Tom Oke's paper describes this issue in complete and glorious detail and is attached to this MTB.

3.1.4 DIFFERENT HARDWARE CONNECTABILITY

IBM 3380 subsystems will not support the current Multics concept of a subsystem, since the paths to disk drives are differently organized. To get full performance out of these disks we have to make our configuration and seek optimization more complex.

3.1.5 ERROR RECOVERY

The existing error recovery strategy has three problems.
First, it does not have knowledge of controllers, adaptors and logical channels. All it can do in a "bad path" error is delete the channel, which may not solve the problem. Second, it can print tremendous volumes of messages on the system console, which brings the system to a standstill. Third, its handling of offline disks is primitive. If a process detects a disk offline while it has a crucial paged system lock locked, it will hold the lock until the disk comes back online.

3.1.6 FAMIS SUPPORT

The disk DIM currently multiplexes disks between two access methods, VTOC I/O and Paging I/O. A third is being added for Bootload Multics, though it is effectively just Paging I/O with a different posting mechanism.

The first design decision to be made is how to support further access methods. One possibility is a general scheme resembling io_manager, in which each access method would register itself, presenting an interrupt procedure and receiving a handle. The other possibility is to add each access method to the disk DIM individually. While the second approach is less modular, it allows seek optimization and the like to take the particular access method into consideration.

With either design, it is important that access methods be able to make effective use of prior knowledge of their access patterns, by reading ahead or pre-seeking.

3.2 Page and Segment Control

In page and segment control, there are two issues that will be addressed. First, any change in the VTOC format to use 512 word sectors will be reflected here. Second, the problem with offline disks described above must be addressed here as well as in the disk DIM.

3.3 Disk Administration

3.3.1 DIVIDING UP STORAGE INTO SMALLER CHUNKS

A 3380, or even a 451, is too large a chunk of disk to be assigned to an autonomous administrator in many circumstances. It is necessary to allow sites to dedicate portions of a disk to different uses.
3.3.2 THE ADMINISTRATIVE INTERFACE

The current "three ring circus" commands (xxx_volume_registration) are clumsy, confusing, and only work in the initializer process. We will design something far easier to use.

3.3.3 "SWEET SPOT" ALLOCATION

A study has shown that 10% of the disk storage at a typical site accounts for 90% of the disk accesses. This suggests that it would be worthwhile to allocate the per-process information (the 10%) in the middle of the disk by default, and the permanent information on the outside. A more sophisticated strategy would be to try to automatically migrate segments back and forth between "high use" regions and "low use" regions. Note, also, that the VTOC is a popular region, and would benefit from this treatment.

3.3.4 ALTERNATE TRACK MANAGEMENT

This area affects the disk DIM and page and segment control as well. Current support of alternate tracks is minimal at best. It is excruciatingly difficult to clear all of the data off of a track so that an alternate can be assigned. We make no effort to automate the process of detecting a failing track, inhibiting allocation of new pages on it, and moving existing pages elsewhere. As disk drives get bigger and fewer, requiring sites to do extensive tape saving in order to do routine maintenance becomes less and less reasonable.

4 THE PROPOSED DESIGN

This section is a sketch of a proposed design, followed by a discussion of resource requirements and phasing possibilities.

4.1 Disk I/O

The disk DIM will be reimplemented to use Oke's queueing strategy and address the other issues described above.

4.2 Volume Organization and Management

There will be a new layer of organization in volume management. Physical volumes will be divided into one or more "logical regions", and logical volumes will be constructed from logical regions, rather than physical volumes.
Record addresses that are currently interpreted as offsets within a physical volume will become offsets within a logical region. A logical region may be administratively assigned to some storage system logical volume, or to some other disk access method, such as FAMIS.

The change will be largely transparent to page and segment control. They will continue to work with (pvtx, vtocx, record number) addresses, and the disk DIM will map the PVTX to the correct physical device.

Bad track information will be maintained, so that records for bad tracks can be assigned to place-holder VTOCEs until an alternate can be assigned. Track formatting algorithms will be coded in Multics (rather than just in T&D) to allow automatic alternate assignment.

4.3 Administrative Interface

Disk administration will be a subsystem accessible in any administrator's process. The design goals of this subsystem are to make it easy to specify the usual layout of a disk pack, while offering the option of more complicated, specialized cases. In particular, any limitation in the maximum size of a logical region will be hidden by the administrative software. If an administrator requests a logical region too large for the current software, multiple logical regions will be defined transparently.

A video system application will be used to make the administrative subsystem easy to use. A graphical representation of the disk pack will be shown in one window while requests to change the layout will be accepted in the other.

4.4 Volume Dumper

The volume dumper and reloader will be made knowledgeable of the logical region strategy. Further, the volume dumper will start dumping information in the pack other than VTOCEs and records, such as bad track lists, partitions, and the like.

4.5 Disk Format

This design requires changes to the disk layout. The obvious change is to replace the partition map with a map of defined logical regions.
However, this more complex organization of a pack will make it more vulnerable to damage to the label. Therefore, the basic label will be recorded in several places on the pack. This will reduce the chances of data loss.

To avoid another flag day like that for record stocks, the current disk format will be supported for several releases after the new format is released.

5 IMPLEMENTATION CONSIDERATIONS

This design is clearly more than we can implement in a single release cycle. Even if the resources were available, the debug and qualification effort involved in reimplementing all these things at once would be tremendous. The design divides up into a number of disjoint phases.

5.1 Disk DIM

Reimplementation of the disk DIM is a self-contained project. Any changes to its interfaces can be trivially tracked in its callers. However, the new features should be thoroughly tested with test stubs rather than waiting for the following phases to exercise them. When this phase is done, the system will:

   * have far better disk performance (see Oke's paper),

   * be able to support the FAMIS access method (for an entire disk),

   * have better error recovery,

   * and be able to support 3390 drives in an interim compatibility mode in which Multics software splits each 3390 actuator into two "devices."

This project is a person-year, assuming that four to six months of Tom Oke are available for the disk DIM proper.

5.2 New Volume Format I

In this phase, the new volume layout is used to transform the current partition strategy into a set of logical regions. All of hardcore logical volume management is converted to the logical region design. The "three ring circus" is gutted. However, the current administrative interface is preserved. Multiple logical regions on a device are not yet supported. This is a one person-year project.

5.3 New Volume Format II

In this final phase, the administrative interface is implemented.
Automatic creation of logical regions is implemented. Multiple logical region support is announced. This is a three-month project.

Note that these phases do not necessarily correspond to release boundaries. In particular, the latter two phases may well be in a single release. The time estimates here are generous, allowing for a good deal of test exposure, qualification, and documentation.


Disk System Modification MTB

Tom Oke
January 12, 1983

ABSTRACT

This MTB describes a three phase modification to the existing Multics disk management system. The result of these modifications will be a net decrease in the processor and system overheads necessary to manage the disk system, and a net increase in the throughput and responsiveness of the computer system as a whole.

The total project is broken into three sub-phases. Each sub-phase is necessary to supply groundwork and background upon which to base the next phase, but each phase has its own goals and benefits. It is possible to halt the project at completion of any phase and have a functional system to that level of support.

Phase One removes fixed limitations from the disk sub-systems, provides better utility with the same resources, and permits full utilization of the channel capabilities of the hardware.

Phase Two reduces locking and queueing overheads, permits on-cylinder optimizations, and provides better metering and statistics.

Phase Three introduces an efficient dynamic system optimizer aimed at optimizing total system resources as they apply to the storage system to achieve better system throughput and responsiveness and to dynamically manage these resources according to site defined optimization desires (effectively stated as simple desire rules).

This MTB outlines the conceptual basis for these design modifications, and the expected benefits of each phase of the project. Each phase is outlined in terms of reason, cost, benefit, and expended manpower.
Much of the work necessary to implement PHASES ONE and TWO has already been done on an MR7 level of the disk system software, but this would have to be forward-fitted to the MR10.2 level in order to be useful. All work necessary to implement the changes could be done on the Calgary system and then moved to Phoenix for final integration checkout.

PREMISE

The basic premise of these modifications is that two general forms of storage access operations exist: blocked and un-blocked IO. Blocking IO occurs when either a user process or the operating system must wait for the completion of physical activity to be able to continue execution. Un-blocked IO occurs in situations in which process actions are not dependent upon the completion of IO operations, a normal occurrence for things like page writes and VTOCE writes, which are buffered and not directly requested by a process.

For example, a process becomes blocked when it makes reference to a page which is not in main memory, and causes a demand page read. In this case the process must cease execution until the page becomes available.

The operating system typically will encounter blocking situations only when its paging system processing resources are saturated and it must wait for the completion of some physical activity before resources become freed and the system can continue. ALLOCATION LOCKS are a typical example of paging system saturation. In this situation the system cannot do ANY paging activity until the allocation lock situation is cleared and queue resources become available, so at the point where there are no processes left to execute which do not need pages, the system becomes idle.

Un-blocked IO typically occurs if a queue of IO is output to the storage system which is not a dependency of any process and can continue without causing any process to block.
The danger lies in the transformation of un-blocked to blocked IO when the queue resource becomes saturated, as occurs with ALLOCATION LOCKS, INTERRUPT LOCKS, RUN LOCKS or the attainment of the WRITE-LIMIT threshold. This is particularly important in that it is possible to shut down operations of the entire system, rather than individual processes.

In alleviating these situations the basic intent is that un-blocked IO can be ignored, since it does not halt a process or processor, until sufficient system resources become consumed that the un-blocked IO may potentially be turned into blocking IO due to saturation.

It is further seen that there are levels of blocking. For example, a VTOC demand read is a primary block, since it must occur before a page demand read can possibly complete. A VTOC write is of slightly lesser importance from the viewpoint of blocking processes, but is important to the consistency of the storage system in case of failure. A page demand read is more important than a page demand write, since it causes a process to become ineligible for execution until the page is in main memory. A page demand read may well be more important for system response than a VTOCE write, but this must be balanced with file system consistency and the loading of the VTOC buffers.

Such optimizations have an effect, not only upon the storage system, but upon the efficiency of the operating system itself. If there is a high degree of blocking occurring, this must be countered by a high level of multiprogramming in order to attain sufficient executable processes to statistically fill the available processor cycles. A high level of multiprogramming directly translates into a higher level of system overhead in process management, queue searching and scheduling, and process switching.
To this end a system works most efficiently if it has the minimal degree of blocking possible, and further the users see a more responsive system when blocking is minimized.

PHASE ONE - Alleviation of Allocation Locks, Elimination of Channel Limits

FEATURES: Considerable reduction in ALLOCATION LOCKS over existing system functionality, makes queue resources site tunable, provides better utilization of the same amount of queue resources, permits site declaration of a variable number of disk channels per sub-system, permits declaration of a sub-system to be only a declaration of physical connectability, and not skewed to attempt to optimize allocation of queue or channel resources.

PROBLEM

Allocation locks are a problem which has consistently plagued a loaded Multics system; this is aggravated by unequal disk sub-system loadings. Many sites have sub-split disk sub-systems in an attempt to alleviate this problem by introducing more queuing resource, but this has led to channel allocation and connection problems.

In addition there is a fixed limit of 8 logical channels to each disk sub-system. In current large disk configurations this limits the level of seek overlap which can be maintained, and limits degraded configuration capabilities by failing to permit full exploitation of one of the nicer features of the HIS hardware.

CAUSE

The Multics disk system is split into one or more sub-systems. Each sub-system is given a queuing resource which is sufficient to hold 64 IO requests for the collection of drives which may be attached to that sub-system, and is limited to 8 channels through which to access all the drives.

Due to the burst effect of page writes, as a large number of writes are emitted in cleaning up modified pages, it is quite common for a large number of requests to be queued to a single drive, which may be sufficient to saturate the queue resource for the entire sub-system.
When this occurs, the IO for an entire sub-system waits for the completion of IO for a single drive, a high degree of bottlenecking. In addition there will probably be a number of writes as yet un-emitted at the point of blockage, which will simply extend the period of blockage, since these will be emitted as soon as space in the queue is available for them.

In addition to being a queue resource, the sub-system also describes the physical makeup and connection of the paths (channels and drives) with which to utilize the storage system resources. Because the sub-system definition is bastardized to allocate necessary queue and channel resources according to the drive loadings, the connection problem is made more complex, connection and use are no longer straightforward, and in some connection cases they are made more error prone.

CORRECTION

The first phase of the disk project is to reduce or alleviate these problems, and further to make the queuing and channel resources site tunable parameters to account for the varying characteristics of individual site workloads.

The primary cause of allocation lock problems is the allocation of free queue resources as a fixed function of a disk sub-system. Since drive loadings are typically a function of logical volume content and transitory system loadings, the corresponding queue loadings are also typically transitory. The allocation of a fixed resource to a dynamic load, and the constraint of the size of that resource, takes its toll in system operation. In most cases where ALLOCATION LOCKS occur only one sub-system, and sometimes only one drive, is significantly loaded; this can be seen as a poor utilization of resources.

The first phase of the disk project removes the 'free_q' in each sub-system structure, and makes it a system wide resource, with the number of queue elements being a CONFIG DECK tunable parameter. Thus if a site requires a larger queue resource it can be created at boot time with a CONFIG DECK parameter change.
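
The effect of pooling the free queue can be illustrated with a small sketch. It is not taken from the disk DIM: the function name, the sub-system names "dska" and "dskb", and the burst sizes are all hypothetical, and it assumes for simplicity that no IO completes while the burst is being queued. Only the figure of 64 entries per sub-system comes from the text above.

```python
from collections import Counter

ENTRIES_PER_SUBSYSTEM = 64  # per-sub-system queue size quoted in the text


def blocked_requests(burst, pool_sizes, pool_of):
    """Count requests that find their free-queue pool empty.

    Simplifying assumption: no IO completes during the burst, so
    entries are only taken from a pool, never returned to it.
    """
    remaining = dict(pool_sizes)
    blocked = Counter()
    for subsystem in burst:
        pool = pool_of(subsystem)
        if remaining[pool] > 0:
            remaining[pool] -= 1      # a free entry was available
        else:
            blocked[subsystem] += 1   # would wait: an allocation lock
    return blocked


# Hypothetical burst: 100 page writes to "dska" and 10 to "dskb".
burst = ["dska"] * 100 + ["dskb"] * 10

# Current scheme: each sub-system owns its own 64 entries.
fixed = blocked_requests(
    burst,
    {"dska": ENTRIES_PER_SUBSYSTEM, "dskb": ENTRIES_PER_SUBSYSTEM},
    lambda subsystem: subsystem,
)

# Phase One scheme: the same 128 entries held in one system-wide pool.
shared = blocked_requests(
    burst,
    {"system": 2 * ENTRIES_PER_SUBSYSTEM},
    lambda subsystem: "system",
)

print(dict(fixed))   # requests blocked with per-sub-system pools
print(dict(shared))  # requests blocked with the system-wide pool
```

With the same total number of entries, the fixed split blocks part of the burst aimed at the busy sub-system while the idle sub-system's entries go unused; the shared pool absorbs the whole burst in this example.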
With the queue resource limit no longer a function of sub-systems, one is now able to express the true connectability of a sub-system in its definition without clouding the issue by attempting to alleviate a completely distinct problem.

Another problem exists in the fixed limit of eight disk channels per disk sub-system. Since in the HIS hardware there can be as many as eight disk channels per Physical Link Adaptor (LA), this permits full exploitation of neither disk seek overlap nor the hardware limits without mis-declaring the disk sub-systems as sub-sets of the true connectability.

The first phase of the disk project also makes this channel table a configuration dependent table, which can grow to the size necessary to handle all the channels configured logically to the MPC's which connect a set of disk drives. Thus if more logical channels can be configured according to the physical makeup of the IOM/MPC combinations, then they can also be configured in the software to the sub-system which defines the drives they connect. This will typically increase the level of seek overlap, and hence the sub-system IO throughput rates, as well as permitting over-commitment of channels to a sub-system (more channels than drives) to handle degraded operation situations without a corresponding degradation of service capabilities.

AFFECTED ROUTINES

The following list of routines may be incomplete; it is from MR7 information and has not yet been updated to the necessary level of an MR10.2 baseline.

Routine                  Function
-------                  --------
dskdcl.incl.(alm pl1)    Defines the data structures involved.
get_io_segs.pl1          Initializes size of disk_data for queue so
                         segment can be wired.
disk_init.pl1            Generates and initializes queue entries
                         according to CONFIG DECK parameters.
dctl.alm                 ALM disk driver, must manage queue allocation
                         and free.
disk_control.pl1         PL/I disk driver, must manage queue allocation
                         and free, and situations of ESD.

EXPECTED EFFECT OF MODIFICATION

It is expected that the effect of the modification will be a great reduction in ALLOCATION LOCKS even in the busiest system. It will further permit tuning of the 'free_q' resource according to the requirements of a site, and full declaration of the channel capabilities of the hardware.

Since the 'free_q' resource is no longer tied to a single sub-system there will be much better utility obtained per queue element allocated, and one will not have to sub-split disk sub-systems to attempt to allocate more queue resource.

Since one is no longer limited in the declaration of channels to a sub-system, there will no longer be a need to sub-split sub-systems to permit sufficient channels per drive. Thus a sub-system will be a true declaration of the connectability of disk drives.

EXPECTED COST OF MODIFICATION

It is expected that this modification will take roughly a man-week from its present state to be ready for testing. A further week should be allocated for testing and production of statistics either confirming or denying the above expectations, and quantifying the actual results.

REQUIRED TESTING

To be valid, testing should include ALL expected, and emergency, disk situations. A system should be load tested both in the current state and after modification to determine bottleneck points.

Further testing should include shutdown situations: normal shutdown, ESD, and shutdown situations in which an MPC or drive failure is induced to test the robustness of the modifications.

Testing should also include salvaging situations on system startup and should include salvage of both the root and public drives.
PHASE TWO - Modification of queuing and locking

FEATURES: Reduction of queue scanning overheads, reduction of lock contention and delays, better sub-system metering and statistics, on-cylinder seek optimizations under all situations.

PROBLEM

The current method of locking and the grouping of all requests into two common queues artificially constrains access and increases overheads in management of disk sub-systems.

CAUSE

The current method of queuing and locking of sub-systems uses a pair of queues per disk sub-system, with a common lock controlling access to the entire sub-system. This has the effect of creating an artificial bottleneck with the constrained access through the lock to all functions to be performed on the sub-system. This lock is necessitated by the use of two queues, common to all drives of the sub-system, a high priority queue and a low priority queue.

CORRECTION

There are two situations to correct, but one is dependent upon the other.

The correction in this case is to make a separate queue for each drive, rather than a set of queues for the entire sub-system. This reduces the number of requests which get scanned to determine a nearest-seek candidate for IO on a single drive.

Further, by eliminating the complete separation between the high priority IO queue and the low priority IO queue within the sub-system, and combining them into a single queue per drive, we will be able to optimize situations of on-cylinder mixes of high and low priority requests.

One method of retaining the logical separation between high priority requests (which should be done first if they require a head seek) and low priority requests (which are non-blocked) is through the use of a multiplier to make low priority requests look much longer than high priority seek requests.
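
As a rough illustration of the multiplier idea, and of the additive offset alternative that the following paragraphs contrast it with, the sketch below picks the next request from a single per-drive queue by smallest logical seek length. The names, the 800-cylinder spindle, and the sample queue are hypothetical and are not taken from the disk DIM; the sketch only assumes that seek length is the cylinder distance from the current head position.

```python
from dataclasses import dataclass

CYLINDERS = 800  # hypothetical number of cylinders on the spindle


@dataclass(frozen=True)
class Request:
    cylinder: int
    high_priority: bool


def multiplier_length(head, req, factor):
    """Logical seek length when low priority seeks are multiplied."""
    physical = abs(req.cylinder - head)
    return physical if req.high_priority else physical * factor


def offset_length(head, req, factor):
    """Logical seek length when low priority seeks get a fixed offset."""
    physical = abs(req.cylinder - head)
    return physical if req.high_priority else physical + factor


def pick_next(head, queue, length_fn, factor):
    """Nearest-seek choice over one per-drive queue of mixed priorities."""
    return min(queue, key=lambda req: length_fn(head, req, factor))


head = 100
queue = [
    Request(cylinder=700, high_priority=True),   # long high priority seek
    Request(cylinder=100, high_priority=False),  # on-cylinder low priority
    Request(cylinder=101, high_priority=False),  # one cylinder away, low
]

# Multiplier: the on-cylinder low priority request has logical length 0
# and is serviced first; off-cylinder low priority work still waits.
by_multiplier = pick_next(head, queue, multiplier_length, CYLINDERS)

# Offset: every low priority request is at least CYLINDERS long, so the
# high priority seek always wins, as with fully separate queues.
by_offset = pick_next(head, queue, offset_length, CYLINDERS)

print(by_multiplier)
print(by_offset)
```

In this example the multiplier picks the on-cylinder low priority request, while the offset picks the high priority request 600 cylinders away.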
A normal nearest-seek physical seek-length is calculated, and transformed into a logical seek length by multiplying it by the separation factor for that type of IO. Another method is through the use of a seek offset, to increase the logical length of a seek to a low priority IO request.

Using a multiplier of the same value as the number of cylinders on the spindle will give complete separation between high and low priority seeks, while retaining on-cylinder optimization. Using an offset of the same value as the number of cylinders on the spindle will give exactly the same effect as having completely separate queues for high and low priority IO and will lose on-cylinder optimization.

Once the common queue has been split into queues per drive, then the locking strategy becomes simpler. Each sub-system will then consist of three resources:

1. The drives of the sub-system. These contain the actual requests which need to be done and are a description of the requests for the drive complete in themselves. They are not dependent upon acquiring any other sub-system resource.

2. The channels of the sub-system. These contain access to the physical path necessary to actually effect the IO. These are a simple blocking mechanism. Once a free_q element has been acquired, and a drive has been acquired, then a channel is requested. If it is not available the IO is simply queued (as it is if the drive is doing IO). The operating system does not block due to unavailability of a channel.

3. The metering base per sub-system. This consists of two parts:

   a. The immutably updateable meters for the sub-system. These will be meters updated with simple immutable instructions (i.e. 'aos') which implicitly lock through the processor/memory access.

   b. Meters requiring locked instruction sequences, such as timers and error counts which cannot be updated with immutable instructions, and situations which require a sub-system wide lock.
      Typically this lock access will be straight-line and will not control physical resources which could introduce realtime delays.

In situations requiring the sub-system-wide resource, such as channels and drives, the locking procedure would be to lock the sub-system lock, then the individual drive locks. In this manner one would not require any individual wait on the normal sub-system lock.

AFFECTED ROUTINES

The following routines will be affected by this level, in addition to all the routines already affected by PHASE ONE, as listed above.

Routine                         Function
-------                         --------
disk_queue.pl1                  A metering routine to check disk queue
                                loading and channel use.
disk_meters.pl1                 A metering routine to provide sub-system
                                use statistics. It will need updating
                                for the metering structure changes.
device_meters.pl1               As for disk_meters.
spg_fs_info_.pl1                As for disk_meters.
ioi_assign_disk_channels.pl1    Changes for channel locking etc.

EXPECTED EFFECT OF MODIFICATIONS

These split-outs will decrease the throughput demands upon the individual locks and will reduce situations which require realtime based lock delays.

The net effect of the combined modifications will be faster locking, reduction of locking overheads and queue management overheads, and the introduction of on-cylinder IO optimization.

Further, moving some of the meters which are really drive specific, rather than sub-system specific, into the drive structure will lessen sub-system lock requirements, and introduce new statistical collection situations which will provide better metering and meterability of disk sub-systems and drive bottleneck situations.

EXPECTED COST OF MODIFICATIONS

To do a thorough job of this portion of the modifications there should be some consultation with the current CISL developers and a definition of possible interactions of the modifications with future system development and planning.
Completing the modifications of this stage, and verifying that none of the error recovery/reporting functionality has been lost, would take in the range of one to two man-weeks.

Testing should cover the same basic range as for PHASE ONE, attempting to validate the modifications in all possible and impossible situations. In addition there should again be a thorough collection of statistics, this time comparing a normal system, the PHASE ONE modified system, and the PHASE TWO system. This will provide a metering base to track the success of the design and design criteria. It will also provide information to be finally placed into site tuning documentation.

PHASE THREE - Adaptive optimization of IO

PROBLEM

The optimization of disk IO keeps a complete separation of high and low priority requests. This produces bottlenecking of IO optimization, which has in the past produced ALLOCATION LOCK problems. Even with the PHASE ONE and TWO modifications, major burst IO characteristics will remain unoptimized in situations requiring throughput, because the best demand IO response is always favored.

While favoring the best demand IO response is a desirable characteristic in disk system management, the blindness with which it is followed sometimes produces suboptimal system response to storage system demands, and requires higher multiprogramming to attain full system efficiency. Further, it does not necessarily produce maximum user/system responsiveness, since it ties up large amounts of memory to hold buffers for the poorly optimized IO.

CAUSE

The current method of disk optimization does not vary with the changing demands of the system. As a result it is an attempt at a fixed general solution to a rapidly changing dynamic situation. VTOCE IO is highly optimized since it is important, but its importance is not differentiated from the viewpoint of the computer system, as opposed to that of the disk system.
Thus the demand read characteristic of a VTOCE read is no better optimized than the lower priority VTOCE write. (In this case the priorities are on a blocking basis, rather than the requirement for a consistent IO system.)

The current complete separation between page read and write will be somewhat optimized by the PHASE ONE and TWO changes, but there will still be no method to improve the IO throughput optimization of disk page writes as the IO system starts to saturate with them. Thus ALLOCATION LOCKS are still possible. If they are avoided by simply increasing the size of the queue, and WRITE_LIMIT, the system will degrade through the loss of a large number of available pages. In other words, the inability to respond to the changing requirements of a storage system as the inherent priority situations change will increase system overheads and delays beyond what is necessary.

CORRECTION

The final stage of the disk system modifications is termed ADAPTIVE DISK optimization. This optimization depends upon a site setting tuning parameters which define the site's view of the importance of two situations, on an IO type by IO type basis. The two situations are:

1. Maximum response. This is the degree of optimization to give to IO requests of this type in situations where there is no IO of this type queued up. It essentially defines the importance of doing this IO with respect to any other IO, without regard to queue loadings.

2. Maximum throughput. This is the number of queued IO requests of this type at which the system should respond with the maximum possible throughput optimization. It essentially defines a limit of resource allocation at which the system must protect itself by attempting to speed the throughput of IO requests of this type to clear the queue.
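One way to read the two tuning values is as the end points of a straight line in the loading/optimization plane. The following is a minimal sketch in Python (not the PL/1 of the supervisor), under stated assumptions: the first point sits at a loading of 1, the value is a seek-length multiplier where 1 means full optimization, and the multiplier is held at 1 once the maximum throughput loading is reached. The names IoTuning, response_multiplier and throughput_loading are hypothetical, and the numbers are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IoTuning:
    """Hypothetical per-IO-type tuning pair (names are illustrative only)."""
    response_multiplier: float  # multiplier applied to the first queued request
    throughput_loading: int     # queue loading at which the multiplier reaches 1

    def multiplier(self, loading: int) -> float:
        """Seek-length multiplier for `loading` queued requests of this type.

        The multiplier is the inverse of the degree of optimization:
        1 means fully optimized, larger values push the request back.
        """
        if loading < 1:
            raise ValueError("loading counts queued requests and starts at 1")
        if loading >= self.throughput_loading:
            return 1.0  # assumption: clamp at full optimization past the limit
        # Straight line from (1, response_multiplier) to (throughput_loading, 1).
        span = self.throughput_loading - 1
        fraction = (loading - 1) / span
        return self.response_multiplier + (1.0 - self.response_multiplier) * fraction


# Illustrative numbers only: start at 800, be fully optimized at 3 queued.
example = IoTuning(response_multiplier=800.0, throughput_loading=3)
weights = [example.multiplier(n) for n in (1, 2, 3, 4)]
```

With these illustrative values the multiplier falls from 800 for the first queued request to 1 at the third, and stays at 1 thereafter. A type that should always be fully optimized can be expressed as IoTuning(1.0, 1).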
Though these two values are simple points, they are taken as the definition of a straight line which determines the optimization to be afforded IO requests of each IO type at any point within the x-y space of optimization and queue loading. NOTE that these optimizations are per IO type, and IO types do not necessarily have a relationship to each other. The optimization value assigned is a multiplier, and is the inverse of the degree of optimization.

Where PHASE TWO separated the page read and page write IO's by weighting the physical seek length by a multiplier to make it a logical seek length, this modification simply makes this weighting factor a function of the queue loading and the desired initial optimization. The use of a logical seek length permits the existing nearest-seek-first algorithm to produce a true optimization according to the desired criteria.

As queue loadings increase, the desirability to the system as a whole of increasing the throughput of a certain IO type increases. (This desirability is evaluated by the site and set as a tuning parameter.) The specific situation of loading and priority is represented as a point along the defined line in the loading/optimization plane. The relationship of this IO type to other IO types is defined by the optimization value produced for this type in relation to the optimization values produced for the other IO types. So the situation is widely dynamic and operates beyond the bounds of the two dimensions input as tuning parameters.

By indicating different optimizations and loadings for each IO type it is possible to have their optimizations cross each other at different points during normal system operation. For example:

A site sets tuning parameters such that VTOC read is always maximally optimized, by indicating optimization = 1, loading = 1.
VTOC write is seen as a clearing operation which will not block the system, but which should not be left too long, due to the constrained resource of VTOC buffers and the need for storage system consistency. So parameters are set to have an optimization of the number of cylinders of a drive for the first VTOC write, but to fully optimize if 3 VTOC writes are outstanding. This gives complete separation for non-on-cylinder VTOC operations. A page read is seen as a high priority operation, since it blocks, but as less demanding than a VTOC read, which is necessary to unlock access to a number of potential page reads. So the site sets the initial optimization at 1/4 of a drive's cylinders (200), but requires full optimization if more than 1/2 of the 'maxe' processes are waiting for pages, to optimize multiprogramming. A page write is seen as the lowest priority of all, but it will cause blocking if too many are queued up. So the site sets the initial optimization as the number of cylinders of a drive, but requires full optimization at 1/2 'free-q' allocation.

As can be seen, a number of factors have been considered and are in effect. The instantaneous optimization of the system will take into account all the above situations dynamically. For example, VTOC reads will fly through, but if we get up to 3 VTOC writes per drive they will get fully optimized too. Page reads will get nearly maximal throughput, and will fully optimize if too many processes get bottlenecked on any particular drive. But if we get up to 3 VTOC writes outstanding they will surpass page reads in optimization until the demand slacks off. Finally, page writes will be allowed to queue up to a high degree, but not high enough to start to block system operation.

What is perhaps not totally obvious in addition is the effect of grouping which will occur through this optimization technique.
For example, the optimization of any IO type depends not only upon the optimization factor applied, but also upon the nearness of the true physical position of an IO seek of its type, in relation to the nearness of the true physical position of an IO seek of another type. Thus we may hold off doing writes for a while until they build up, but when we start to do them the statistics are fairly good that we will be able to do a high degree of local seek length optimization, through the buildup of candidates within that area. When the span between areas, in relation to the current queue loadings, reaches a dynamic separation point, we will return to doing optimization of the higher priority IO, and will probably be able to do group optimization of that too.

So the optimizations afforded by the above method go well beyond the simple possibilities of a non-dynamic method, and in fact out-reach the imaginations of those entering the parameters. It is a means to put extra intelligence into the managing of a computer system as a whole, and not just the storage system. It is, however, an intelligence which follows exactly the dictates given to it, though the final effect may well surpass the generality that was presumed for it. In other words, it will do what you want, even in situations you might not have accounted for, and which you do not have to account for.

HISTORY

Some history of these proposals is appropriate. They were first conceived about three years ago, though in a rougher form. Over the succeeding three years they have been put into effect, to a slightly limited extent, on a UNIX system owned by the Department of Computer Science, running on a VAX 11/780.
On this system, which had a different queuing method without the locking and 'free_q' problems of MULTICS, only the adaptive optimization technique and a correctly functioning 'nearest-seek-first' algorithm needed to be created. This was done according to a design document similar to this one, which was supplied to the systems programmers of the UNIX system.

To this point the adaptive optimization has performed without flaw, and appears to be quite robust, with a high degree of tolerance to a wide range of tuning parameters. The UNIX system has also benefited from the extra statistics and meters which the modifications made possible.

To date there is no one thing which can be pointed to with flag waving; there are no spectacular situations in which the optimization really becomes apparent. However, the systems programmers have noted that it is much more difficult, while running 'emacs', to determine that the system is loaded. For the first several months of existence of the optimization, their attempts to sense the loading of the system by their old performance measures always produced much lower loading levels than were actually the case when meters were consulted.

Through rough testing with thrashing programs it is easily possible to bring the disk drives to individual busy levels of 80-92% without significant queue buildup, and in most cases system responsiveness is maintained much better than without the optimization. Significant queue buildup of writes is only rarely noticed, but situations have occurred in which a queue buildup of 150 elements was maintained for a prolonged period, with reportedly good system response.

As a result it is quite desirable to be able to produce better measures of success and tuning than have been available. Certainly we should progress beyond the seat-of-the-pants feeling and get quantitative measures.
Indications to this point are that the optimizations should produce a better system for total system throughput than can be achieved by previous methods, including disk combing, but no hard numbers yet attest to this.

Though the above sections appear to enter into the world of science fiction/fantasy and intelligent machines, this is not really the case. It is merely a situation where the rules provided to the system are interpreted so as to provide a simulacrum of thought in the optimization of the system. The driver does not originate anything; it simply follows the rules provided. The fact that the rules are in some sense a valid mixture of different criteria (apples and oranges?) provides much of the groundwork to enable the system to work. In essence the tuner is not stating 'do this at this time', but instead is laying down conditions which must be fulfilled by the driver, and is able to state these conditions in terms of disk seek priority and queue loadings.
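The driver's side of these rules is small: it weights each physical seek by the multiplier for its IO type and then takes the nearest logical seek. The following is a minimal sketch in Python (not the PL/1 of the disk DIM), assuming the multipliers have already been computed for the current queue loadings. The names Request and next_request, the IO type labels, the tie-breaking rule and the 800-cylinder drive are all hypothetical.

```python
from typing import Mapping, NamedTuple, Optional, Sequence


class Request(NamedTuple):
    io_type: str   # hypothetical label, e.g. "page_read"
    cylinder: int  # target cylinder of the request


def next_request(
    queue: Sequence[Request],
    arm_cylinder: int,
    multipliers: Mapping[str, float],
) -> Optional[Request]:
    """Pick the queued request with the shortest logical seek.

    logical seek = physical seek (in cylinders) * multiplier for the IO type.
    Ties go to the earliest queued request (an assumption of this sketch).
    """
    best = None
    best_length = None
    for request in queue:
        physical = abs(request.cylinder - arm_cylinder)
        logical = physical * multipliers[request.io_type]
        if best_length is None or logical < best_length:
            best, best_length = request, logical
    return best


# Illustrative drive of 800 cylinders: reads fully optimized, writes held back.
multipliers = {"page_read": 1.0, "page_write": 800.0}
queue = [Request("page_write", 101), Request("page_read", 700)]

# The write is one cylinder away but weighs 800; the read weighs 600 and wins.
first = next_request(queue, 100, multipliers)

# A write on the current cylinder has logical length 0 and beats any read.
second = next_request(queue + [Request("page_write", 100)], 100, multipliers)
```

Because a seek of zero cylinders stays at zero under any multiplier, on-cylinder requests are always taken first. An additive offset would not have this property, which is why a multiplier retains on-cylinder optimization and an offset does not.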