Multics Technical Bulletin                                MTB-563
DM: Ordering of Disk I/Os

To:       Distribution

From:     John J. Bongiovanni

Date:     12/03/81

Subject:  Data Management: Ordering of Disk I/Os

1 INTRODUCTION

     The  Multics Data  Management Architecture  will provide for
integrity of data  across system crashes, even in  the case where
Emergency  Shutdown (ESD)  does not  succeed.  The  mechanism for
this  protection is  a set  of journals  which record  before and
after  images  of modified  data  (reference MTB-xxx  for  a full
discussion of  this capability).  In order  for this mechanism to
function  correctly, the  sequence of  I/Os to  the various files
involved must be controlled, as  the system could crash during an
update  of a  data base.   In the  initial implementation  of the
Multics   Data  Management   Architecture,  all   files  will  be
implemented  as  Multics segments,  and hence  the I/Os  to these
files  will be  done by  Page Control.   Page Control  at present
provides  no  general  mechanism  for sequencing  I/Os;  at best,
update protocols  using the force_write capability  could be used
to  this  effect (at  some  expense in  efficiency).   This paper
describes a  design approach for  explicit sequencing of  I/Os by
Page Control.   A brief description of  the functionality desired
is presented  first.  Next, the design  is outlined in sufficient
detail to allow an estimate  of its efficiency.  Some alternative
design  approaches  are  discussed,  followed  by  implementation
considerations.

Send comments on this MTB by one of the following means:

     By Multics Mail, on MIT or System M:
        Bongiovanni.Multics

     By Telephone:
        HVN 261-9314 or (617)-492-9314

_________________________________________________________________

Multics  project  internal  working  documentation.   Not  to  be
reproduced or distributed outside the Multics project without the
consent of the author or the author's management.



                            CONTENTS

                                                         Page

                 1 Introduction . . . . . . . . . . . .     i
                 2 Functionality  . . . . . . . . . . .     1
                 3 Design . . . . . . . . . . . . . . .     2
                    3.1 Overview  . . . . . . . . . . .     2
                    3.2 Call-Side Services  . . . . . .     2
                    3.3 Claim . . . . . . . . . . . . .     3
                    3.4 Done  . . . . . . . . . . . . .     3
                    3.5 Find_Core . . . . . . . . . . .     4
                 4 Alternative Design Approaches  . . .     4
                 5 Implementation Considerations  . . .     5

Multics Technical Bulletin                                MTB-563
DM: Ordering of Disk I/Os

2 FUNCTIONALITY

     Without loss of generality, there  are two files of interest
(each may  be considered a  Multics segment):  a data  base and a
Before Image  Journal.  Prior to  modifying any data  in the data
base,  the previous  values of the  data must be  recorded in the
Before Image  Journal.  To guarantee recovery  across any type of
system  or  process  failure,   it  is  necessary  (although  not
sufficient) that no  modification to the data base  be written to
the  data  base   file  on  disk  until  the   pre-image  of  the
modification  has been  written successfully to  the before image
journal on disk.

     Since the  files of interest  are Multics segments,  data in
the files is written as  pages.  The functionality to be provided
by Page Control to user-ring programs is to link any two pages to
which the user has effective write access, such that Page Control
will not write  the second page to disk until  the first page has
been  written to  disk successfully.   With this  capability, the
following sequence  will guarantee that a  modification to a data
base  is  not written  to  disk before  the  corresponding before
journal image:

   o Build pre-image into in Before Image Journal

   o Call Page  Control to link  data base pages  to Before Image
     Journal page

   o Modify data base pages

Note that the writing of pages to disk remains non-deterministic,
in the sense that pages are still written to disk by page control
based  on  the  demand  for  real  memory  pages  in  the system.
Although the ordering of some  of these writes is controlled, the
writes occur at random times.

     It is possible that before images for a given data base page
may reside in  several Before Image Journal pages,  and this case
must be handled correctly.  The sequence of events is as follows:

   o Build  pre-image  of  data  base page  A  into  Before Image
     Journal page 1

   o Call Page Control  to link data base page  A to Before Image
     Journal page 1

   o Modify data base page A

   o Build  pre-image  of  data  base page  A  into  Before Image
     Journal page 2


MTB-563                                Multics Technical Bulletin
                                        DM: Ordering of Disk I/Os

   o Call Page Control  to link data base page  A to Before Image
     Journal page 2

   o Modify data base page A

     The   difficulty   with   this    case   arises   from   the
non-determinism  of writing  pages to disk.   At the  time of the
second call  to Page Control  above, Before Image  Journal page 1
may not have been written to disk (in fact, the write I/O may not
have  been  queued).   In  this circumstance,  Page  Control must
ensure that  data base page A  is not written to  disk until both
Before  Image  Journal pages  (1  and 2)  have been  written.  No
additional  information is  required by  Page Control,  since the
case is easily detected (at the second call to Page Control, data
base page A would be linked to a different page).

3 DESIGN

3.1 Overview

     There  is a  Core Map Entry  (CME) for  each physical memory
page on a system.  Linked lists  of CME's may be formed such that
no page in  a list (other than the head)  will be written to disk
until the head of the list  has been written to disk successfully
(i.e., without  errors).  Intuitively, the head  of a linked list
represents  a  Before  Image  Journal  page,  and  the  remaining
elements  of  the linked  list  represent data  base  pages whose
pre-images are  in the Before  Image Journal page at  the head of
the  list.   CME's  are  linked  into  list  at  the  request  of
user-ring.  When  a write I/O  for a page  at the head  of a list
completes  successfully,  the  associated list  is  unlinked.  In
general,  pages are  written to disk  by the  normal Page Control
mechanisms, except that the writing of certain pages is skipped.

3.2 Call-Side Services

     There   is   one  call-side   service   provided--viz.,  the
capability to  link page A to  page B so that page  A will not be
written  to disk  until page  B has  been written  to disk.  This
capability is implemented as follows:

   o Validate that  the process has  write access to  both page A
     and  page  B  at  the  current  validation  level.   If this
     validation fails, return an error code.

   o Touch page A.


Multics Technical Bulletin                                MTB-563
DM: Ordering of Disk I/Os

   o Lock  the  Page Table  Lock.  If  page A  is not  in memory,
     unlock the  Page Table Lock,  and repeat the  previous step.
     This  is  non-deterministic, but  eventually the  Page Table
     Lock will be held with page  A in memory.  In practice, page
     A  will already  be in  memory when  touched in  ring-0 (the
     previous  step),   due  to  the   protocol  described  above
     ("Functionality").

   o If page A  is at the head of a  linked list of CME's, unlock
     the Page Table Lock, and return an error code.

   o If  page B  is not in  memory and modified,  unlock the Page
     Table Lock and return.

   o If there  is a write I/O  in progress to page  B, unlock the
     Page Table  Lock, wait for  the I/O to  complete, and repeat
     the previous steps.

   o If there  is a write I/O  in progress to page  A, unlock the
     Page Table  Lock, wait for  the I/O to  complete, and repeat
     the previous steps.

   o If page A is a member of  a linked list of CME's, initiate a
     write I/O to the page at the head of that linked list (if an
     I/O is not in progress already), unlock the Page Table Lock,
     wait for the I/O to complete, and repeat the previous steps.
     An I/O  should not be initiated  if the page at  the head of
     the  linked  list  is  page  B  (that  is,  there  have been
     successive calls to link a data base page to the same Before
     Image Journal page).

   o Link page  A to page B  as the second element  of the linked
     list whose head is page B.

   o Unlock the Page Table Lock and return.

3.3 Claim

     Claim is  the service of Page  Control which writes modified
pages to disk (it is called during page fault processing).  Claim
must be  changed to skip pages  which belong to a  linked list of
CME's and are not at the head of the list.

3.4 Done

     Done is the service of Page  Control which is called to post
an interrupt for  a page I/O.  Done must  be changed to recognize


MTB-563                                Multics Technical Bulletin
                                        DM: Ordering of Disk I/Os

the completion  of a write I/O  to a page which is  the head of a
linked list of CME's.  When this happens, the list is unlinked.

3.5 Find_Core

     Find_Core  is the  service of  Page Control  which selects a
page to  evict from memory  when a page  fault occurs and  a page
must be read from disk to satisfy the page fault.  Find_Core must
be changed not to evict a page which is a member of a linked list
of CME's.

4 ALTERNATIVE DESIGN APPROACHES

     If it is frequently the case  that pre-images of a data base
page  are  recorded in  several Before  Image Journal  pages, the
approach   outlined  above   could  cause   unacceptable  delays.
Consider the following example:

   o A pre-image for data base page A is recorded in Before Image
     Journal page 1.

   o Page Control is called to link page A to page 1.

   o A pre-image for data base page A is recorded in Before Image
     Journal page 2.

   o Page Control is called to link page A to page 2.

If, at this point, page 1 has not been written to disk, the write
I/O  will  be  requested,  and  the  process  will  wait  for the
completion of the I/O.  The delay will be approximately the queue
time plus the device time for  the I/O, which is dependent on the
system load.  This  delay can be reduced by  the following means,
some of which are modifications to the design presented above.

   o When Page Control links page A  to page B, it could initiate
     the  write I/O  for page B  at that point.   Since there are
     (presumably)  many data  base pages associated  with a given
     Before  Image  Journal  page, this  approach  would generate
     extraneous I/Os.

   o When the Before Image Journal  overflows a page, the journal
     manager could  force-write the page.  This  would reduce the
     delay without generating extraneous I/Os.

   o A  linked list  could represent  a strict  ordering of I/Os.
     That is, if a linked list of CME's contained pages A, B, and
     C;  page  B  would not  be  written  until page  A  had been


Multics Technical Bulletin                                MTB-563
DM: Ordering of Disk I/Os

     written, and  page C would  not be written until  page B had
     been written.   In the example above,  the linked list would
     consist first of (page 1, page A), and then of (page 1, page
     2,  page A).   This approach would  reduce I/O simultaneity,
     probably reducing efficiency.

   o A  more complex  linked list  structure could  be developed,
     under which a given page could be the head of one list and a
     member of  another list.  In the  example above, there would
     be two linked lists, one consisting of (page 1, page 2), and
     the other consisting of (page  2, page A).  The problem with
     this  approach  is  that  it would  require  the  CME  to be
     expanded to contain the additional link information.

5 IMPLEMENTATION CONSIDERATIONS

     The necessity  to wire the  ring-0 stack before  locking the
Page   Table  Lock   could  introduce   substantial  delays.   An
alternative  implementation  is  to  switch  to  the  PRDS before
locking the Page  Table Lock.  This would require  a more complex
implementation, since  none of the  waits involved could  be done
using the PRDS as a stack.   The additional complexity may not be
significant,  as none  of the waits  could be done  with the Page
Table Lock held.

     The sofware which manages the Before Image Journal must take
care to  modify pages in the  Before Image Journal in  such a way
that  a crash  at any  time leaves  the page  image on  disk in a
consistent state.  Since I/O is asynchronous to the processing of
this software, modifications to  Before Image Journal pages could
occur  while  a previously  requested write  I/O is  in progress.
Careful  programming  can  eliminate   adverse  effects  of  this
asynchrony.