Multics Technical Bulletin                                MTB-564

To:  Distribution

From:  André Bensoussan

Date:  02/04/83

Subject:  Phasing Page Control and Before Journal

ABSTRACT

     This  MTB  describes how  Page  Control and  Data Management
cooperate in implementing the protocol  known as the "Write Ahead
Log" (WAL) protocol.

     When a data management file is modified, a "Before Image" is
logged  in a  Before Journal;  that is,  the portion  of the file
about to be  modified is saved in the journal,  and used later to
undo the modification  if a rollback is requested.   In order for
the rollback to operate properly even after an emergency shutdown
failure, it  is necessary to  hold the data  base modification in
main  memory  until  its  associated  before  image  is  actually
physically  written  to disk.   This  is the  essence of  the WAL
protocol.

     Since the first implementation of data management files will
be done using Multi Segment Files,  whose pages are moved to disk
by Page Control, the enforcement  of this protocol cannot be done
without  Page  Control's participation.   This MTB  describes the
respective  responsibilities  of  Page Control,  File Manager and
Before  Journal  Manager in  their  contract to  enforce  the WAL
protocol.

_________________________________________________________________

Multics  project  internal  working  documentation.   Not  to  be
reproduced or distributed outside the Multics project.


Comments should be sent to the author:

via Multics Mail:
   Bensoussan.Multics on System M.

via US Mail:
   André Bensoussan
   Honeywell Information Systems, Inc.
   575 Tech Square
   Cambridge, Massachusetts 02139

via telephone:
   (HVN) 261-9334, or
   (617) 492-9334



                             CONTENTS

                 Abstract
                 1 Introduction
                 2 Abbreviations
                 3 Background information
                 4 Description of the protocol
                    4.1 Before Journal Manager protocol
                    4.2 File Manager protocol
                    4.3 Page Control protocol
                 5 Extension to several Before Journals

1 INTRODUCTION

     In the first release of  the new Data Management, files will
still be implemented as MSF's and their pages will be written out
at page control's discretion.

     In order to be able to undo a set of modifications done by a
transaction,  the  Data  Management  uses  the  "Before  Journal"
technique:  Before modifying any portion  of a file, its original
value is recorded in a so-called "Before Image" (BI), appended as
a  logical  record  to  a  sequential  file  called  the  "Before
Journal".  If a  modified page is written out  to disk before its
before  image  is safe  on disk,  the rollback  mechanism becomes
vulnerable to a system crash with ESD failure.

     This  MTB describes  the method  used to  make  Page Control
cooperate  with Data  Management in  such a  way as  to have Page
Control  write out  data pages  to disk  only after  their before
images are safe on disk.

     If this can be achieved,  it gives the recovery mechanism of
the Data  Management an enormous advantage:  it can roll back all
unfinished  transactions  EVEN  AFTER  A  SYSTEM  CRASH  WITH ESD
FAILURE.

     If  it  could not  be achieved,  recovery after  ESD failure
would require reloading  the files that were open  at the time of
the crash, using their last  dumps, and applying all after images
recorded  in  the after  journal(s).   This is  a  very expensive
procedure compared to rolling back unfinished transactions.

     A  different  proposal to  achieve  the same  goal  has been
described  in  MTB-563:   "Data  Management:   Ordering  of  disk
I/O's", but has not been  implemented.  The method implemented in
the  Data Management  System for MR10 is the  method explained in
this memo.

2 ABBREVIATIONS

The following abbreviations are used in this document:

     BJM = Before Journal Manager
     BI  = Before Image
     CI  = Control Interval
     FM  = File Manager
     ESD = Emergency Shut Down
     MSF = Multi Segment File


3 BACKGROUND INFORMATION

     When the  before journal manager  is called to  journalize a
before  image,  it enters  the  before image  information  in the
current CI  of the journal, but  it does not write  the BI out to
disk at the time it records it.  The CI is, in fact, a page of an
MSF and it will be written out to disk by Page Control.  However,
when a transaction commits, the before journal manager causes all
CI's of the before journal to  be flushed (written to disk) up to
the  CI  containing  the  last  BI  generated  by  the committing
transaction, and waits for these I/O's to complete.

     The BJM is not informed each  time a CI (page) of the before
journal has  been written to disk;  the I/O interrupt  is handled
by Page Control.   The BJM can,  however, keep track of  the last
control interval up to which the journal is completely on disk by
updating  this  value each  time it  requests the  journal to  be
flushed.


4 DESCRIPTION OF THE PROTOCOL

     Let us assume  that there is only one  before journal in the
system;  the  extension  to  several journals  is  simple  and is
discussed at the end of this document.

     It is  convenient, for the  description of the  protocol, to
use the following definitions:

   o A BI is "safe" if it is completely on disk, and all previous
     BI's are also safe.  A BI is "unsafe" if it is not safe.

   o A CI of the before journal is "safe" if it is completely  on
     disk, and all previous CI's of the journal are also safe.  A
     journal CI is "unsafe" if it is not safe.

     Conceptually,  the  journal  can  be  broken  up  into   two
contiguous parts:  a safe part, which contains all the safe BI's,
followed by  an unsafe part, which  contains all  the other BI's,
still  unsafe.  The  line that separates  the two  parts may very
well fall in the middle of a  safe CI, if it happens that this CI
contains a portion of a still unsafe BI.
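
     The two definitions above can be illustrated with a minimal
sketch.  It assumes, for illustration only, that CI's are numbered
from 0 and that "on disk" means the current contents of the CI
have been written; the function names are hypothetical and do not
come from the Data Management code.

```python
def last_safe_ci(on_disk, ci_count):
    """Return the number of the last safe CI, or -1 if no CI is safe.

    on_disk is the set of CI numbers whose current contents are
    completely on disk (hypothetical 0-based numbering).
    """
    last = -1
    for ci in range(ci_count):
        if ci not in on_disk:
            break          # the first CI not on disk ends the safe part
        last = ci
    return last


def bi_is_safe(bi_end_ci, on_disk, ci_count):
    """A BI is treated as safe when the CI holding its end is safe.

    A safe CI implies that every earlier CI is on disk, hence that
    every earlier BI is safe as well.
    """
    return bi_end_ci <= last_safe_ci(on_disk, ci_count)
```

In this sketch a BI ending in CI 3 remains unsafe while CI 2 is
missing from disk, even though CI 3 itself has been written.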

     If each BI were time stamped at the time it is entered in the
journal,  the  time stamp  of the  last safe  BI would  always be
higher than the time stamp of any other safe BI, and always lower
than the time stamp of any unsafe BI.  If, in addition, each data
page modified and  in main memory had the time  stamp of the last
BI  associated  with its  modification, it  would be  possible to
determine if the data page could be  written out to disk or if it
had to  be held in main  memory, until its BI  becomes safe.  The
proposed method can be sketched as follows:

  o The BJM  maintains the time  stamp of the  last safe BI  in a
    wired down location available for Page Control to examine.

  o The FM stores in the standard header of each file CI the time
    stamp of the BI produced the last time the CI was modified.

  o Page Control writes  out a file CI only if  the time stamp in
    the CI header  is smaller than or equal to  the time stamp of
    the last safe BI maintained by the BJM.
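
     The write-out rule in the last item can be sketched as
follows.  The DataPage type and the integer time stamps are
assumptions made for the example; the real time stamp format is
not specified here.

```python
from dataclasses import dataclass


@dataclass
class DataPage:
    name: str
    bi_time_stamp: int  # time stamp of the BI for the page's last modification


def writable_pages(pages, last_safe_bi_time_stamp):
    """Return the names of the pages that may be written to disk.

    A page qualifies when its time stamp is smaller than or equal to
    the time stamp of the last safe BI; all others stay in memory.
    """
    return [p.name for p in pages
            if p.bi_time_stamp <= last_safe_bi_time_stamp]
```

With pages stamped 10, 20 and 30 and a last safe time stamp of 20,
the first two pages may be written and the third is held.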


4.1 Before Journal Manager protocol

a.  When recording a BI:

  o Record the BI, starting at the current position in the before
    journal; the BI may span several CI's.

  o Generate a time stamp for this BI (the time stamp need not be
    recorded in the BI).

  o For each unsafe  CI, the BJM remembers the  time stamp of the
    last BI that  will become safe when the  CI becomes safe.  In
    order to do so, the BJM  associates the time stamp of this BI
    with the CI that happens to contain the end of the BI.

  o Return the time stamp of the BI to the caller, i.e., the FM.

b.  When committing:

  o The BJM remembers the last safe  CI from the last commit.  It
    knows  the CI  number n  in which  the committing transaction
    produced its last BI.  It causes the journal to be flushed up
    to  CI n,  and waits for  completion of all  I/O's.  When all
    I/O's are completed,  CI n becomes safe, as  well as all BI's
    entirely contained in the flushed CI's.

  o The  BJM kept  track of  the time stamp  of the  last BI that
    would  become safe  when CI n  would become  safe.  It stores
    this  time  in the  wired down  location containing  the time
    stamp of the last safe BI of  the journal, to be used by Page
    Control.
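
     Steps a and b can be sketched as below.  This is a model
under stated assumptions, not the BJM implementation:  the journal
is reduced to a byte position, time stamps come from a counter
instead of the system clock, and the flush is a placeholder.  All
names are hypothetical.

```python
import itertools


class BeforeJournalManager:
    def __init__(self, ci_size):
        self.ci_size = ci_size            # bytes per CI (assumed unit)
        self.position = 0                 # next free byte in the journal
        self.clock = itertools.count(1)   # stand-in for the system clock
        self.ci_time_stamp = {}           # CI number -> stamp of last BI ending there
        self.last_safe_ci = -1
        self.last_safe_bi_time_stamp = 0  # models the wired down location

    def record_bi(self, length):
        """Append a BI of `length` bytes; return (time stamp, CI of its end)."""
        self.position += length           # the BI may span several CI's
        end_ci = (self.position - 1) // self.ci_size
        time_stamp = next(self.clock)
        # Remember, for the CI holding the end of the BI, the last BI
        # that becomes safe when that CI becomes safe.
        self.ci_time_stamp[end_ci] = time_stamp
        return time_stamp, end_ci

    def commit(self, n):
        """Flush the journal up to CI n, then publish the safe time stamp."""
        self._flush_and_wait(max(self.last_safe_ci, 0), n)
        self.last_safe_ci = max(self.last_safe_ci, n)
        stamps = [ts for ci, ts in self.ci_time_stamp.items()
                  if ci <= self.last_safe_ci]
        if stamps:
            self.last_safe_bi_time_stamp = max(stamps)

    def _flush_and_wait(self, first_ci, last_ci):
        """Placeholder: a real system would write the CI's and await the I/O."""
```

For example, with 100-byte CI's, BI's of 60, 80 and 250 bytes end
in CI 0, CI 1 and CI 3.  After commit(1) the published time stamp
is that of the second BI; the third BI starts in CI 1 but is still
unsafe.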

4.2 File Manager protocol

  o Before modifying a  CI of a protected file,  the FM calls the
    BJM to record the necessary  BI information and gets back the
    time stamp of the BI generated by the BJM.

  o It then stores this time stamp  in the standard header of the
    CI about to be modified.

  o Only then can it start modifying the control interval.

Note -- The standard CI header  contains the time the CI was last
modified.   The  BI  time stamp  can also  serve as the time last
modified.
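
     The ordering of the three steps can be sketched as follows.
StubBJM, ControlInterval and modify_ci are hypothetical names
used only for this illustration; the stub merely hands out
increasing time stamps and keeps the before images in a list.

```python
class StubBJM:
    """Hypothetical stand-in for the BJM."""

    def __init__(self):
        self.next_time_stamp = 1
        self.journal = []

    def record_bi(self, before_image):
        time_stamp = self.next_time_stamp
        self.next_time_stamp += 1
        self.journal.append((time_stamp, before_image))
        return time_stamp


class ControlInterval:
    def __init__(self, data):
        self.header_time_stamp = 0    # models the standard CI header field
        self.data = bytearray(data)


def modify_ci(bjm, ci, offset, new_bytes):
    """Modify a CI of a protected file in the order the protocol requires."""
    end = offset + len(new_bytes)
    old_bytes = bytes(ci.data[offset:end])
    # 1. Record the before image first and obtain its time stamp.
    time_stamp = bjm.record_bi((offset, old_bytes))
    # 2. Store the time stamp in the CI header.
    ci.header_time_stamp = time_stamp
    # 3. Only then modify the control interval.
    ci.data[offset:end] = new_bytes
    return time_stamp
```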


4.3 Page Control protocol

     Page Control must be  able to know that a page is  a CI of a
protected  file.  The  FM, when creating  an MSF  component for a
protected  file,  will set  the  "protected file  switch"  (a new
switch) in the VTOC entry.  At segment activation, this switch is
copied into  the ASTE.  With  this assumption, Page Control would
have to do the following:
to do the following:

  o When Page Control decides to write  out a page, it should now
    check in  the ASTE if the  page is part of  a protected file.
    If not, it proceeds as it does today.

  o If the page does belong to  a protected file, it compares the
    time stamp stored in the CI with the highest safe time stamp.
    If it  is greater, the  page must not be  written out because
    its BI is not safe yet; if it is not greater, the page may be
    written out, but first its PTW must be faulted to prevent any
    new modification of the page while it is being written out.

This  protocol must  be followed by  all programs  that write out
pages to disk, that is:

   - by Page Control in the normal case
   - by the ESD procedure, and
   - by the program that flushes memory every 15 minutes.
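
     The decision every such program has to make can be sketched
as below.  The Page type models only the fields the decision
needs (the protected file switch from the ASTE, the time stamp in
the CI, and whether the PTW has been faulted); the names are
hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    WRITE = "write"
    HOLD = "hold"


@dataclass
class Page:
    protected: bool           # protected file switch, as seen in the ASTE
    ci_time_stamp: int = 0    # time stamp stored in the CI
    ptw_faulted: bool = False


def decide_write(page, highest_safe_time_stamp):
    """Decide whether a page may be written out now."""
    if not page.protected:
        return Decision.WRITE          # unprotected: proceed as today
    if page.ci_time_stamp > highest_safe_time_stamp:
        return Decision.HOLD           # its BI is not safe yet
    # Fault the PTW first so the page cannot change during the write.
    page.ptw_faulted = True
    return Decision.WRITE
```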

Since page control makes the decision to defer the writing out of
a page using non ring zero information, it must rely on some kind
of  safety valves  to prevent  the pressure  on main  memory from
becoming too high.

  o First, it could  validate time stamps found in  data pages as
    well as  the time stamp  associated with the  before journal;
    all time stamps must be smaller than the current time.

  o Next, Page Control could inform BJM  each time it has to skip
    a  page by  adding 1  to a  count associated  with the before
    journal.  This causes  the BJM to flush the  journal when the
    count  becomes  "too  high,"   instead  of  waiting  until  a
    transaction commits to do it.

  o Finally, if it happens that the  BJM has not been invoked for
    a  long  time, the  count may  increase beyond  its threshold
    value  without  triggering  any corrective  action.   In this
    case, page control should have  a way to force the invocation
    of the BJM to flush the journal.
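
     The three safety valves can be sketched as follows.  The
threshold values, the resetting of the count after a flush, and
all names are assumptions made for the illustration; this memo
does not fix them.

```python
class BeforeJournalState:
    """Hypothetical bookkeeping shared by Page Control and the BJM."""

    def __init__(self, threshold):
        self.threshold = threshold  # the "too high" limit (assumed value)
        self.skip_count = 0         # pages Page Control had to skip
        self.flushes = 0            # flushes performed, for the example

    def flush(self):
        """Stand-in for flushing the journal; assumed to reset the count."""
        self.flushes += 1
        self.skip_count = 0


def time_stamp_is_valid(time_stamp, now):
    """Valve 1: every time stamp must be smaller than the current time."""
    return time_stamp < now


def page_skipped(state):
    """Valve 2, Page Control side: count each page that had to be skipped."""
    state.skip_count += 1


def bjm_invoked(state):
    """Valve 2, BJM side: flush early when the count is too high."""
    if state.skip_count > state.threshold:
        state.flush()


def page_control_check(state, force_limit):
    """Valve 3: force a flush if the BJM has not been invoked in time."""
    if state.skip_count > force_limit:
        state.flush()
```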


5 EXTENSION TO SEVERAL BEFORE JOURNALS

     If there are more than one before journal, the BJM maintains
an  array  of  safe  time stamps,  one  for  each  journal.  When
returning the time stamp of the  BI, it also returns the index of
the journal, which is stored in the CI header with the time stamp
by  the  FM; Page  Control  then uses  this  index to  access the
appropriate time stamp in the array.
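
     With several journals the test performed by Page Control
changes only in how the safe time stamp is found.  A minimal
sketch, assuming the array is a simple list indexed by journal
number (the function name is hypothetical):

```python
def may_write_out(ci_time_stamp, journal_index, safe_time_stamps):
    """Compare the CI's time stamp with the safe time stamp of its journal.

    journal_index is the index the FM stored in the CI header along
    with the time stamp; safe_time_stamps holds one entry per journal.
    """
    return ci_time_stamp <= safe_time_stamps[journal_index]
```

A CI stamped 20 is held if its journal's safe time stamp is 10,
but may be written if it belongs to a journal whose safe time
stamp is 25.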