Multics Technical Bulletin                                MTB-550
DM: index_manager_ design

To:       Distribution

From:     Lindsey Leroy Spratt

Date:     01/12/83

Subject:  Data Managment: Index Manager Design


     This document describes various facets of the implementation
of the index manager.


Multics  project  internal  working  documentation.   Not  to  be
reproduced or distributed outside the Multics project.

MTB-550                                Multics Technical Bulletin
                                        DM: index_manager_ design

Comments should be sent to the author:

via Multics Mail:
   Spratt.Multics on either MIT Multics or System M.

via US Mail:
   Lindsey Spratt
   Honeywell Information Systems, inc.
   575 Tech Square
   Cambridge, Massachusetts 02139

via telephone:
   (HVN) 261-9321, or
   (617) 492-9321



                 1 Abstract . . . . . . . . . . . . . .     i
                 2 Introduction . . . . . . . . . . . .     1
                 3 B-Trees  . . . . . . . . . . . . . .     1
                 4 Index manager utilities  . . . . . .     2
                 5 Storage techniques . . . . . . . . .     2
                    5.1 Index control intervals . . . .     3
                    5.2 Key structure . . . . . . . . .     3
                    5.3 Key prefixes  . . . . . . . . .     4
                    5.4 Determining an appropriate set
                     of prefixes  . . . . . . . . . . .     6
                 6 Interface for manipulating keys in
                  control intervals . . . . . . . . . .     6
                 7 Key layout . . . . . . . . . . . . .     6
                    7.1 The field_table structure . . .     7
                 8 Allocation of keys . . . . . . . . .     9
                    8.1 Allocation of keys to control
                     intervals  . . . . . . . . . . . .     9
                       8.1.1 Rotating keys  . . . . . .     9
                       8.1.2 Splitting control
                        intervals . . . . . . . . . . .    10
                       8.1.3 Overlength keys  . . . . .    10
                    8.2 Allocation of a key within a
                     control interval . . . . . . . . .    10
                 9 The search algorithm . . . . . . . .    11
                    9.1 Basic approach  . . . . . . . .    11

Multics Technical Bulletin                                MTB-550
DM: index_manager_ design


     The  index_manager_ subsystem  maintains a  "sorting" index,
where the keys of which the index is composed may be composed, in
turn, of several fields of  differing data types.  Strategies are
implemented which reduce the number  of pages touched in a search
of the index, and which  reduce the overall storage requirements.
The  algorithm  for  building  and  maintaining  the  index  is a
modified form of the B-tree  algorithm.  The storage reduction is
due  to the  use of a  "prefix extraction"  algorithm.  The basic
concept behind  this is to store  data only once which  is at the
beginning of several keys.  The data which is stored only once in
this fashion is a "prefix".


     In a survey of B-tree algorithms, "The Ubiquitous B-Tree" by
Douglas Comer in Computing Surveys,  Vol.  11, No.  2, June 1979,
the  various  B-tree algorithms  are  divided into  several major
categories; the  basic B-tree, the B*-tree  and the B+-tree.  The
technique employed by index_manager_ is in the B+-tree category.

     There are  two basic ideas  in the B-tree.   First, keys are
allocated in "control intervals"  or "pages".  A control interval
is a piece  of storage which can be read  from or written to disk
in a single  operation.  Consecutive keys are placed  in the same
control interval.  If  a key wont fit in  the control interval of
the keys  which are to  either side of  it, then it  is placed in
another  control  interval which  is pointed  to by  the "branch"
pointer which lies between the lesser and greater keys which this
key  was  to be  placed  between.  "Nodes"  in  a B-tree  are the
control intervals in which keys are placed.

     The second idea is to allocate the storage of keys in such a
way as to keep the tree of control intervals, or nodes, balanced.
The nodes at  the bottom of the tree are  the leaf nodes.  In the
basic  B-tree, a  key can be  stored in  a branch node  or a leaf
node, depending on  where there happens to be  room and where the
key falls in the sort of the keys already in the index.

     The   B+-tree   has   the   primary   difference   that  all
user-specified keys  are stored in  the leaf nodes.   The keys in
the branch nodes  serve only to provide a roadmap  to the keys in
the  leaves.   The  leaf  nodes  are  linked  together  to  speed
sequential access of the keys.  Since  a key in a branch node may
have  any value  as long as  its value  discriminates between the
keys which are its children, the primary criterion for choosing a
value  for  a branch  key  is to  choose  the shortest  value.  A
convenient technique for determing the  shortest value is to find

MTB-550                                Multics Technical Bulletin
                                        DM: index_manager_ design

the common prefix which the  two keys being discriminated between
have (which may  well be nothing) and add one  more piece of data
beyond the prefix from the higher  of the two keys.  Depending on
the data type involved, this one more piece of data may be a bit,
a  character,  or an  entire field.   Comer refers  to this  as a
"Prefix  B+-tree".   ("Prefix extraction",  as used  elsewhere in
this MTB, refers  to a technique of key  compression and the term
"prefix"  will  henceforth  only be  used  to refer  to  this key
compression technique and not in the sense which Comer uses it.)


     The index manager uses three utilities, collection_manager_,
data_mgmt_util_,   and  vector_util_.    The  storage  management
utility, collection_manager_,  deals directly with  the page file
manager and talks in terms of "elements" of the index collection.
These  elements  are  undifferentiated   bit  strings,  they  are
allocated  in "control  intervals".  A  control interval  is 1024
words long, one  Multics page.  At some point  in the future this
size will become settable on a per-page file basis, approximately
in the  range of 512  to 4096 words.   The collection_manager_ is
documented  in  MTBs 551,  "Data Management:   Collection Manager
Functional   Specification.",   and    552,   "Data   Management:
Collection Manager Design.".

     The   key   data  is   presented  to   the  caller   of  the
index_manager_ as a typed_vector structure (see MTB-541 "Toward a
unification  of  data  manipulation  on  Multics."   for  a brief
discussion  of data  vectors).  The  vector_util_ module provides
tools for creating and manipulating vectors of data.

     The  internal  representation of  keys  is as  a continuous,
peculiarly  formatted,   bit  string.   This   representation  is
sometimes referred to as the "string" representation, in contrast
to  the  "vector"  representation  used to  communicate  with the
caller  of the  index_manager_.  Converting  between the "vector"
representation and the  "string" representation, and manipulating
and interpreting the contents  of the "string" representation are
all  functions performed  by data_mgmt_util_ (which  is also used
for this purpose by the record_manager_).


     Keys   are   stored   as  elements   in   collections  using
collection_manager_.   The  elements  are  a  storage abstraction
managed as  undifferentiated bit strings.   The control intervals
in  an index  collection are  either branch  control intervals or
leaf control  intervals.  The branching  keys stored in  the same

Multics Technical Bulletin                                MTB-550
DM: index_manager_ design

control interval may have some amount  of data in common in their
"most  significant  bits".  The  optimum  set of  common prefixes
between keys in a control interval is used to minimize storage.

5.1 Index control intervals

     Each  of the  two kinds  of index  control intervals  has an
associated data structure.  The  branch control interval uses the
branch_ci_header  structure and  the leaf  control intervals uses
the  leaf_ci_header  structure.    These  control  intervals  are
threaded together in a tree.  Each level of the tree, in addition
to being threaded to its  parents, is threaded in a doubly-linked
list to its immediate siblings to aid sequential access.

     The  headers  are stored  as  elements with  slot  number 1.
Every control interval  in an index collection has  either a leaf
or   a   branch   header.     The   headers   are   declared   in
dm_im_ci_header.incl.pl1 as shown below:
     dcl     1 common_ci_header     based (common_ci_header_ptr),
               2 flags              unaligned,
                 3 is_leaf          bit (1) unaligned, /* ON for leaf_ci. */
                 3 pad              bit (17) unaligned,     /* Must be zero. */
               2 key_tail_space_used_since_last_prefix_compaction
                                    fixed bin (18) unsigned unal,
               2 key_range          unaligned,
                 3 first            fixed bin (18) unsigned,
                 3 last             fixed bin (18) unsigned,
               2 parent_id_string   bit (36) aligned,
               2 previous_id        fixed bin (24) unsigned unaligned,
               2 next_id            fixed bin (24) unsigned unaligned,
               2 pad                bit (24) unaligned;

     dcl     1 leaf_ci_header       based (leaf_ci_header_ptr),
               2 common             like common_ci_header;

     dcl     1 branch_ci_header     based (branch_ci_header_ptr),
               2 common             like common_ci_header,
               2 low_branch_id      fixed bin (24) unsigned unaligned,
               2 pad                bit (12) unaligned;

5.2 Key structure

     The   keys  are   allocated  within   the  control  interval
structures.  Every  key which is inserted  into the index appears
in its entirety  in the leaf control intervals.   The data in the
branch control  intervals is only  present to guide  the searches
against the keys.  The values of the branching keys are chosen to

MTB-550                                Multics Technical Bulletin
                                        DM: index_manager_ design

be  the minimum  necessary length  to differentiate  between leaf
keys.   Every  branching key  has  a branch  control  interval id
associated with  it.  This identifies the  control interval which
contains keys  which are between  the branching key  and the next
higher branching key in the same control interval.

     There are  different key structures  for the branch  and the
leaf control intervals.  The keys in the branch control intervals
are formatted according to the  branch_key structure, the keys in
the  leaf  control  intervals  are  formatted  according  to  the
leaf_key structure.  In both cases,  the field_table must be used
to  decode the  "contents" of  the key.   The key  structures are
declared in dm_im_key.incl.pl1 as shown below:

     dcl     1 leaf_key             based (leaf_key_ptr) unaligned,
               2 string             bit (lk_string_length) unal;

     dcl     1 branch_key           based (branch_key_ptr) unaligned,
               2 branch_id          fixed bin (24) unsigned unaligned,
               2 last_field_idx     fixed bin (12) unaligned unsigned,
               2 string             bit (bk_string_length) unal;

5.3 Key prefixes

     There  may  be several  prefix  groupings of  keys,  where a
"prefix grouping" is  all of the keys in  a control interval with
the  same prefix.   This prefix  is stored  as an  element of the
branch_index_control_interval,  with   an  element_position_table
slot less than  the value of first_key_in_element_position_table.
If there  aren't enough slots,  the keys have to  be shifted down
one  in  the  element_position_table  to make  room  for  the new
prefix.     The    prefixes    are    placed    first    in   the
element_position_table because they are less volatile than keys.

           dcl 1  prefix            based (prefix_ptr)
               2 first_slot
                                    fixed bin(12)
                                    unsigned unaligned,
               2 last_slot
                                    fixed bin(12)
                                    unsigned unaligned,
               2 number_of_fields fixed bin (17) unaligned,
               2 final_field_is_fragment
                                    bit(1) unaligned,
               2 pad                bit(5) unaligned,
               2 contents           bit  (prefix_contents_length);

     Only the "tails" of the  branching keys need be stored, now,

Multics Technical Bulletin                                MTB-550
DM: index_manager_ design

since the  entire branching key  can be reconstructed  at need by
prefixing the "tail" portion with  its prefix.  It is possible to
break a prefix and a tail across  a field of the key if the field
is of a string data-type.

     Since the prefix may include any number of fields, from none
to all, the number of fields  actually present in the prefix must
be stored with the prefix.  If  there are N fields present in the
prefix, they  must be the fields  with field ids of  1 through N.
Also,  a bit  is set  to indicate  whether the  Nth field  of the
prefix is  complete, or a  fragment completed in each  of the key
tails.  Since  only the Nth  field can be  fragmented, there only
needs to be one "field_is_fragmented" bit.

     The worst case  possible for a set of  keys, with respect to
number of keys  having a common prefix, is  0.  However, a subset
may  exist for  which there is  a common prefix.   This prefix is
"factored" out of  the keys This is done for  each subset of keys
with a common prefix in the control interval.  In this fashion it
is  possible to  have several prefixes  for the keys  in a single
control interval (each key having only one prefix, of course).

     Since  the  intent of  the  prefixing technique  is  to save
storage space by not replicating  data, it is appropriate to only
do  prefix-extraction  when space  will  actually be  saved.  The
space saved  is the amount  of space the replication  of the data
would take minus the amount of space for storing the prefix, or:

          = replication_space_use - prefix_space_use,

          = (prefix_length * number_of_keys_with_prefix)
               - (prefix_overhead + prefix_length),

          = prefix_length * (number_of_keys_with_prefix - 1)
               - prefix_overhead.

     Prefixes  are only  extracted if  the prefix_space_saving is
great enough to make it worthwhile.  This means not only that the
prefix_space_saving must be greater than  0, but it must be great
enough to compensate for the  extra indirection (to construct the
whole key).  Prefixes can provide a small savings in computation,
when two  successive comparisons are  against keys with  the same
prefix.  In this situation, the result of the previous comparison
can be  re-used.  In general,  however, the use  of prefixes will
increase cpu cost.

     Prefixes are determined under two circumstances.  When a key
is being inserted,  it is known (from the  search process used to
determine where to insert the key)  what, if any, of the existing

MTB-550                                Multics Technical Bulletin
                                        DM: index_manager_ design

prefixes apply to  it.  The appropriate value is  recorded in the
previous_prefix_index  variable.   The  other   case  is  when  a
compaction is being done of the entire control interval.

     Compaction is  done when there  is insufficient room  in the
free space  pool to do  an insertion.  In  the simplest situation
compaction is only used to  recover scatterred free space left by
deleting  key   contents.   Prefix  extraction   is  done  during
compaction,  but  only  if  scattered  free  space  compaction is
insufficient                        and                       the
key_tail_space_used_since_last_prefix_extraction is  greater than
or equal  to the reqired  amount of space  to successfully insert
the new  key.  This will  require rebuilding the  set of prefixes
and the key "tails".

5.4 Determining an appropriate set of prefixes

     The algorithm for  selecting the set of prefixes  to be used
within a given control interval starts  with a sorted list of all
of the keys  in the control interval.  This  list is passed over,
from  first key  to last, building  a list of  prefixes which are
applicable to the  keys seen so far, testing  the prefixes of the
list against the  current key to keep the  current prefix list up
to date.  When done with this set of comparisons, there is a list
of  all of  the prefixes  which are  sufficiently common  to bear
extraction.  The mapping of keys-to-prefixes is used to get a set
of key tails.  Finally, the whole  mess of key tails and prefixes
is      put     into      the     control      interval     using


     When a key  is stored in the index  collection, it is broken
up  into  as  many  datums  as  necessary,  where  each  datum is
contained within a control interval.   All of the datums together
make  a  single  "element" of  the  index collection.   A  set of
primitives  exists for  allocating keys,  freeing them, rewriting
and retrieving.  The  key is known to these  primitives as only a


The   key_tail_space_used_since_last_prefix_extraction    is   an
approximate value, which is only useful as an optimization aid in
determining when to do a prefix analysis.

Multics Technical Bulletin                                MTB-550
DM: index_manager_ design

string of bits, an "element".  The  structure of the key as a set
of fields is known and used by the data_mgmt_util_ utility.


     The key is layed out within  the element.  The layout of the
key is  the either the  branch_key or the  leaf_key structure, as
appropriate.   The  layout of  the contents  is described  in the
"field_table" structure, described below.   The layout of the key
is  divided into  the fixed portion  of the key  and the variable
portion,  this  being divided  yet again  among prefixes  and the
"tail".  The fixed portion is composed of all of the fixed-length
fields,  and  the  length  data for  the  variable  fields.  This
minimizes  the   expense  of  accessing  the   fixed  fields,  as
intermingling  fixed-length  and   varying-length  data  requires
calculating each field's position for each key and separating the
fixed and varying field data  eliminates the need for calculating
the location of the fixed fields.

     The  variable  portion  of  the  key  consists  only  of the
contents  of  the  variable   length  fields.   The  field  table
describes the beginning location for  the fixed length fields and
the location of the length  value for the variable length fields.
It also describes the type of  each field, and the location where
the first variable field's contents start.

7.1 The field_table structure

     The field_table structure is  stored in the index_manager_'s
per-index-collection header information,  along with the location
of the root node of the index and  a set of counts of the keys in
the     index.      The     field_table     is     declared    in
dm_field_table.incl.pl1 as shown below:
     dcl     1 field_table          aligned based (field_table_ptr),
               2 version            fixed bin (17) unal,
               2 number_of_fields   fixed bin (17) unal,
               2 maximum_field_name_length
                                    fixed bin (17) unal,
               2 location_of_first_varying_field
                                    fixed bin (17) unal,
               2 field              (ft_number_of_fields
                                    refer (field_table.number_of_fields)),
                 3 name             char (ft_maximum_field_name_length
                                    .maximum_field_name_length)) var,
                 3 flags            unal,
                   4 descriptor_is_varying

MTB-550                                Multics Technical Bulletin
                                        DM: index_manager_ design

                                    bit (1) unal,
                   4 length_is_in_characters
                                    bit (1),
                   4 must_be_zero   bit (7) unal,
                 3 location         fixed bin (35),
                 3 descriptor       bit (36) aligned,
                 3 length_in_bits   fixed bin (35),
               2 varying_field_map  (ft_number_of_fields
                                    refer (field_table
                 3 field_id         fixed bin (17) unal,
                 3 varying_field_index
                                    fixed bin (17) unal;

     The field_table.field array has one  entry for each field in
the  key, varying  and non-varying.  The  varying and non-varying
fields may be intermingled in this array.

     The  field_table.field.location variable,  for a non-varying
field, identifies  the starting offset  in the key_string  of the
data  for the  field.  field_table.field.type  identifes the data
type of  the field which  is sufficient to  know the size  of the
value  for most  types.  For  some types,  e.g.  char_nonvar, the
field_table.field.size value  is used to determine  the length of
the value.

     For  a varying  field, field_table.field.location identifies
the  location  of  the  length   value  for  the  field  (in  the
key_string).   The actual  value of the  field is  found (for the
first               varying               field)               at
field_table.location_of_first_varying_field  in  the  key_string.
The      second      varying      field      is      found     at
field_table.location_of_first_varying_field  +  (length  of first
varying field).

     The length and contents of  varying fields are separated out
in  this  fashion to  help speed  references to  specific varying
fields,  by  allowing quicker  references to  the lengths  of the
preceding  varying  fields  than  that  allowed  by  the  obvious
alternative of storing:

length1 + contents1 + length2 + contents2 + length3 + contents3 +

where:  to pick up contents3  one positions to length1, reads it,
adds  it  in  and  positions to  length2,  reads  it,  skips over
contents2, reads length3, reads contents3.  Also, this allocation
technique allows  storing the fields  in the same  order as their
definition by the caller at index-creation time.

Multics Technical Bulletin                                MTB-550
DM: index_manager_ design

     The varying_field_map array speeds the location of any given
field, which is of a varying type.

     The varying_field_map (ordinality_of_varying_field).field_id
is used  to determine the index  into field_table.field array for
the Nth varying field.

     The                                        varying_field_map
(field_array_index).varying_field_index is used  to determine the
ordinality  among varying  fields of  a varying  field given it's
field_id, it's position in the field_table.field array.


     The technique for allocating keys  has two parts; finding an
appropriate control  interval and allocating  keys within control

8.1 Allocation of keys to control intervals

     The basic  scheme for determining which  control interval in
which  to  allocate  a  key follows:   The  tree  is  searched to
determine the correct position for  the new key.  This results in
a control interval id and a  slot being selected for the new key.
If  there  is sufficient  room  for the  new  key in  the control
interval  selected,  a  single  element  containing  the  key  is
allocated in  this control interval at  the selected slot.  There
are  two  cases  which arise  when  the  key doesn't  fit  in the
selected control  interval:  there is enough  room in the control
interval on  the same level of  the index to either  the right or
the left  of the selected  control interval, or  there isn't.  In
the first  case, a rotation of  keys is used to  make room in the
selected  control  interval.   In  the second  case  the selected
control interval is split into two.


     If  room  exists  in  the preceding  or  following "sibling"
control  intervals,  then keys  are "rotated"  out of  the target
control  interval  and into  the  one with  room.   This rotation
requires  changing  the  value  of   the  parent  key  which  was
previously used to discriminate the control intervals involved in
the  rotation, which  is dealt with  as a new  "insertion" in the
parent's level.   The key being  rotated into the  parent control
interval  is  placed  in  the same  slot  as  the old  key  it is

MTB-550                                Multics Technical Bulletin
                                        DM: index_manager_ design


     If there is no room for  a rotation, then the target control
interval  must  be  "split".   This  consists  of  a  new control
interval being allocated, half of  the keys of the target control
interval being placed in the new  control interval, and a new key
being  inserted in  the parent  control interval  to discriminate
between the target control interval and the new control interval.


     The   above   scenario  is   somewhat  complicated   by  the
possibility that  a key will  be too long  to fit in  one control
interval.  To store an overlength key it is necessary to break it
into multiple elements.  The number of elements required to store
a key is equal to the least  greater integer of the length of the
key  divided  by  the   maximum  allowed  element  size.   (i.e.,
number_of_elements        =       ceil        (key_length       /
maximum_allowed_element_size)   If   key_length  is   not  evenly
divisible  by the  maximum_allowed_element_size, then  one of the
elements of the key  is less than maximum_allowed_element_size in
length.  Otherwise,  all of the  elements are maximum  sized.  If
there  is  a  less-than-maximum  sized element,  it  is allocated
first, and in the same fashion  as an element which is the entire
key  as discussed  above.  For the  maximum-sized elements, which
necessarily  occupy  entire  control  intervals, a  set  of empty
control intervals is used.

8.2 Allocation of a key within a control interval

     Once the control interval has been selected in which the key
is to be inserted, it may  be necessary to alter the construction
of   the   control  interval   to  give   the  key   the  correct
element_position_table  index,  and  to  make  room  for  the key
contents.  The  first situation arises  because the key  order is
kept   by   storing   keys   within   a   control   interval   in
element_position_table index  order.  If a key  is being inserted
which must have  an index already in use,  then the existing keys
with  that index  and higher  must be  shifted down.   The second
situation, making room for the  key's contents, arises when there
is  insufficient  room  in  the  "unused"  space  of  the control
interval for  its contents.  This  is solved by  "compacting" the
control  interval.   There  are  two  kinds  of  compaction, used
according to circumstances; scatterred  free space compaction and
common prefix compaction.

     Scatterred free space compaction is accomplished by making a
"logical" copy of  the control interval by inserting  each of the

Multics Technical Bulletin                                MTB-550
DM: index_manager_ design

keys  in the  control interval in  a temp  control interval, then
physically  copying  this  "compacted"  version  of  the  control
interval back into the original control interval.  This increases
the  unused  space  of  the control  interval  because  there was
scattered free  space in the  original copy.  The  scattered free
space comes  from deleting (or rewriting  larger) keys and simply
leaving the storage their contents  occupy behind to be recovered
in this fashion.

     Common prefix compaction is extracting  from the keys in the
control  interval  all of  the  common prefixes,  for  storage as
mentioned above.  The basic problem in this kind of compaction is
to determine what the common  prefixes are which, when extracted,
will realize a savings in space.


     The   algorithm   for  searching   the  index   has  several
determining factors; the clustering  of keys in control intervals
to  reduce paging,  the presence of  multiple fields,  the use of
ranges in constraining the search,  the use of search constraints
which require sequential comparisons  of keys, and the possiblity
of more than one set of  constraints (the results of which are to
be or'ed together)  identifying the same key.  An  example of the
next to last  determining factor is:  "last name  ends in ""s""".
An example of the last determining  factor is:  "(name = Daniel &
age > 15) or (age <  30)".  The algorithm for satisfying a search
specification must take into account all of these things.

9.1 Basic approach

     The   search_specification   provided  by   the   caller  of
index_manager_      is      analyzed      to      produce      an
"interval_specification".  The interval_specification consists of
an  array  of  "intervals"  (about   one  per  and_group  in  the
search_specification).  Each interval has a  low value and a high
value.  An  interval is specified  to be all of  the keys greater
than the  low value and less  than the high value  (keys equal to
the end points are optionally  included in an interval, depending
on  the  and_group(s)  associated  with the  interval).   The key
closest  to each  end point  (and on  the appropriate  "side") is
found   using   the  basic   search  algorithm   (implemented  by
im_basic_search).    It   may  be   necessary  to   complete  the
determination of  which keys satisfy  the search_specification by
doing a  sequential search over  one or more of  the intervals of
keys    found     by    the    above     application    of    the