Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 To: Distribution From: Gregory A. Baryza Date: 20 August 1984 Subject: New Pattern-Matching Routines Via Regular Expressions 1. Abstract This document describes a set of subroutine interfaces and commands which perform textual pattern matching. The facilities described here include those supplied by the "search_file_" entries used by qedx, emacs, mail, and others. In addition, the syntax proposed is such that upwardly compatible extensions may be made in the future without jeopardizing existing applications. Comments on this MTB should be sent to the author - via Multics mail to: Baryza.Multics via posted mail to: Gregory A. Baryza Honeywell Information Systems, Inc. Four Cambridge Center Cambridge, Massachusetts, U.S.A. 02142 via telephone to: (HVN)-492-9315, (617)-492-9315 via Multics forum on System-M: >udd>Multics>Baryza>Online_Meetings>Regular_Expressions >udd>m>Baryza>mtgs>rx ________________________________________ Multics project internal documentation; not to be reproduced or distributed outside the Multics project. MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines TABLE OF CONTENTS Section Page Subject ======= ==== ======= 1 i Abstract 2 1 Introduction 3 2 Definition of Regular Expressions 4 5 Commentary 4.1 5 . . The Null String 4.2 5 . . Bounded Iteration 4.3 6 . . Operator Precedence & Associativity 4.4 6 . . Strong vs Weak Alternatives 4.4.1 7 . . . . Strong Alternatives 4.4.2 7 . . . . Weak Alternatives 4.5 8 . . Recursively Defined Expressions 4.6 10 . . References 5 11 Input Conventions 6 13 The Pattern-Matching Process 6.1 13 . . Macro-components of Regular Expressions 6.2 14 . . A Model of the Matching Process 6.3 19 . . Labelling Substrings 7 21 Some Examples 7.1 21 . . Alternative Matches 7.2 24 . . Recursive Patterns 8 30 String versus Line Mode 9 32 Subroutines 9 33 . . rx_$get_escape_char 9 34 . . rx_$set_escape_char 9 36 . . rx_$translate 9 38 . . rx_$translate_old 9 40 . . rx_$execute 9 42 . . rx_$release 9 43 . . rx_$search 9 45 . . rx_$substitute 9 48 . . rx_$get_newline_count 9 49 . . rx_$sub_error_handler 10 51 Commands and Active Functions 10 52 . . rx_escape_char 10 53 . . rx 10 54 . . . . change 10 56 . . . . change_all 10 57 . . . . change_some 10 57 . . . . count 10 58 . . . . exclude 10 59 . . . . exclude_index 10 59 . . . . include 10 60 . . . . include_index 10 60 . . . . match_count 10 61 . . . . test Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 11 62 Include Files 11.1 62 . . rx_execute_ctl.incl.pl1 11.2 64 . . rx_match_info.incl.pl1 11.3 67 . . rx_error_info.incl.pl1 12 68 Comparative Timings 12.1 68 . . The Test Environment 12.1.1 69 . . . . Test 1 12.1.2 69 . . . . Test 2 12.1.3 69 . . . . Test 3 12.1.4 70 . . . . Test 4 12.1.5 70 . . . . Test 5 12.2 71 . . General Commentary on the Tests 12.3 72 . . Comparison Against A Compiled Program 13 73 Miscellaneous Topics 13.1 73 . . Use of National Characters 13.2 73 . . Installing it in the System 13.3 74 . . Starname Processing 13.4 74 . . Efficiency Considerations 13.4.1 74 . . . . Translator Optimizations 13.4.2 76 . . . . Sequences That Work Against Efficiency 13.5 77 . . Implementation Restrictions Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 2. Introduction String pattern-matching in PL/I and other high-level languages can be a chore at best. Although there is a great deal of power available to handle the characters themselves, the implementation of a matching strategy given an arbitrary pattern specification is not straightforward. Nonetheless, the efficient interrogation of text data is a necessity in many data-base applications, in word-processing, and in command parsing. This document is an(other) attempt to alleviate some of the character handling burden by providing a pattern-matching facility employing regular expressions. Unlike the common Multics line editors, however, all characters appearing in regular expressions are treated literally except those preceded by a user-specified escape character. It is conceded that this does not provide a smooth transition from the present "search_file_" interface. For that reason, a special entry has been included to provide for the same kind of pattern definitions as are presently available. These routines are intended to provide a reasonably complete set of pattern matching capabilities upon which other, more specialized or more specific, routines may be built. MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines 3. Definition of Regular Expressions A regular expression(1) is a pattern which specifies a set of | character strings; it is said to match certain strings. Regular expressions are used in searching for text strings satisfying some condition. Regular expressions are defined rigorously as follows. a) An ordinary character is a regular expression which matches that character. b) "^" is a regular expression which matches the null(2) character before the first character of a string. c) "$" is a regular expression which matches the null character after the last character of a string. d) "." is a regular expression which matches any non-null character. e) "[<string>]" is a regular expression which matches any one of the characters in the <string> and no others. It is illegal to include other regular expression sequences (like "." for example) within the characters of <string>. f) "[^<string>]" is a regular expression which matches any one character but the characters of <string>. ________________________________________ (1) Strictly speaking, the definition describes something more powerful than a recognizer for a regular grammar can cope with. The use of the term "regular expression" here is due to historical antecedents. (2) The "null" character referred to in this context is the character of length zero and not the ASCII character at 0/0. In this document, the ASCII character will be shown as <NUL> and the word "null" will refer to empty strings or characters. Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 g) A regular expression followed by "*" is a regular expression which matches the largest number (including zero) of adjacent occurrences of the text matched by the regular expression. h) A regular expression followed by "<m:n>" is a regular | expression which matches at least m but no more than n | adjacent occurrences of the text matched by the regular | expression. This regular expression matches as many | adjacent occurrences as it can. The semantics of this | bounded count matching is discussed below. | i) Two adjacent regular expressions form a regular expression which matches adjacent occurrences of the the text matched by each of the regular expressions. j) Two regular expressions separated by "|" or "||" form a regular expression which matches the text matched by either of the regular expressions. The semantics of alternatives is discussed below. k) A regular expression enclosed by "(" and ")" is a regular expression which matches the same text as the original regular expression. Meta-parentheses are used to alter the order of evaluation implied by "g", "h", | "i", and "j"; for example, "a(b|c)d" will match "abd" or "acd", while "ab|cd" matches "ab" or "cd". l) If "<regexp>" is a regular expression, "{<regexp>}A" is a regular expression, where A is any ASCII graphic character. This regular expression matches the same strings as <regexp>; it has certain side effects which allow references to sub-portions of the string matched by the regular expression. m) "&" is a regular expression which recursively matches the regular expression in which it is contained. Left-recursive regular expressions may be defined via this route. Care must be taken in their use, however Their effectiveness depends on the actual strings to which they applied. Infinite loops are possible and will be detected and reported as errors at run-time. MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines n) Nothing else is a regular expression. Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 4. Commentary In most of the examples given throughout this document, the meta-characters which make up regular expressions are shown beginning with the backslash character, "". This is not the only representation. The user is free to choose any character as the "lead-in" or "escape" for meta-characters. See "Input Conventions" for more on representation. 4.1. The Null String According to definition "g" above, a regular expression has the potential for matching strings of zero length. That the expression "A*" should match "" is not surprising in itself. It seems most reasonable. However, what should "A*" match in the string "BBBB"? That is to say, how many null strings are there in the string "BBBB"? Certainly, there are any number of answers to this question; however, for purposes of this document and implementation, the regular expression "A*" matches the single occurrence of a null string which occurs (in this example) -- BEFORE the first character of the string -- BETWEEN each pair of characters in the string -- AFTER the last character of the string Thus, if we were to substitute a "-" for each occurrence of "A*" in "BBBB", the resulting string would be "-B-B-B-B-". | | 4.2. Bounded Iteration | | Definition "h" above gives a way to explicitly control the number | of times a particular regular expression match is attempted. | Both m and n must be unsigned integers, and the relation m <= n | be true. | | For ease of use, the following defaults are applied: | | -- if m is omitted, its value is taken to be zero. | | -- if n is omitted, its value is taken to be (effectively) | infinity. | MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines The following is a list of various formats and the | interpretations of each one: | | A<1:5> match any string containing at least one | but no more than five A's in a row | | B<:10> matches any string of up to ten B's in a | row (including zero) | | C<5:> matches five or more C's in a row | | D<8> matches exactly eight D's | | E<:> is equivalent to "E*" | | F<> is syntactically ambiguous and therefore | will cause an error at compile time | 4.3. Operator Precedence & Associativity The operators of the regular expression are listed below in order of decreasing precedence (binding power). "*" and "<m:n>" indefinite or bounded repetition | concatenation implied by juxtaposition of expressions "|" and "||" pattern alternation Furthermore, operators of equal precedence are left-associative so that the regular expression A|B||C is parsed as if it had been written (A|B)||C. Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 4.4. Strong vs Weak Alternatives The use of "|" and "||" in regular expressions allows the implementation of alternative regular expressions during a pattern-match attempt. 4.4.1. Strong Alternatives In matching the expression defined by "|", an attempt is made to match the regular expression on the left of the "|". If the attempt succeeds, the next match attempted is the one following the entire pattern defined by "|". However, note is taken of the fact that there is an untried pattern to the right of the "|". Should a regular expression later fail to match, the "state of the match" will be restored to what it was at the time the "|" was detected, and an attempt will be made to match the regular expression to the right of the "|". (Of course, if the expression to the left of the "|" fails, the expression to the right will be tried immediately). This sequence of "remembering" and retrying will continue as long as untried alternatives exist. 4.4.2. Weak Alternatives The "||" functions somewhat differently. It still permits alternative match attempts, but once the expression to the left of the "||" succeeds, any untried expression to the right of the "||" is disregarded. Thus, "||" acts locally like the SNOBOL4 "fence". Another way to think about weak alternatives is that they are a | way of expressing preferences for the matches which should be | tried first. And furthermore, the preference specification is | such that, once a particular match is made, none of the remaining | options will be exercised. | MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines As an example, suppose we wish to parse a FORTRAN-like "common" | statement which consists of either the word "common" or its | abbreviation "com", followed by a variable name. An expression | of the form: | (common||com) *[ul][a]* | will properly match a statement like "common X" or "comY". It | will not, however, parse the statement consisting only of | "common" as placing the variable "mon" in the common area. There | is no way to prevent this latter interpretation using only strong | alternation. | To correctly interpret the statement "common2" as referring to | the variable "mon2" (variable names must begin with a letter), | the expression should read | (common *[ul]||com *[ul])[a]* | Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 4.5. Recursively Defined Expressions The use of the "&" allows regular expressions to be defined recursively. Its most common use is in the formulation of patterns which are "balanced" in the SNOBOL4 sense of "BAL()". When a "&" is encountered in the execution of a compiled regular expression, the state of the match is saved and the entire expression is re-invoked on that part of the target string which remains as yet unmatched. Of course, this re-invocation may itself cause another state-save and another invocation. This recursive "calling" may occur an arbitrary number of times(1) depending on the data in the target string. At some point, a given invocation will either succeed or fail without doing any more calls. If it fails, any alternatives in its caller are tried just as in the processing of "|". If it succeeds, it updates the state of the match on the target string to reflect the characters that it matched. It then returns to its "caller", who continues its own match attempt from the newly updated string positions. As an example of this process, the strings matched by the regular expression <&>|[abcdef][abcdef]* in the following target string are underlined. <<<bad>> + <deaf>> + dead One assumption which is made by the run-time routines is that each recursively invoked attempt to continue the match will make some amount of progress or "fail". If an invocation does not match any characters before invoking another level of recursion, the run-time support library will take this as evidence of an error condition and return the status code error_table_$fatal_error. ________________________________________ (1) The recursion level is also constrained by the amount of scratch storage available for saved match states. MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines 4.6. References Further information on regular expressions and pattern-matching are presented in the following articles. "A Theory of Discrete Patterns and their Implementation in SNOBOL4" J. F. Gimpel Communications of the ACM Vol. 16, No. 2 (Feb 1973), pp.91-100 "Regular Expression Search Algorithm" K. Thompson Communications of the ACM Vol. 11, No. 6 (June 1968), pp.419-422 "Derivatives of Regular Expressions" J. A. Brzozowski Journal of the ACM Vol. 11, No. 4 (Oct 1964), pp.481-494 Formal Languages and their Relation to Automata J. Hopcroft and J. Ullman Addison-Wesley Publishing Co. Reading Mass., 1969 U.S. Patent #3,568,156 Issued to K. Thompson, 1971 "QED Text Editor" B. W. Kernighan, D. M. Ritchie, K. L. Thompson Computing Science Technical Report #5 Bell Telephone Laboratories Murray Hill, N. J. May 1972 The SNOBOL4 Programming Language R. E. Griswold, J. F. Poage, I. P. Polansky Prentice-Hall Englewood Cliff, N. J. 1971 Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 5. Input Conventions Escape sequences are used for three purposes: 1) they are used to add special meanings to characters (i.e. make them into "meta-characters") so that they can be dealt with in a reasonably direct manner. 2) they provide a concise means of representing frequently occurring character sequences thus improving readability and decreasing programming error. 3) they are used to represent the occurrence of the character which is the "escape" character itself. Instances of the escape character are represented by "doubling" according to conventional practice. If "" is the escape character, a backslash character is represented as a "normal" character by the sequence, "\". The following sequences are allowed as an aid in program maintenance and readability. These sequences are recognized in | either case, i.e. both "n" and "N" are taken to mean the | newline character. They are listed in upper-case below for | brevity. | Char Interpretation | A the set of ASCII letters and digits | B the ASCII backspace character | C the set of ASCII control characters <NUL> thru | <US> and <DEL> | D the ASCII digits "0" thru "9" | E the current escape character | F the ASCII form-feed character | G the set of ASCII graphic characters | exclamation-point thru tilde | H the ASCII digits "0" thru "9", characters "a" thru | "f", and characters "A" thru "F" | MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines L the ASCII characters "a" thru "z" | N the ASCII newline character | O the ASCII digits "0" thru "7" | S the ASCII space character | T the ASCII tab character | U the ASCII characters A thru Z | V the set of valid ASCII characters | W the ASCII characters: horizontal tab, vertical | tab, form-feed, and space | In addition, an occurrence of the escape character followed by up | to three octal digits is interpreted as the character whose octal | is equal to that specified. Thus, assuming that "?" is the | escape character, "?12", "?n", and "?012" all represent the | newline character; and "?101" and "A" are equivalent. This | sequence may also be used to represent characters "outside" of | ASCII, e.g. "?776". | NOTE: All sequences employing the escape character as a lead-in character except those described in this document will be treated as illegal. Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 6. The Pattern-Matching Process As with many things, reference material is not enough to provide an understanding of the object defined. Descriptions which suggest appropriate models and give examples are highly desirable. This section attempts to provide an operative, though slightly inaccurate, paradigm of pattern-matching and gives some examples with discussion. 6.1. Macro-components of Regular Expressions One of the most important cognitive steps in understanding the meaning of a regular expression is the ability to "clump" groups of characters into semantically more meaningful pieces. This takes practice, but is of great utility. It is equivalent to becoming comfortable enough with Multics starname processing to read dl bound_??*_.s.archive as a request to delete "all source archive files in the working directory whose names are bound_<something-or-other>_". In this vein, the following table gives some simple regular expressions along with a suggested way of thinking about them. Note that in each case, the expression given has many more components (according to the formal definition) than are described by the larger view. Expression Interpretation ^cat the word "cat" at the beginning of a string fox...* the word "fox" followed by at least two characters(1) ^$ a null string (contains no characters) ^.*$ all the characters of any string, even the null one ________________________________________ (1) The word, "fox", followed by two characters, followed by zero or more characters implies at least two characters after the word. MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines declare|dcl the start of a PL/I data declaration(1) ca(n|b|t|re)s any of the words: "cans", "cabs", "cats", or "cares" ca[brn]s any of the words: "cabs", "cars", or "cans"(2) $[d][d]* an integral dollar amount "[01]*"b a PL/I bit-string of arbitrary length (including ""b) 6.2. A Model of the Matching Process Once we have decomposed a regular expression string into some reasonably high-level clumps, talking about the pattern-matching process becomes easier. One way to visualize what happens is to watch the interaction among three items: 1. an ordered list of the clumps making up the regular expression, 2. a string of characters which is to be searched for the desired pattern, 3. a cursor indicating the "current" position in the string. For illustration purposes, a cursor will be shown as pointing between the characters of a string. In special cases, it can also point before the first or after the last character of the string. The sample strings to be searched in this section will not contain blanks in them, so we will depict the position of a cursor within a string schematically as string: ABCDEF GHIJKLMN A| cursor: | ________________________________________ (1) Catenation of expressions (such as "d", "c", and "l") binds more tightly than alternation. Hence, No parentheses are necessary. (2) Note that tests for membership in a set of characters is often more readable than a list of single-character alternatives. Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 In this case, the cursor is said to be between "F" and "G", after the "F", or before the "G" depending on which context makes the most sense. Once we have the ideas of "clumps" and "cursors", determining | whether a given regular expression can match a pattern somewhere in a specified string is a simple process. We must ask, for each "clump" in the list, the following question: "Does the pattern I am looking for occur immediately to the right of the cursor?" Each time the answer to the question is "yes", the cursor is advanced to the right by the appropriate number of characters Then, the next clump in the list gets to ask the same question. When all the clumps have, in turn, answered "yes", the regular expression has matched some portion (perhaps all) of the specified string. This simplified picture ignores issues of bookkeeping, such as where the cursor starts, how it is advanced, whether it must be remembered, and where it is left when things are done. It also does not touch on how alternatives are handled, or how patterns like ".*" know how many characters to match. Those items will be covered now. Before the first clump takes its first look at the target string, the cursor is positioned before (to the left of) the first character of the string. After all the clumps have answered "yes" and some portion of the target string has been matched, the position of the cursor is examined. If the cursor is not after the last character of the string, the list of clumps is asked to try and match the remaining (as yet unexamined) part of the string to the right of the cursor. This approach has two main consequences. First, a given regular expression may match several times at several different places within the target string. Second, if there are two or more matches, none of them overlap with any other.(1) ________________________________________ (1) There are a few special cases involving "^" and "$" in which this is not true. MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines We have not discussed what happens when some clump in the list cannot find its pattern in the string to the right of the cursor. In the simplest case (a regular expression that doesn't contain "*", or flavors of "|"), the obvious happens. The cursor is restored to the position it had before the first clump was asked to decide.(1) If this point is after the last character of the string, the regular expression gives up and reports any matches it might have already made. Otherwise, the cursor is advanced to the right one position and the clumps in the list are asked to decide again. In the more complicated case where there are alternatives ("|" or "||") in a regular expression or where an indefinite repetition ("*") occurs, the actions taken are slightly more complicated. In each of these cases, the clump "remembers" where it was when it found one of the patterns it was supposed to look for. If one of the later clumps should fail to find its own pattern, rather than advancing the cursor and starting over at the head of the clump list, the most recent clump that has a remembered choice is given a chance to try again. In the case of an alternative, the clump tries to match the other pattern. In the case of an indefinite repetition, the match is shortened by one group. Then, the next clump after this point of recovery is asked to try again. In an obvious fashion, the opportunities to "try again" provided by this process are nested. Failure at one level is only reported back if all the possibilities at this and lower levels have been exhausted. This allows all possibilities to be systematically and reliably tried. ________________________________________ (1) This is not necessarily the beginning of the string. Successful matches may have been made in the target string prior to this attempt. Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 As an example, the following table shows how this process may be visualized. The regular expression which will be used to illustrate the procedure is or.*ten$ which can be divided into the clumps given in the following table. Clump Interpretation or the word, "or" .* some number of characters, maybe none ten$ The word, "ten", at the end of the string Given that regular expression and those clumps as its interpretation, the following table shows the target string (with cursor), the "active" clump, and a short explanation of what is | happening. Step Target_String Clump Commentary 1. foreshorten or "or" doesn't match here. A| Since this is the first clump, | advance the cursor and try again. 2. f oreshorten or There is a match here. A| Advance the cursor by the | length of the pattern. Get the next clump. 3. for eshorten .* This matches the longest A| string it can. It gobbles up | the remainder of the string, but remembers there are other choices. The next clump gets control. 4. foreshorten ten$ There is no possibility of a A| match. Fall back. | MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines 5. foreshorte n .* The indefinite repeat of step A| 3 gives up a character, moving | the cursor back one. The match attempt restarts with the next clump. 6. foreshorte n ten$ Still no match. Fail again. A| | 7. foreshort en .* Once again, a character is A| surrendered from the series | grabbed in step 3. And the next clump in the list gets control. 8. foreshort en ten$ There still aren't enough here A| to match. This attempt fails | miserably. But there is hope; some characters still remain from the match in step 3. 9. foreshor ten .* Another character whittled A| away in the name of progress. | 10. foreshor ten ten$ This one works! A| | At the completion of the last step in the preceding table, the first and third clumps have matched copies of themselves. The indefinite repetition has matched the string of characters, "eshor". Since the cursor will now be positioned to the right of the last character, no further attempts to match this pattern will be tried. It should be noted once again, that that this is only a model of how pattern matching works. In actuality, the compiler does many optimizations to reduce the amount of backtracking which must be done. However, the process shown above does provide a reasonably accurate model. Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 6.3. Labelling Substrings Those people who are users of the text editor QEDX are already | familiar with the concept of a "label" for a regular expression. In the context of the "substitute" command, the occurrence of an "&" on the right-hand side it interpreted to mean a reference to whatever string was matched by the regular expression on the left side. This is what allows the command s/quick/&,/ to change the familiar sentence fragment The quick brown fox jumped ... into The quick, brown fox jumped ... Labelled substrings in the context of this facility are an extension of this interpretation. They allow references, not only to the entire string matched, but also to pieces of it. In the example of the preceding table, if the regular expression had been written as or{.*}Xten$ then "X" would name the sequence of characters "eshor". Assuming that QEDX supported this kind of facility, the command s/or{.*}Xten$/rXed/ | would change "foreshorten" into the (dubious) word, "reshored". MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines The "labelling" facility listed here allows up to 94 labelled substrings to be a part of each match; one for each ASCII graphic character. Nesting is permitted as in "{{..*}X}Y" which names the same substring with the tags "X" and "Y". It is illegal, however, to define a regular expression which causes the same label character to be defined simultaneously to two separate strings in the same match. Thus, {.....}X{.....}X is always illegal (and can, in general, only be detected at run-time), while {[D]}X|{[L]}X is correct usage (and will always work, in fact). Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 7. Some Examples This section contains some detailed examples of the pattern matching process. While they do not illustrate all the possibilities, there are enough here to illustrate the general methodology. 7.1. Alternative Matches This example demonstrates the use of alternatives in pattern matching. The words we are looking for are "is" and "it". The pattern we will use for this is i(s|t) which is decomposed in the following way. Clump Interpretation i the letter, "i" s|t either the letter "s", or the letter "t" (note that the parenthesis only indicate the grouping and play no active role in the match) The sentence we will apply this pattern to is "This_is_it." Underscores have been used so that the cursor position can be shown unequivocally in the table. Step Target_String Clump Commentary 1. This_is_it. i The character to the right of A| the cursor does not match. | Since there are more left, advance the cursor and try again from the beginning. 2. T his_is_it. i The same result (and action) A| as in the previous step. | MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines 3. Th is_is_it. i This time, the character to A| the right is an "i". Remember | this as the start of a match. Advance the cursor by the length of the pattern (one character), and try the next clump. 4. Thi s_is_it. s|t Remember this as a fallback A| point. Try the alternative on | the left first. 5. Thi s_is_it. s The character to the right is A| an "s". Advance the cursor by | the length of the pattern (one character). Since this is the end of the clumps, the match succeeds. However, there are characters left in the string so restart the match from the beginning (and forget the saved fallback point). 6. This _is_it. i The character to the right A| doesn't match. Advance the | cursor and start again. 7. This_ is_it. i The character matches. A| Remember the start of the | second match in this string. Advance the cursor by the length matched, and try the second clump. 8. This_i s_it. s|t Remember this as a fallback A| point. Try the alternative on | the left first. 9. This_i s_it. s The character to the right is A| an "s". Advance the cursor by | the length of the pattern (one character). Since this is the end of the clumps, the second match succeeds. However, there are characters left in the string so restart the match from the beginning (and forget the saved fallback point again). Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 10. This_is _it. i Try the first clump again. It A| fails to match so we advance | the cursor and start again. 11. This_is_ it. i This time there is an "i" to A| the right of the cursor. | Remember the start of the third potential match. Advance the cursor as before and move to the next clump. 12. This_is_i t. s|t Mark this cursor position as a A| fallback point and try the | clump on the left. 13. This_is_i t. s The character to the right is A| not an "s". The match would | fail if not for the fallback point established in step 12. In this case, it does not involve any readjustment of the cursor. 14. This_is_i t. s|t Having come back to a restart A| position, try to use the | pattern on the right (but remember that there is no further recourse if something fails after this). 15. This_is_i t. t This pattern also succeeds. A| The start and size of the | third successful match is remembered. The cursor is advanced by the size of the match. There are more characters left. The great mandala continues. MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines 16. This_is_it . i The character to the right of A| the cursor is not an "i". The | cursor is advanced. However, because no more characters are left(1) to the right of the cursor, no further tries will be made. The results returned from this pattern-match attempt will indicate that there were three matches in the target string. They occurred at positions 3, 6, and 9. All were two characters long. 7.2. Recursive Patterns This example demonstrates the use of recursion in pattern-matching. In this case, we will use a simplified form of arithmetic expressions. If we define the arithmetic expression as: 1) a variable name, or 2) a variable name followed by an operator followed by an expression, or 3) an expression enclosed in parentheses, then the regular expression which matches this is given by [abcde][+-*/]&|(&)|[abcde] where the variable names have been simplified to the single characters chosen from the first five letters of the lower case alphabet, and the operators to those of addition, subtraction, multiplication, and division. This pattern is broken down as Clump Interpretation [abcde] a variable name [+-*/] an arithmetic operator ________________________________________ (1) In fact, this match attempt will never be made. The translator will recognize the fact that this pattern needs at least two characters to succeed and pass this information to the run-time. At this point, there aren't two characters left. Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 & an occurrence of an expression ( the character, "(" & an occurrence of an expression ) the character, ")" [abcde] a variable name The alternation clumps have been omitted from the discussion as previously. As a test case, we will use the string a+(b*(c)) which is an example of most all the items we have informally mentioned above. To simplify the following discussion, however, the alternations and fallback points will be mentioned in the commentary rather than being given explicitly as separate steps. Step Target_String Clump Commentary 1. a+(b*(c)) [abcde] There is a variable name A| to the right of the | cursor. However, there are two untried possibilities to use if something goes wrong later. Advance the cursor, and try the next clump. 2. a +(b*(c)) [+-*/] There is an operator to A| the right of the cursor. | Advance it by the length of the pattern (one character) and try the next clump. 3. a+ (b*(c)) & Remember this as return A| point from recursion. Do | not advance the cursor, but "call" ourselves to start with the first clump again. 4. a+ (b*(c)) [abcde] This doesn't match; but A| there are alternatives | left to try. MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines 5. a+ (b*(c)) ( This alternative matches. A| Advance the cursor by the | length of the match (one character) and pass control to the next clump. Remember that there is still one other remaining alternative from step 3. 6. a+( b*(c)) & This is another recursive A| "call" on ourselves. At | this point, we are three-deep in invocations, the other two having been started at steps 1 and 3. 7. a+( b*(c)) [abcde] This one matches. Advance A| the cursor by the length | of the match and try the next clump. Remember that there are two other alternative patterns left untried (besides those pending from earlier). 8. a+(b *(c)) [+-*/] This also matches. Looks A| like we're on a roll, so | advance the cursor again and continue with the next clump. 9. a+(b* (c)) & Once again, we "call" A| ourselves to continue the | pattern. This will make the fourth level of invocation. 10. a+(b* (c)) [abcde] Nothing here like that; A| maybe the next alternative | will have some luck. 11. a+(b* (c)) ( That alternative looks A| fine. Step the cursor and | try the next clump. 12. a+(b*( c)) & Another recursive A| invocation. | Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 13. a+(b*( c)) [abcde] This one matches so we do A| the customary things to | the cursor and list of clumps. However, there are untried things to do if we fail to match later on (hint, hint). 14. a+(b*(c )) [+-*/] This one fails. Restore A| things to where they were | at the point of the last alternative choice, and retry with the next one (if any) in line. 15. a+(b*( c)) ( This is the next A| alternative to try (left | over from step 13). Unfortunately, it also fails. So we try the last alternative available to us. If that doesn't work, we will report failure back to the last untried alternative prior to us, and so on. 16. a+(b*( c)) [abcde] This one matches. We A| advance the cursor by the | length of the match (one character). However, there are no clumps left in the list. Since we were called as a recursive invocation from step 13, we return control there leaving the cursor and match information alone. 17. a+(b*(c )) ) This matches too. The A| cursor is advanced. This | is the end of this portion of the pattern. Control returns to the "caller" at step 9. MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines 18. a+(b*(c) ) <NONE> There is no successor A| clump to pass control to. | The pattern which called us from step 9 was one which ended with an invocation of "|&". Therefore, the expression ends at this point. The cursor is not changed and this "returns" to its caller at step 6. 19. a+(b*(c) ) ) This is the pattern A| following the "&" invoked | by step 6. It matches, advances the cursor, and (since this is the end of a successful alternative) returns to its caller at step 3. 20. a+(b*(c)) <NONE> The same sequence happens A| here that happened in step | 18. The alternative has successfully matched. The only difference is that the "caller" was actually the run-time support software. It initiated this match attempt with step 1. Since all the clumps have been used, we have had a successful match. Moreover, there are no further characters left in the string, so all processing is finished. The results returned from this match will show a pattern detected starting at position 1 which has a length of 9. Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 There are a few general points to be noted about the preceding example and about recursion in general. 1) If untried alternatives exist at the end of a particular "invocation", they will be discarded once that level of recursion returns to its "caller". Untried alternatives from previous levels are still available, however.(1) 2) Recursive expressions are easy to write, and even natural for patterns involving nested constructs. They can easily go awry if the writer does not understand clearly how the matches will be made. 3) The use of labelled substrings within recursive expressions will almost always cause the same character to be associated with more than one substring in a pattern. This is an error as far as the run-time is concerned. 4) Because of the way recursion works, we cannot write a pattern to match <expr> <op> <expr> as "&[+-*/]&" The reason is that the initial "&" causes a left-recursion loop which will be detected by the run-time and reported as an error. ________________________________________ (1) Untried alternatives mimic the action of PL/I condition handlers with respect to recursive procedures. MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines 8. String versus Line Mode | | The discussion up until now has presumed that the string to be | matched was viewed as an unbroken series of characters. This is | called "string" mode. However, it is possible to view a string | to be examined as a series of "lines", that is, sequences of | characters delimted by ASCII newline characters.(1) By changing | the way the sequence of characters is viewed, rapid searches of | textual data are possible. | | | This other ("line") model of character sequences means that | certain regular expressions will have to perform differently | depending on whether the character sequence is viewed as a single | string, or a series of lines. The following discussion explains | the differences. | | | Item String Mode Line Mode | Ordinary matches a newline if the ditto Character ordinary character itself is a newline | . matches any character matches any character including a newline except a newline | [ ] matches a newline if the ditto newline is included in the set of characters within the brackets | [^ ] matches a newline if a never matches a newline newline is not included in the set of characters within the brackets | ^ matches before the first matches before the first character of the string character of the string, or immediately before any character preceded by a newline | ________________________________________ (1) The entire body of a Multics mail text is an example of this. Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 $ matches after the last matches after the last character of a string character of a string, or before a newline in the middle or end(1) of the string ________________________________________ (1) The newline is stepped over in this case, but is not counted as being in the part of the string that is matched. Newlines may only be included in matches by including them as "ordinary characters" in a pattern definition. MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines 9. Subroutines The following pages describe subroutines available to translate strings which define regular expressions into Multics machine code, and apply the patterns they define to "target" strings. * The include files used as part of the regular expression processing are described in a separate part of this document. All routines described in this section are interruptable and re-entrant. Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 ___________________ ___________________ rx_$get_escape_char rx_$get_escape_char ___________________ ___________________ NAME: rx_$get_escape_char This entry returns the present value of the "escape" character for regular expressions. USAGE declare rx_$get_escape_char entry (char(1) varying); call rx_$get_escape_char (escape_char); ARGUMENTS escape_char is the present value of the process-local default escape character. (Output) NOTES: If no default escape character has been established at the time of the call to this entry point, a zero-length string will be returned. MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines ___________________ ___________________ rx_$set_escape_char rx_$set_escape_char ___________________ ___________________ NAME: rx_$set_escape_char This entry defines the character to be used as the default "escape" character when one is not supplied in a call to rx_$translate. It also returns the value of the escape character which was set prior to calling this entry. USAGE declare rx_$set_escape_char entry (char(1) varying, char(1) varying); call rx_$set_escape_char (new_escape_char, old_escape_char); ARGUMENTS new_escape_char is the new value for the process-local default escape character. (Input) old_escape_char is the value of the process-local default escape character before the call to this entry. (Output) NOTES: The present value of the escape character may be cancelled by setting new_escape_char to the null string. All subsequent calls to rx_$translate must then explicitly provide an escape character if they wish regular expression meta-characters to be recognized. If no default escape character has been established at the time of the call to this entry point (or it has been cancelled), a zero-length string will be returned. Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 ___________________ ___________________ rx_$set_escape_char rx_$set_escape_char ___________________ ___________________ The user should choose a value for the escape character which is not likely to be typed in the normal course of specifying patterns to avoid unexpected results. Choosing a period, for example, is probably a "bad" idea because it occurs frequently and using it as the escape character also prevents its use in matching an arbitrary character. MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines _____________ _____________ rx_$translate rx_$translate _____________ _____________ NAME: rx_$translate This subroutine translates a regular expression defining string into an internal form. This internal form can subsequently be applied to other strings by rx_$execute to determine whether they "match" the pattern defined by the regular expression. USAGE declare rx_$translate entry (char(*), char(1) varying, ptr, fixed | bin(35)); | call rx_$translate (rx_string, escape_char, form_p, code); | ARGUMENTS rx_string is the string defining the regular expression pattern. (Input) escape_char is the character which should be considered to be the escape or "lead-in" character for regular expression meta-characters. (Input) * form_p is a pointer to the translated form of the regular expression. (Output) code is a standard status code. (Output) NOTES: If escape_char is passed in as a zero-length string, the character established as the "default" escape character will be used. If no character has previously been established as the default escape character, no regular expression meta-characters will be recognized. Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 _____________ _____________ rx_$translate rx_$translate _____________ _____________ Processing of "doubled" escape characters takes precedence over recognition of meta-characters. For example, if "." is chosen as the escape character, then ".." is interpreted as matching a period, not any non-null character. The return value "form_p" can be considered a "ticket" to the | translated form of the regular expression. It will be needed in | subsequent calls to rx_$execute or rx_$search to accomplish the | actual searching of strings. It is valid only for the duration | of the process. It can be voided during execution by calling | rx_$release to return the space it occupies. | If the string defining the regular expression to be compiled is | malformed, or if the translator encounters an internal error, the | return code will be set to the value, | error_table_$translation_failed. In addition, rx_$translate will | signal sub_err_ passing additional information about the location | of the error, if known. This information is described by the | include file, rx_error_info. | MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines _________________ _________________ rx_$translate_old rx_$translate_old _________________ _________________ NAME: rx_$translate_old This entry point performs a translation function identical to that defined by rx_$translate, and returns the same result information. The difference lies in the syntax of regular expressions. This entry point accepts the form of regular expression defined for the "qedx" command.(1) It also recognizes the "c" convention to strip regular expression meta-characters of their special meanings. USAGE declare rx_$translate_old entry (char(*), ptr, fixed bin(35)); | call rx_$translate_old (rx_string, form_p, code); | ARGUMENTS rx_string is the string defining the regular expression pattern. (Input) * form_p is a pointer to the translated form of the regular expression. (Output) code is a standard status code. (Output) ________________________________________ (1) See manual #AG92, Multics Programmer's Manual, Command and Active Functions; "NOTES ON REGULAR EXPRESSIONS" Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 _________________ _________________ rx_$translate_old rx_$translate_old _________________ _________________ NOTES: The structure returned as the result of translating the given regular expression is exactly the same as the one documented by rx_$translate. There is no way that the run-time support routine can know from what form the regular expression derived. * This entry is being provided to allow the replacement of the interpreter presently included in the "search_file_" subroutine. MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines ___________ ___________ rx_$execute rx_$execute ___________ ___________ NAME: rx_$execute This subroutine applies the translated form of a previously defined regular expression pattern to a "target" string and returns information about any "matches" which occur. USAGE declare rx_$execute entry (ptr, char(*), ptr, ptr, fixed bin(35), | fixed bin(35)); | call rx_$execute (form_p, string, control_info_p, match_info_p, | match_count, code); | ARGUMENTS form_p is a pointer to the translated form of the regular expression pattern. (Input) string is the string to be tested. (Input) control_info_p is a pointer to a structure, rx_execute_ctl, describing the kind of match to be done and the results desired. (Input) | match_info_p | is a pointer to a structure describing the places in | "string" where the regular expression represented by | "form_p" matched. This data can be selectively returned to | the user under control of the options expressed in the | structure pointed to by "control_info_p". (Input/Output) | | match_count | is the number of times the regular expression represented by | "form_p" matched the string given by "string". (Output) | code is a standard status code. (Output) Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 ___________ ___________ rx_$execute rx_$execute ___________ ___________ NOTES: The input argument, form_p, identifies the location of the translated regular expression. * The information specifying whether labels are desired as part of the returned info; whether one, all, or a bounded set of matches is desired; and similar information is contained in the structure described by rx_execute_ctl.incl.pl1. Upon completion of the routine, information about the places in | the string where the pattern matched is returned to the user. | This information is contained in the segment pointed to by | match_info_p and is described by the structures contained in the | include file, rx_match_info.incl.pl1 (q.v.). | If the caller specifies that details on the match should be | returned, then a temp segment will be allocated to hold that | information. A pointer to the segment will be returned in | match_info_p. It may be returned by passing it to rx_$release. | As an efficiency consideration, however, this segment may be | re-used by passing it to subsequent calls to rx_$execute thus | saving the time of disposal and re-allocation. | | | Errors which occur during the attempted application of the | translated pattern to the target string will be reported via | calls to sub_err_ so that additional explanatory text can be | given. | | | All non_zero status returns, save one, from the call to | rx_$execute are fatal. In the fatal cases, the contents of the | match segment are indeterminate. A returned status value of | error_table_$noalloc is indicative of the single non-fatal case. | It signifies that the match segment was not long enough to | contain all the match results. In this case, the returned match | data are consistent and properly threaded; and in most instances, | application may begin again later on the substring after the last | match described with no ill effect. | MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines ___________ ___________ rx_$release rx_$release ___________ ___________ NAME: rx_$release | | | This subroutine de-allocates the storage obtained for various | return arguments from the rx_$translate and rx_$execute | entrypoints. | | | USAGE | | | declare rx_$release entry (ptr, fixed bin(35)); | | call rx_$release (object_p, code); | | | ARGUMENTS | | | object_p | is a pointer to some object returned by one of the other | rx_$XXX entrypoints. It can, for example, be the translated | regular expression form returned by rx_$translate. | (Input/Output) | | code | is a standard status code. (Output) | | | NOTES: | | | If "object_p" is a null pointer upon entry, then this routine | will set the status code to zero and return. | | | If "object_p" legitimately points at one of the items returned by | an rx_$XXX entrypoint, then the storage occupied by that object | will be de-allocated and "object_p" will be set to null. | | | In all other cases, the error status, | error_table_$unimplemented_version, will be returned and nothing | will be done to the input pointer value or the locations it | references. | Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 __________ __________ rx_$search rx_$search __________ __________ NAME: rx_$search | | | This subroutine searches for the first occurrence of a regular | expression in a given string. Either the entire string or a | substring may be examined for the pattern. | | | USAGE | | | declare rx_$search entry (ptr, char(*), fixed bin(21), fixed | bin(21), fixed bin(21), fixed bin(21), | fixed bin(35)); | | call rx_$search (form_p, string scan_start, scan_length, | match_start, match_length, code); | | | ARGUMENTS | | | form_p | is a pointer to the translated form of the regular | expression. (Input) | | string | is the string to be searched. (Input) | | scan_start | is the position within "string" where searching for the | pattern may commence. (Input) | | scan_length | is the number of characters, beginning at "scan_start", | which may be examined for the occurrence of the pattern. | (Input) | | match_start | is the index position relative to the beginning of "string" | where the first match occurred. (Output) | | match_length | is the number of characters in "string", beginning at | "match_start", which were included in the pattern-match. | (Output) | MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines __________ __________ rx_$search rx_$search __________ __________ code | is a standard status code. (Output) | | | NOTES: | | | If "scan_length" is given as the value -1, it is taken to mean | the rest of the string from "scan_start" to "length(string)". | | Other than using -1 for "scan_length", all substrings defined via | the use of "scan_start" and "scan_length" must be wholly | contained within "string" itself. | | | If the pattern does not occur within the string specified, both | "match_start" and "match_length" will be set to zero. | Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 ______________ ______________ rx_$substitute rx_$substitute ______________ ______________ NAME: rx_$substitute This entry implements the algorithm for modifying a string given a set of match data derived from application of a regular expression to the string and a change specification. In subroutine terms, it performs like the qedx "substitute" (s) command. USAGE declare rx_$substitute entry (char(*), ptr, char(1) varying, char(*), char(*), fixed bin(21), fixed bin(35)); call rx_$substitute (input_string, match_data_p, escape_char, change_spec, output_string, output_string_used, code); ARGUMENTS input_string is the string to which the regular expression was applied. (Input) match_data_p is a pointer to the match data structure, rx_match_info, returned from the call to rx_$execute. (Input) escape_char is the character which can be used to override the default escape character in recognizing references to labelled substrings in the match data. (Input) change_spec is a string which describes what to substitute for each of the matched portions of "input_string". (Input) output_string is the place where the transformed string is to be stored. (Output) MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines ______________ ______________ rx_$substitute rx_$substitute ______________ ______________ output_string_used is a value that indicates how long the resultant string was. It may indicate a value longer than the length of "output_string" (see NOTES below). (Output) code is a standard status code. (Output) NOTES: The algorithm used to affect the transformation of the input string according to the change specifications is as follows: A) The internal escape character is set to the value of "escape_char". If this is a zero-length string, then the default value returned by rx_$get_escape_char is used. B) The input string is broken down into a series of substrings. The criteria for the break is that each piece is either wholly described by one of the elements in the match information (rx_match_data_template), or does not intersect any of the matched parts. C) Once this has been done, the substrings of the input string are examined one at a time from left to right. Each is processed according to the rules described by the remainder of this section. D) If this substring is not part of a successful match, it is copied to the output string as is. The next substring to the right is then examined. E) Otherwise, each character of the change specification string is examined in turn. If it is not equal to the internal escape character, the change specification character itself is copied to the output string. And the next character of the change specification is examined. F) If the current change specification character equals the | internal escape character, the following character is | checked to see if it a second occurence of the escape | character. If this is so, a single instance of the escape | character is copied to the output string. | Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 ______________ ______________ rx_$substitute rx_$substitute ______________ ______________ G) Otherwise, if the current change specification character | equals the internal escape character, the following | character is checked to see if it is a valid label character for regular expression sub-matches. H) If it is valid, a check is made to see if a "label" by that name was defined during the match. If so, the substring of the input string matched by the label is copied to the output string. If not, the label character is ignored; i.e., nothing is copied to the output string. In either case, the next change specification character is subsequently examined. If both the input escape character supplied to this routine and the default escape character obtained from rx_$get_escape_char are zero-length strings, no labels will be recognized in the change specification. It is an error to end "change_spec_string" with the escape character. Under the rules outlined above, a pair of escape characters will | always cause a single escape character to appear in the output | string, even if there was a labelled substring which was "named" | by the escape character. | The return value, "output_string_used", will contain the length of the transformed string in all cases. When "output_string" is too small to contain the result, "code" will be set to the value error_table_$smallarg and "output_string_used" will have the size needed to fully contain the result. Otherwise, the result is in substr(output_string, 1, output_string_used). MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines _____________________ _____________________ rx_$get_newline_count rx_$get_newline_count _____________________ _____________________ NAME: rx_$get_newline_count | | | This functions returns the maximum number of lines that a | translated regular expression will match if it applied to a | target string in "line mode". If the expression would match an | arbitrary number of lines, then this function returns the value, | -1. | | | USAGE | | | declare rx_$get_newline_count entry (ptr) | returns (fixed bin(21)); | | nl_cnt = rx_$get_newline_count (form_p); | | | ARGUMENTS | | | form_p | is a pointer to the translated form of the regular | expression pattern. (Input) | | | NOTES: | | | If the argument supplied is null, or does not point to a version | of the translated regular expression, then this routine will | signal the sub_error_ condition by calling the sub_err_ routine. | The text of the sub_error_info structure will contain the reason | for the error. Upon return from the signal, this function will | return the value, -1. | Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 _____________________ _____________________ rx_$sub_error_handler rx_$sub_error_handler _____________________ _____________________ NAME: rx_$sub_error_handler | | | This entry provides a simple, common way of handling the | sub_error_ condition raised when errors are detected in various | rx_ routines. It provides for obtaining the most recent rx_ | error frame, interpreting the information supplied, and | displaying it via standard output mechanisms. | | | USAGE | | | declare rx_$sub_error_handler entry (entry options(variable), | char(*), fixed bin(35)); | | call rx_$sub_error_handler (error_report_proc, callers_name, | code); | | | ARGUMENTS | | | error_report_proc | is a procedure which can be called to display the error | information to the user (see NOTES below). (Input) | | callers_name | is the name of the routine on whose behalf the condition | handling is taking place. (Input) | | code | is a standard status code. (Output) | | | NOTES: | | | This entry assumes that it is being called as the result of a | sub_error_ condition signalled by an rx_ routine. It searches | backward on the stack for the most recent, appropriate condition | frame and interprets the information supplied. It then calls the | routines given as the value of "error_report_proc" passing the | interpreted error information. | MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines _____________________ _____________________ rx_$sub_error_handler rx_$sub_error_handler _____________________ _____________________ This routine assumes that the "error_report_proc" has the same | calling sequence as com_err_ and its siblings: | com_err_$suppress_name, active_fnc_err_, | active_fnc_err_$af_suppress_name. This routine calls | "error_report_proc" passing it the status code supplied in the | call to sub_err_, the value given as "callers_name" as the name | of the reporting entity, and ioa_ control strings and parameters | describing the error condition. | | | The return status code from this entry indicates whether the | transmission of the error information to the user was successful. | It is zero if it was. The code, | error_table_$action_not_performed, will be returned if the frame | could not be located; and error_table_$unimplmented_version will | be returned if the frame was present but the "info_ptr" given to | sub_err_ did not point to a valid instance of rx_error_info. One | other source of non-zero return codes is find_condition_info_ | itself. | Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 10. Commands and Active Functions The following pages describe several commands and active functions built on the regular expression subroutines documented earlier. They allow manipulation and testing of command-line arguments, generalized selection, and control over the default escape character. MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines ______________ ______________ rx_escape_char rx_escape_char ______________ ______________ SYNTAX AS A COMMAND: rx_escape_char {value} SYNTAX AS AN ACTIVE FUNCTION: [rx_escape_char {value}] FUNCTION: sets/prints/returns the default escape character for regular expression processing. ARGUMENTS: value gives the character to be used as the default regular expression escape character. NOTES: Whether or not the first argument is supplied, the present escape character will be printed (command) or returned (active function). Unless "value" is supplied, however, no attempt to modify the character saved as the default will be made. If the string supplied for "value" is longer than one character, the first character of the argument will become the escape character. In the case of command invocation, this instance will be noted as a warning. The default escape character can be cancelled by supplying the null string as the value of the argument. Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 __ __ rx rx __ __ SYNTAX AS A COMMAND: rx opname regexp string1 ... stringN SYNTAX AS AN ACTIVE FUNCTION: [rx opname regexp string1 ... stringN] FUNCTION: translates the regular expression given as the second argument and, depending on the opname given as the first argument, "applies" the translated expression to the following strings and returns a result. ARGUMENTS: opname designates a list of possible operations to perform on the the strings given as arguments. Opnames permitted, followed by their alternate forms where applicable are: change, chg change_all, chga change_some, chgs count, ct exclude, ex | exclude_index, exi | include, in | include_index, ini | match_count, mct test See "List of Operations" below for a description of each opname, its syntax, and specific application. Operations are arranged in the same order as above. regexp is a string which defines a regular expression. The syntax of | this string is the syntax governing input strings to the | entrypoint, rx_$translate. | stringi can be one or more arguments depending on the particular operation to be performed. MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines __ __ rx rx __ __ LIST OF OPERATIONS: Unless otherwise noted, all errors detected by a command invocation will be reported via com_err_; active function | invocations will report their errors via calls to active_fnc_err_. OPERATION: change SYNTAX AS A COMMAND rx change regexp change_spec string {strings} SYNTAX AS AN ACTIVE FUNCTION [rx change regexp change_spec string {strings}] FUNCTION: applies the regular expression given as the second argument to its fourth and following arguments. Those arguments which the regular expression matches are transformed according to the specifications given as the third argument and returned. Arguments that the regular expression does not match are returned unaltered. Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 __ __ rx rx __ __ NOTES: The transformation of arguments takes place according to the following rules.(1) For purposes of this example, the regular expression escape character is assumed to be the backslash character, "". A) each occurrence of "regexp" in "strN" is replaced by "change_spec". B) if a construct of the form "{regexp2}X" is evaluated as part of the resulting pattern-match, then any occurrence of "X" in "change_spec" is replaced by the string matched by "regexp2" during rule A above. C) the character, "&", is predefined to be a "label" for the characters matched by "regexp". Its occurrence in "change_spec" is treated as in rule B above. D) preceding any character in "change_spec" by a "c" will cause that character to be copied literally during replacement, thus negating the effects of rules B and C. These rules imply that the "change" operation functions in a | manner which is similar to the "substitute" command(2) of the | qedx editor; regexp plays the role of the left-hand side, and change_spec mimics the right-hand side. Each of the argument stringN's are analogues of lines in a text buffer. If "change_spec" is a null string, any occurrence of "regexp" in each argument will be deleted. ________________________________________ (1) These are the same actions as implemented by the subroutine, rx_$substitute. In addition, if the default escape character is not one of the characters "{", "}", or "&", the user's regular expression is augmented to be preceded by "{" and followed by "}&" before it is translated. This allows easy reference to the entire matched string by the change specification. (2) The actions which take place are similar. The syntax of the regular expression is that of rx_$translate, not that of rx_$translate_old (or qedx). MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines __ __ rx rx __ __ As an example, the command line rx change "^f.{..*}X$" X.fortran [segs **] | will change all two-component names whose first component is "f" to names comprised of the second and succeeding components suffixed by ".fortran". All other names are returned asis. Thus, "f.foo.source" becomes "foo.source.fortran"; but nothing happens to "x.pl1". OPERATION: change_all SYNTAX AS A COMMAND rx change_all regexp change_spec string {strings} SYNTAX AS AN ACTIVE FUNCTION [rx change_all regexp change_spec string {strings}] FUNCTION: applies the regular expression given as its second argument to its fourth and following arguments. Those arguments which the regular expression matches are transformed according to the specifications given as the third argument and returned. This functions returns an error diagnostic if any argument fails to match the regular expression. NOTES: The rules for transforming arguments are the same as for "rx change" (q.v.). Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 __ __ rx rx __ __ OPERATION: change_some SYNTAX AS A COMMAND rx change_some regexp change_spec string {strings} SYNTAX AS AN ACTIVE FUNCTION [rx change_some regexp change_spec string {strings}] FUNCTION: applies the regular expression given as its second argument to its fourth and following arguments. Those arguments which the regular expression matches are transformed according to the specifications given as the third argument and returned. Arguments that the regular expression does not match are returned unaltered. This function returns an error diagnostic if no argument can be matched by the regular expression. NOTES: The rules for transforming arguments are the same as for "rx change" (q.v.). MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines __ __ rx rx __ __ OPERATION: count SYNTAX AS A COMMAND rx count regexp string {strings} SYNTAX AS AN ACTIVE FUNCTION [rx count regexp string {strings}] FUNCTION: applies the regular expression given as its second argument to its third and subsequent arguments. On an argument-by-argument basis, it returns "1" if the regular expression matched at least once in that argument, and "0" otherwise. OPERATION: exclude SYNTAX AS A COMMAND rx exclude regexp string {strings} SYNTAX AS AN ACTIVE FUNCTION [rx exclude regexp string {strings}] FUNCTION: applies the regular expression given as its second argument to its third and subsequent arguments. It returns as its result only those arguments which do not "match" the regular expression. Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 __ __ rx rx __ __ OPERATION: exclude_index SYNTAX AS A COMMAND rx exclude_index regexp string {strings} SYNTAX AS AN ACTIVE FUNCTION [rx exclude_index regexp string {strings}] FUNCTION: applies the regular expression given as its second argument to its third and subsequent arguments. It returns as its result the indices of those arguments which do not "match" the regular expression. The first string in the list is considered to have an index value of 1. OPERATION: include SYNTAX AS A COMMAND rx include regexp string {strings} SYNTAX AS AN ACTIVE FUNCTION [rx include regexp string {strings}] FUNCTION: applies the regular expression given as its second argument to its third and subsequent arguments. It returns as its result only those arguments which "match" the regular expression. MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines __ __ rx rx __ __ OPERATION: include_index SYNTAX AS A COMMAND rx include_index regexp string {strings} SYNTAX AS AN ACTIVE FUNCTION [rx include_index regexp string {strings}] FUNCTION: applies the regular expression given as its second argument to its third and subsequent arguments. It returns as its result the indices of those arguments which "match" the regular expression. The first string to be tested is considered to have an index value of 1. OPERATION: match_count SYNTAX AS A COMMAND rx match_count regexp string {strings} SYNTAX AS AN ACTIVE FUNCTION [rx match_count regexp string {strings}] FUNCTION: applies the regular expression given as its second argument to its third and subsequent arguments. It returns as its result the number of times the regular expression "matched" in each argument on an argument-by-argument basis. Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 __ __ rx rx __ __ OPERATION: test SYNTAX AS A COMMAND rx test regexp string {strings} SYNTAX AS AN ACTIVE FUNCTION [rx test regexp string {strings}] FUNCTION: applies the regular expression given as its first second to all its subsequent arguments. On an argument-by-argument basis, it returns "true" if the pattern defined by the regular expression appears in the given argument, and "false" otherwise. MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines 11. Include Files This section contains the declarations and/or descriptions of several include files available to the caller of rx_. * 11.1. rx_execute_ctl.incl.pl1 This structure describes the parameters which control the pattern match attempt and provide information on what kind of results are desired. dcl 1 rx_execute_ctl aligned based, 2 version char (8) aligned, 2 select aligned, * 3 restriction fixed bin(35) aligned, * 2 flags aligned, 3 match_detail_wanted bit(1) unaligned, | 3 ignore_labels bit(1) unaligned, 3 restriction_is_bound bit(1) unaligned, * 3 restriction_is_index bit(1) unaligned, 3 line_mode bit(1) unaligned, | 3 PAD bit(31) unaligned; | where: version is the identifier for this version of the structure. * restriction is a value which provides for the selection of a subset of all possible matches found in the target string. The interpretation of this value depends upon the setting of the flag variables. * Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 match_detail_wanted is a flag which indicates that the locations and lengths of each match should be placed in a segment and the address of that segment should be returned to the user. In the description of rx_$execute, this is the argument named, match_info_p. ignore_labels when set to "1"b indicates that any labelled substring information provided by compiled code for "{ ... }X" sequences be left out of the match information structure. * restriction_is_bound when set to "1"b indicates that the value of restriction should be taken as an upper bound on the number of matches reported in the match information segment. No more than "restriction" instances of the rx_match_info structure will be placed in the segment. Less than this amount may actually be returned, however. restriction_is_index when set to "1"b indicates that only information on the N-th successful pattern match be returned. line_mode | when set to "1"b indicates that the string to be searched | for patterns is to be treated as if it were made up of a | series of lines separated by newlines, rather than a single | string of characters. The newlines in this case are treated | as delimiters for a sequences of strings. | PAD is unused. MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines 11.2. rx_match_info.incl.pl1 These structures describe the information on where matches have been made in the target string. dcl 1 rx_match_info aligned based, 2 version char (8) aligned, 2 template_count fixed bin (35) aligned, 2 PAD bit (36) aligned, 2 first_rx_match_data ptr aligned; | dcl 1 rx_match_data_template aligned based, 2 link aligned, 3 next ptr aligned, 2 match aligned, 3 start fixed bin(21) aligned, 3 length fixed bin(21) aligned, 2 label_count fixed bin(17) aligned, 2 label aligned, 3 name (0 | refer(rx_match_data_template.label_count)) | char(1) unaligned, 3 start (0 | refer(rx_match_data_template.label_count)) | fixed bin(21) aligned, 3 length (0 | refer(rx_match_data_template.label_count)) | fixed bin(21) aligned; where: version is the identifier for the current version of this structure. template_count is the number of instances of rx_match_data_template which have been placed in the segment. It is included here to make the match data self-describing. PAD is presently unused and should be set to ""b. Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 rx_match_data is the pointer to the first (if there were any matches at | all) of a series of linked structures describing the patterns found. next is a pointer to the next instance of this structure in the match info segment. It has the value null() if this instance is the last in the list. start is the index from the beginning of the string to the start of the match. length is the number of characters which matched the pattern on this attempt. label_count is a count of the number of labelled substrings which were identified during this pattern match attempt. This value may be forced to zero by setting rx_execute_ctl.flags.ignore_labels. name is the character which was specified to be used as the label for this substring. start is the index from the beginning of the target string to the first character of the substring. length is the number of characters in the named substring. NOTES: Match data is returned as a starting location and a length. To clarify the notion of starting location, the following model is used. Given a string of length n, there are n+1 possible starting locations for a string match, numbered as follows: 1. before the first character of the string 2. between the first and second characters of the string 3. between the second and third characters of the string MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines ... n. between the next-to-last and the final characters of the string n+1. after the last character of the string The length of the match is given as the number of characters matched. Thus, a match or labeled substring which is identified as starting at location 2 with a length of 0 (zero) specifies the null string between the first and second characters. Match information is returned in order from the leftmost match to the rightmost. Note, however, that two succeeding instances of rx_match_info may specify the same starting location depending on the regular expression being used(1) for the match attempt. Unlike match attempts which proceed from left to right, the order of production of labeled substring data is implementation dependent. The user is cautioned against making any assumptions about the order in which the labels are returned. ________________________________________ (1) Consider that the expression "^|." will return two instances of rx_match_info starting at location 1 in the string "A". One will have a length of zero, the other will have a length of one. Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 | | 11.3. rx_error_info.incl.pl1 | | Whenever possible, the rx_subroutine reporting an error via a | call to sub_err_ will supply information to assist the user in | locating the source of the error. This takes the form of a | formatted string, and a position withing the string where the | error is. The formatted string is pre-processed so that terminal | dependencies are edited out; e.g. backslashes are doubled for | printing, control and non-ASCII characters are converted to a | "nnn" representation. The error position reflects the error | position within this edited string. | | | A pointer to this structure is supplied as the information | pointer on the call to sub_err_. | | | dcl rx_error_info_string_len fixed bin(21); | | dcl 1 rx_error_info aligned based, | 2 version char(8) aligned, | 2 string aligned, | 3 len fixed bin(21), | 3 error_pos fixed bin(21), | 3 text char (rx_error_info_string_len | refer (rx_error_info.string.len)); | | | where: | | | version | is the identifier for the current version of this structure. | | len | is the length of the edited display string. | | error_pos | is the position, relative to the start of the string, where | the error occurred. When error_pos = 0, the error reported | applies to the entire string. | | text | is the edited display string. | MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines 12. Comparative Timings This section provides some timing comparisons for the functions of translation and execution between the regular expression facility described here and that presently provided by search_file_.(1) | | The timings represented here are those of the software as | described by the first publication of the MTB. They differ | slightly, but not substantially, from those of later revisions. | 12.1. The Test Environment The overall setup of each test is as follows: 1) All target strings were built from randomly generated strings of digits or alphabetic characters. The random number generator "seed" was set at the beginning of the tests to insure repeat results. 2) Only the time used from the beginning of the translation (or execution) to the return from the routine has been accumulated. No processing times for the returned data are included. All times reported are in milliseconds. 3) For each test, the translation routine was called multiple times, and the results averaged. The same was done for the execution routine. In both cases, the first timing run data was discarded. 4) In all tests, the number of strings which were matched was the same by both routines. ________________________________________ (1) The search_file_ interface does not provide for separate translation and execution. Translation was simulated by applying the pattern to a zero-length string; execution by making use of the "last expression" feature. Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 12.1.1. Test 1 Objective: match an entire string of characters. Regular expression: "^.*$" Strings: 1 Length (min): 100 Length (max): 100 Length (average): 100 Total Trials: 25 Total Matches: 25 rx_ search_file_ ratio Compilation 10.100 0.328 30.79 Execution 0.330 0.504 0.65 12.1.2. Test 2 Objective: match the last character in a string of random length. Regular expression: ".$" Strings: 100 Length (min): 1 Length (max): 99 Length (average): 50 Total Trials: 25 Total Matches: 2475 rx_ search_file_ ratio Compilation 8.937 0.406 22.01 Execution 38.512 323.394 0.12 MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines 12.1.3. Test 3 Objective: find all runs of 9's in digit string of random length. Regular expression: "99*" Strings: 100 Length (min): 1 Length (max): 98 Length (average): 50 Total Trials: 25 Total Matches: 11300 rx_ search_file_ ratio Compilation 8.596 0.271 31.72 Execution 100.374 119.587 0.84 12.1.4. Test 4 Objective: find runs of ordered sequences of digits. Regular expression: "0.*1.*2.*3.*4.*5.*6.*7.*8.*9" Strings: 100 Length (min): 1 Length (max): 97 Length (average): 49 Total Trials: 25 Total Matches: 500 rx_ search_file_ ratio Compilation 41.026 1.221 33.60 Execution 1,867.058 2,697.056 0.69 Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 12.1.5. Test 5 Objective: find all strings of random characters ending in "Zz". Regular expression: "Zz$" Strings: 500 Length (min): 1 Length (max): 160 Length (average): 80 Total Trials: 25 Total Matches: 25 rx_ search_file_ ratio Compilation 9.515 0.231 41.19 Execution 120.327 100.171 1.20 12.2. General Commentary on the Tests As should be obvious, the translation times for the new method are significantly longer than those of search_file_. This is primarily due to the increased complexity which the translator must be prepared to handle. Furthermore, there are optimizations in both the lexical scanning and code generation phases which are performed to speed up the execution. This is done on the expectation that translation will occur far less often than execution. Since the storage of the compiled code is under control of the caller, it is possible to provide additional guarantees that this is so by means of an "n"-level LRU cache of expression strings and translated equivalents. In four out of the five tests done, the execution speed of the new method was faster. The exception has been analyzed and it is slower because, again, there is additional mechanism present to support the additional information-return possibilities. MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines 12.3. Comparison Against A Compiled Program Research, using an earlier implementation of a facility like this, was done comparing its performance against PL/I programs specially written for specific purposes. A description of the work was published in the Proceedings of HLSUA Forum XXXIV.(1) The timing studies indicate that, at best, the hand-tailored program offered a 30% performance advantage while taking about 300% longer to write and debug. ________________________________________ (1) April 4-7, 1982; pp. 276-291 Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 13. Miscellaneous Topics This section contains various topics related to the subject of the regular expression facility described in this document which do not logically fit in the exposition of other sections. 13.1. Use of National Characters The facility described in this document employs many characters from the ISO International Alphabet #5 which are designated as "national" characters. In view of the evolving consensus away from the use of such characters in Multics' text processors, a few words about their use here is appropriate. The regular expressions implemented by this facility employ "national" characters for two reasons. First, they are considered as part of the Multics standard character set.(1) Second, it is difficult (if not impossible) to provide the required functionality in such a compact form without the use of these, or other, "national" characters. The author is open to suggestions and comments which will rectify this situation. 13.2. Installing it in the System By use of the entry point rx_$translate_old, this facility could be used in a re-written "search_file_" to provide equivalent functionality through an existing interface. In addition, "search_file_" could also be enhanced to provide a multi-level cache of the "n" most recently used regular expressions, rather than the single level provided now. Users could select which type of processing they wanted "old" vs. "new" via an interface command or through a process-global switch. ________________________________________ (1) See MPM Reference Manual, AG91-03 Appendix A, (in draft form at the time of this writing) which defines the characters of Multics to be those of ANSI X3.4-1968. MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines 13.3. Starname Processing Given the facility described here, it is possible to provide a simple transformation from Multics "starnames" into regular expressions. This would enable a single facility to underlie both mechanisms. An informal set of rules which describe this transformation are listed below. For discussion purposes, the escape character is assumed to be the accent grave character, "`". 1) All occurrences of the escape character are doubled so that they match themselves literally. 2) Any "?" is transformed into "`.". 3) The sequence "**" is changed to be "`.`*". 4) The single "*" occurrence becomes "`[`^.`]`*". 5) All other characters are copied as is. 6) After the pattern is constructed, it is prefixed by "`^" and suffixed by "`$". Of course, this transformation assumes that the starname was well-formed to begin with. However, syntax checking can be incorporated as part of the transformation process. An advantage of this method is that the starname syntax may later be relaxed to allow items like "*foo*" with much less effort than re-doing the present finite-state machine program. | | 13.4. Efficiency Considerations | | Despite attempts to make the translated form of the regular | expression run as quickly as possible, there are still some | things that the writer of the regular expression can do to help | out. This section discusses some of the optimizations that the | translator does and gives some tips on what the user can do to | avoid causing the execution to be slowed. | Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 13.4.1. Translator Optimizations | | In order to provide for efficient pattern matching, the | translator does a number of optimizations (described earlier as | building "clumps"). In this way, it can generate more effective | code than would otherwise be the case if the syntax definition | definition were followed slavishly. The semantics of the | optimization are identical with that in the formal definition, | however. | | | In the following discussion, the escape character is assumed to | be the accent grave (`) character. | | | -- A single character set membership test followed by | indefinite repetition is treated as indefinite repetition | of the character. That is, "`[X`]`*" is translated as | "X`*". | | -- Strings of "`."s are combined into a single test for the | specified number of characters remaining to be matched. | | -- Strings of "`."s followed by a repetition are adjusted to | include the length of the "`."-string less one in the | bounds. For example, "`.`.`.`.`*" becomes "`.`<3:`>"; | and "`.`.`.`.`.`<:6`>" is translated as if it were | "`.`<4:10`>". | | -- Strings of ordinary characters are combined into single | string comparisons. | | -- Set membership ("`[ `]") and set exclusion ("`[`^ `]") | strings are transformed internally into tables suitable | for use with the hardware TCT (Test Character and | Translate) instruction. | | -- Notice is taken of initial character strings which can be | used to find places where the scan can begin in the | target strings. Expressions like "define`|declare" will | only be profitably begun in target strings which have | sequences of the letters "de". | | -- The translator remembers the minimum and maximum lengths | which the pattern needs to succeed and passes this to the | execute phase. The pattern "AB`.`.`*C" needs at least a | four-character target string to succeed. | MTB-660 Multics Technical Bulletin Revision 1 Regular Expression Routines -- Items which are indefinitely repeated, and which are | followed by character string patterns, will automatically | do "backscanning" to find effective places to resume. | For example, in "`.`*AB", the "`.`*" matches the entire | remainder of the string. Then it backs up to the | rightmost occurrence of "AB" before proceeding. This | cuts down on excessive backtracking at run-time. | | -- When building the parse tree, notice is taken of | "anchoring" of the pattern at either end. And this is | propagated to the root node. Thus, "`^ABCDE`|`^VWXYZ" is | just as effective as "`^`(ABCDE`|VWXYZ`)". The same | applies to the use of "`$". | | | 13.4.2. Sequences That Work Against Efficiency | | Despite the attempts at producing efficient run-time code, there | is a limited amount of time and effort that the translator will | spend achieving this goal. While the assumption is that the | translated pattern will be applied many more times than it is | translated, it is also assumed that the source of many of the | strings given to the translator will be a human user. Therefore, | speed in translation also must be considered. The following list | of situations result in slower execution speeds because the | translator is not yet good enough or fast enough to resolve them | into efficient equivalent sequences. | | | As above, the escape character used in the illustrations will be | accent grave. | | | -- Items enclosed in parentheses, which are followed by | repetition forms, are translated into longer, more | general, and therefore slower code sequences. For this | reason, "X`*" is preferable to "`(X`)*" even though they | have equivalent semantics. | | -- The use of alternation on single characters results in | slower execution than set membership tests. Thus, | "`[XYZ`]" executes faster than "X`|Y`|Z". | | -- Recursion is slower than repetition. The pattern, "A`*", | will run faster than "A`&". | | -- In general, recursion upsets the anchoring and target | string length analysis due to the difficulty of doing | efficient flow analysis. Therefore, it executes more | slowly. | Multics Technical Bulletin MTB-660 Regular Expression Routines Revision 1 13.5. Implementation Restrictions | | The following items are general notes about details of the | implementation which restrict in some way the extent of the | translator or execution domains. | | | -- The translator does all of its space allocation in a | single segment area for efficiency. This is not expected | to cause problems since it requires several thousand | "clumps" before this limit is broached. Most users' | regular expressions have on the order of a dozen or less. | | -- There is a limit to the amount of backtracking | (checkpoint) information which can be stored at execution | time. A checkpoint is taken when alternation ("`|" or | "`|`|") is done, when repetition is used ("A`*"), and | when recursion is performed ("A`&") at execution time. | There is room for about 32,000 checkpoints per | pattern-match attempt. | | -- Since the match data is returned in a single segment, | there is a limit to the number of times a given pattern | will be reported as matching a particular target string. | This limit is also affected by the number of "labelled | substrings" which are defined. If all ninty-four label | characters are used in each match attempt, then | approximately 900 matches can be reported back to the | user. If no label characters are used (or if they are | specifically ignored), then slightly more than 52,000 | matches will be reported. | | -- Because of limits on the nature of the data available to | the run-time regarding "long" repetition as in | "`(ABCDE`)`<10:20`>" where the parentheses interfere with | optimization, recursion cannot be used in such | combinations. Thus, the translator will refuse to accept | "`(/`&/`)`<14:16`>". Simpler repetition, as with | "X`<6:7`>, can still be used with recursion, however. | | The exact conditions where this will cause a conflict are | difficult to delineate exactly. Furthermore, this is not | expected to cause most users a problem. |