17.1 Untyped, Slow Regular Expressions

7.4.0.4

17.1 Untyped, Slow Regular Expressions

Our first language design consists of five different kinds of regular expressions that can be combined with a string to perform a match. This section explains the constructs in the new language, discusses an implementation strategy, and offers some test cases.

Your task is to first read and understand the specification, then stub out an implementation so that the provided test cases all run (and, presumably, fail). Then, write 10 new test cases based on your understanding of the specification. Finally, try to work out an implementation that passes all of the tests.

There are a number of shortcomings in this language’s design and implementation and subsequent parts of the exercise address them.

17.1.1 Regular Expression Language Specification

syntax
(r^define regexp-id r)

Binds regexp-id to the regular expression r.

syntax
(r^match r string-expr)

Combines a regular expression (r) with a string, performing a match on the whole string. Each of the different forms of r determine which characters in the string constitute a valid match for the regular expression and what matching returns.

The grammar of regular expressions is as follows:

r	=	(r^range regexp-range-char regexp-range-char)
	\|	(r^or r ...)
	\|	(r^seq r ...)
	\|	(r^* r default-expr)
	\|	regexp-id

regexp-range-char

any literal character except -, [, ], or ^

The five alternatives have the following semantics:

syntax
(r^range regexp-range-char regexp-range-char)

Matches exactly one character that is lexicographically between the first and second argument (inclusive).

For example, (r^range #\a #\d) matches "b" and returns the result #\b and it does not match the string "bb" or the string "z".

syntax
(r^or r ...)

Matches any string that any of the rs match. Returns the value of the left-most one that matches.

For example, (r^or (r^range #\a #\d) (r^range #\p #\r)) matches the string "b" (returning #\b) and also matches the string "p" (returning #\p).

syntax
(r^seq r ...)

Matches a string that matches each of the rs, in order. Returns the results of each of the matches of the rs, in an immutable vector.

For example, (r^seq (r^range #\a #\d) (r^range #\p #\r)) matches the string "bq", returning (vector-immutable #\b #\q). It also matches "ar", returning (vector-immutable #\a #\r).

syntax
(r^* r default-expr)

Matches any string that matches r zero or more times in sequence, returning the value of the last match or, if there are zero matches, returns the value of default-expr.

For example, (r^* (r^range #\a #\c) #\d) matches "ac" (returning #\c), matches "ba", returning #\a, and matches "" returning the default expression’s value, #\d.

It would be nice if r^* returned a list of all of the matches instead of just returning the last one; this improvement is the subject of Improving Repeat to Produce Lists.

syntax
regexp-id

Refers to a variable bound by r^define.

17.1.2 Implementation Strategy Part 1: r^match Compile Time

A good strategy for implementing this language is to break the task into three parts. The first part handles the compile-time aspects of r^match (ignoring r^define), and the second part handles the run-time aspects of r^match and the third part adds r^define.

The remainder of this section discusses the first part. Implementation Strategy Part 2: r^match Run Time discusses the second part and Implementation Strategy Part 3: r^define discusses third part.

At a high-level, the compile-time compilation strategy simply translates each different form of regular expression into its own function call (and the functions and how they work are explained in the nest section). There are a few subtle points about the setup, however, that are worth pointing out.

The regular expression forms are not general-purpose extensions to Racket, meaning they can appear only inside specific positions and cannot be comingled with other expression. For example, this is not syntactically legal:

(r^seq (if (tuesday?)
           (r^range #\a #\b)
           (r^range #\c #\d))
       (r^range #\x #\z))

To achieve this effect, we design a helper, internal macro named to-r^ that both r^define and r^match wrap around their r arguments. It matches on the various regular expression forms and translates each one, recursively pushing itself into the subforms. If to-r^ encounters an expression it does not recognize, it signals an syntax error. In the example above, to-r^ would, in one step translate the above expression into

(r^seq (to-r^ (if (tuesday?)
                  (r^range #\a #\b)
                  (r^range #\c #\d)))
       (to-r^ (r^range #\x #\z)))

and then the first inner occurrence of to-r^ would raise the error.

Still, we need to define r^range, r^or, r^seq, and r^* as macros, for two reasons. First, when one of those expressions appears outside of a r^match (or, later, a r^define), it should be a syntax error. Second, when we introduce a macro definition for them, we provide an anchor for to-r^’s identifier matching and to cooperate properly with, e.g, require prefixing or renaming.

17.1.3 Implementation Strategy Part 2: r^match Run Time

There is more than one way to implement the runtime portion of the regular expression matcher, ranging from subtle details of how to reuse Racket’s regular expressions to implementing a regular expression matcher from first principles. This section discusses one way to do it; feel free to use your own approach. Note that if you do implement your own regular expression matcher, it is easy to get it wrong and also easy to implement a matcher with the right behavior but that has exponential-time behavior where it should not.

The heart of the suggested implementation is the r^ struct, corresponding to the run-time representation of an r. The remainder of this section gives an overview of the fields of r^ and then shows a number of examples to try to illustrate how different example rs correspond to different r^s.

struct
(struct r^ (reg-str func width)
    #:transparent)
  reg-str : string?
  func : (-> (vectorof (or/c string? #f)) natural? any/c)
  width : natural?

Each r^ struct instance corresponds to a specific r expression.

The reg-str field of the r^ struct contains a string that is suitable to pass to regexp to build a Racket regular expression for its corresponding r (with additional parentheses in it so the relevant pieces of the match can be extracted).

The func field is a function that processes the result of Racket’s regexp-match function when the match succeeds. It is called by r^match once the regular expression is complete and a successful match happens. That is, r^match performs a match using reg-str as a regular expression and if the match succeeds, it calls func. One subtle point: r^match must add "^" to the start of the string and "$" to the end (in order to ensure that the match covers the entire string) before passing the string to regexp-match.

The first argument that r^match supplies to func is a modified version of regexp-match’s result. More specifically, when regexp-match succeeds, it returns a list whose first element corresponds to the entire string that matched and the remaining elements correspond to the positions where parentheses appeared in the regular expression. Accordingly, r^match will drop the first element, convert the rest of the elements into a vector (in the same order) and then call func, passing the vector as the first argument. The second argument to func is a natural number indicating which position in the vector corresponds to this regular expression’s results. Initially, this is 0, but if the original r has subexpressions, then the value of the second argument becomes larger to indicate where the starting position of the match occurs.

The width field is the number of open parentheses in reg-str.

Let’s consider a few examples to see how the pieces of this struct work together. The regular expression (r^range #\a #\b) corresponds to this struct:

(r^ "([a-b])"
(λ (strs i) (string-ref (vector-ref strs i) 0))
1)

Accordingly, r^match matches the regular expression (regexp "^([a-b])$") against the given string. If the string is "a", then the match succeeds and r^match passes (vector "a") and 0 to the procedure, returning the result of the procedure as r^match’s own result. If the match fails, r^match just returns #f without invoking the procedure.

The r^or form constructs a regular expression string by combining the regular expression strings of its component regular expressions with |. For the width, it sums the number of parentheses, plus one for the pair it adds around the entire expression. For example, (r^or (r^range #\a #\b) (r^range #\c #\d)) should produce a struct like this:

(r^ "(([a-b])|([c-d]))"
    (let ([r1^ ...]
          [r2^ ...])
      (λ (strs i)
        (define p1 (+ i 1))
        (define p2 (+ i 1 (r^-width r1^)))
        (cond
          [(vector-ref strs p1)
           ((r^-func r1^) strs p1)]
          [(vector-ref strs p2)
           ((r^-func r2^) strs p2)]
          [else #f])))
    3)

The procedure closes over r1^ and r2^, the structs that the arguments to the r^or produced. In your actual implementation, it probably makes the most sense to close over a list of the r^ structs and loop over them. It should take the first successful match from the candidates in the list.

Note that, when regexp-match matches a regular expression with a | and nested parentheses, the corresponding positions in its result may be #f, indicating which alternative matched. For example,

> (regexp-match (regexp "(a)|(b)") "a")
'("a" "a" #f)
> (regexp-match (regexp "(a)|(b)") "b")
'("b" #f "b")

Like r^or, r^seq form similarly constructs a string by concatenating the strings it is given, but this time without adding a | between them; it also adds one to the sum of the widths.

For example, (r^seq (r^range #\a #\b) (r^range #\c #\d)) produces a struct like this:

(r^ "(([a-b])([c-d]))"
    (let ([r1^ ...]
          [r2^ ...])
      (λ (strs i)
        (define p1 (+ i 1))
        (define p2 (+ i 1 (r^-width r1^)))
        (cond
          [(and (vector-ref strs p1)
                (vector-ref strs p2))
           (vector-immutable ((r^-func r1^) strs p1)
                             ((r^-func r2^) strs p2))]
          [else #f])))
    3)

where, as before, the procedure shown here illustrates only how the procedure would work when there are only two regular expression inside the r^seq. The generalization needs to ensure that the match succeeded at all of the positions and then construct an immutable vector with all of the results.

The r^* operator adds a * to the string for its subexpression and also wraps it in parentheses and adds one to the width. For example, (r^* (r^range #\a #\b) #\c) corresponds to this r^ struct:

(r^ "(([a-b])*)"
    (λ (strs i)
      (cond
        [(vector-ref strs (+ i 1))
         (proc strs (+ i 1))]
        [else default]))
    2)

Because the regular expression matcher already returns the value of the last match, the procedure in the func field can simply extract it; it does not need to refer to the func field of the regular expression it contains.

17.1.4 Implementation Strategy Part 3: r^define

There is one more issue to handle: the identifiers that r^define introduces. These identifiers should be allowed only when they appear as an argument to to-r^, so we need a way to recognize them. The best approach is to use define-syntax to bind them and to call syntax-local-value at compile time to look up their values. As it turns out, define-syntax accepts arbitrary values, not just procedures. For example, if we write this code:

> (define-syntax my-favorite-pie "apple pie")
> (define-syntax my-favorite-fruit "blueberry")

Racket is perfectly happy to evaluate it, binding my-favorite-pie and my-favorite-fruit, at compile time, to strings. Of course, if we try to use those identifiers in a normal way, they error:

> (string-append "an " my-favorite-fruit ", please")
eval:6:0: my-favorite-fruit: illegal use of syntax
in: my-favorite-fruit
value at phase 1: "blueberry"

We can, however, still extract the value of the binding, but at compile time (in a macro):

> (define-syntax (favorite-info stx)
    (syntax-parse stx
      [(_ favorite-thing:id)
       (define str (syntax-local-value #'favorite-thing))
       (define str-len (string-length str))
       #`(~a "your favorite thing, "
             #,str
             ", has "
             #,str-len
             " letters")]))
> (favorite-info my-favorite-pie)
"your favorite thing, apple pie, has 9 letters"
> (favorite-info my-favorite-fruit)
"your favorite thing, blueberry, has 9 letters"

Note that a use of favorite-info will, at compile time, determine the length of the favored object, and then expand into code that will, at runtime, concatenate the strings.

In this way, we can piggy-back on the macro system’s handling of variables to communicate information from binding occurrences of variables to their bound occurrences.

Using this idea, we can make r^define bind its given identifier to a struct that holds a secret identifier, as well as expanding into a definition of that identifier, binding it to the result of using to-r^ on the regular expression that is being defined.

Then, when to-r^ encounters an identifier, it can use syntax-local-value to see if the identifier is bound to the struct and, if so, extract the secret identifier and simply expand into it.

Solution

← prev up next →

1	Language-Oriented Programming
2	Macro Expansion
3	Language Extensions via Macros
4	Lab Simple Macros
5	Advanced Racket Macros
6	Lab Syntax Patterns & Classes
7	Lexical Scope, (Un)Hygienic Macros
8	Lab Comfortable Macros
9	Modules, Macros, Languages
10	Lab Languages via Macros
11	More Language Variations
12	Lab Languages and Readers
13	Types and Type Checking
14	Typed Languages with Turnstile
15	Lab My First Turnstile Language
16	Two Techniques
17	Extended Lab Regexps
18	Some Language Gems
19	Closure

17.1	Untyped, Slow Regular Expressions
17.2	A Weak Type System
17.3	A Better Type System
17.4	Better Performance through Smarter Compilation
17.5	Improving Repeat to Produce Lists

17.1.1	Regular Expression Language Specification
17.1.2	Implementation Strategy Part 1: r^match Compile Time
17.1.3	Implementation Strategy Part 2: r^match Run Time
17.1.4	Implementation Strategy Part 3: r^define