17.1 Untyped, Slow Regular Expressions
Our first language design consists of five different kinds of regular expressions that can be combined with a string to perform a match. This section explains the constructs in the new language, discusses an implementation strategy, and offers some test cases.
Your task is to first read and understand the specification, then stub out an implementation so that the provided test cases all run (and, presumably, fail). Then, write 10 new test cases based on your understanding of the specification. Finally, try to work out an implementation that passes all of the tests.
There are a number of shortcomings in this language’s design and implementation and subsequent parts of the exercise address them.
17.1.1 Regular Expression Language Specification
syntax
(r^define regexp-id r)
syntax
(r^match r string-expr)
The grammar of regular expressions is as follows:
r | = | (r^range regexp-range-char regexp-range-char) | ||
| | (r^or r ...) | |||
| | (r^seq r ...) | |||
| | (r^* r default-expr) | |||
| | regexp-id |
regexp-range-char | = | any literal character except -, [, ], or ^ |
The five alternatives have the following semantics:
syntax
(r^range regexp-range-char regexp-range-char)
For example, (r^range #\a #\d) matches "b" and returns the result #\b and it does not match the string "bb" or the string "z".
syntax
(r^or r ...)
For example, (r^or (r^range #\a #\d) (r^range #\p #\r)) matches the string "b" (returning #\b) and also matches the string "p" (returning #\p).
syntax
(r^seq r ...)
For example, (r^seq (r^range #\a #\d) (r^range #\p #\r)) matches the string "bq", returning (vector-immutable #\b #\q). It also matches "ar", returning (vector-immutable #\a #\r).
syntax
(r^* r default-expr)
For example, (r^* (r^range #\a #\c) #\d) matches "ac" (returning #\c), matches "ba", returning #\a, and matches "" returning the default expression’s value, #\d.
It would be nice if r^* returned a list of all of the matches instead of just returning the last one; this improvement is the subject of Improving Repeat to Produce Lists.
syntax
regexp-id
17.1.2 Implementation Strategy Part 1: r^match Compile Time
A good strategy for implementing this language is to break the task into three parts. The first part handles the compile-time aspects of r^match (ignoring r^define), and the second part handles the run-time aspects of r^match and the third part adds r^define.
The remainder of this section discusses the first part. Implementation Strategy Part 2: r^match Run Time discusses the second part and Implementation Strategy Part 3: r^define discusses third part.
At a high-level, the compile-time compilation strategy simply translates each different form of regular expression into its own function call (and the functions and how they work are explained in the nest section). There are a few subtle points about the setup, however, that are worth pointing out.
Still, we need to define r^range, r^or, r^seq, and r^* as macros, for two reasons. First, when one of those expressions appears outside of a r^match (or, later, a r^define), it should be a syntax error. Second, when we introduce a macro definition for them, we provide an anchor for to-r^’s identifier matching and to cooperate properly with, e.g, require prefixing or renaming.
17.1.3 Implementation Strategy Part 2: r^match Run Time
There is more than one way to implement the runtime portion of the regular expression matcher, ranging from subtle details of how to reuse Racket’s regular expressions to implementing a regular expression matcher from first principles. This section discusses one way to do it; feel free to use your own approach. Note that if you do implement your own regular expression matcher, it is easy to get it wrong and also easy to implement a matcher with the right behavior but that has exponential-time behavior where it should not.
The heart of the suggested implementation is the r^ struct, corresponding to the run-time representation of an r. The remainder of this section gives an overview of the fields of r^ and then shows a number of examples to try to illustrate how different example rs correspond to different r^s.
struct
(struct r^ (reg-str func width) #:transparent) reg-str : string? func : (-> (vectorof (or/c string? #f)) natural? any/c) width : natural?
The reg-str field of the r^ struct contains a string that is suitable to pass to regexp to build a Racket regular expression for its corresponding r (with additional parentheses in it so the relevant pieces of the match can be extracted).
The func field is a function that processes the result of Racket’s regexp-match function when the match succeeds. It is called by r^match once the regular expression is complete and a successful match happens. That is, r^match performs a match using reg-str as a regular expression and if the match succeeds, it calls func. One subtle point: r^match must add "^" to the start of the string and "$" to the end (in order to ensure that the match covers the entire string) before passing the string to regexp-match.
The first argument that r^match supplies to func is a modified version of regexp-match’s result. More specifically, when regexp-match succeeds, it returns a list whose first element corresponds to the entire string that matched and the remaining elements correspond to the positions where parentheses appeared in the regular expression. Accordingly, r^match will drop the first element, convert the rest of the elements into a vector (in the same order) and then call func, passing the vector as the first argument. The second argument to func is a natural number indicating which position in the vector corresponds to this regular expression’s results. Initially, this is 0, but if the original r has subexpressions, then the value of the second argument becomes larger to indicate where the starting position of the match occurs.
The width field is the number of open parentheses in reg-str.
(r^ "([a-b])" (λ (strs i) (string-ref (vector-ref strs i) 0)) 1)
(r^ "(([a-b])|([c-d]))" (let ([r1^ ...] [r2^ ...]) (λ (strs i) (define p1 (+ i 1)) (define p2 (+ i 1 (r^-width r1^))) (cond [(vector-ref strs p1) ((r^-func r1^) strs p1)] [(vector-ref strs p2) ((r^-func r2^) strs p2)] [else #f]))) 3)
> (regexp-match (regexp "(a)|(b)") "a") '("a" "a" #f)
> (regexp-match (regexp "(a)|(b)") "b") '("b" #f "b")
Like r^or, r^seq form similarly constructs a string by concatenating the strings it is given, but this time without adding a | between them; it also adds one to the sum of the widths.
(r^ "(([a-b])([c-d]))" (let ([r1^ ...] [r2^ ...]) (λ (strs i) (define p1 (+ i 1)) (define p2 (+ i 1 (r^-width r1^))) (cond [(and (vector-ref strs p1) (vector-ref strs p2)) (vector-immutable ((r^-func r1^) strs p1) ((r^-func r2^) strs p2))] [else #f]))) 3)
(r^ "(([a-b])*)" (λ (strs i) (cond [(vector-ref strs (+ i 1)) (proc strs (+ i 1))] [else default])) 2)
17.1.4 Implementation Strategy Part 3: r^define
> (define-syntax my-favorite-pie "apple pie") > (define-syntax my-favorite-fruit "blueberry")
> (string-append "an " my-favorite-fruit ", please") eval:6:0: my-favorite-fruit: illegal use of syntax
in: my-favorite-fruit
value at phase 1: "blueberry"
> (define-syntax (favorite-info stx) (syntax-parse stx [(_ favorite-thing:id) (define str (syntax-local-value #'favorite-thing)) (define str-len (string-length str)) #`(~a "your favorite thing, " #,str ", has " #,str-len " letters")])) > (favorite-info my-favorite-pie) "your favorite thing, apple pie, has 9 letters"
> (favorite-info my-favorite-fruit) "your favorite thing, blueberry, has 9 letters"
In this way, we can piggy-back on the macro system’s handling of variables to communicate information from binding occurrences of variables to their bound occurrences.
Using this idea, we can make r^define bind its given identifier to a struct that holds a secret identifier, as well as expanding into a definition of that identifier, binding it to the result of using to-r^ on the regular expression that is being defined.
Then, when to-r^ encounters an identifier, it can use syntax-local-value to see if the identifier is bound to the struct and, if so, extract the secret identifier and simply expand into it.