Nondeterministic Regular Expression Parsing Library For arc

By Conrad Barski, M.D. Sep 2008
Licensed under the GNU Free Documentation License 1.2

Hi Everyone- I've put together a little regex library for my favorite Lisp dialect, arc lisp. The cool part IMHO is that the regular expressions are matched nondeterministically. This means, the program will speculatively evaluate arbitrary code based on preliminary matches and will "roll back" if it turns out follow-up matches fail. Such nondeterminism is straightforward in languages that support "call with current continuation", but nigh impossible in traditional languages. Check out pg's On Lisp for details on implementing nondeterminism. The only common languages that have this ability are arc and scheme, a close relative to arc. This code runs as a standard library on top of the standard pg arc implementation

Another cool thing about this library is that it has what I would argue is the "canonical" arc style. This style includes:

Simplicity- The most confusing part in using regexes, in my opinion, is the process of variable extraction. This regex library uses a new concept for extracting matched fragments. (Since nothing is new in CS anymore I'm sure someone has done this before- But it's new to me, at least.) The source code is also very simple (Well, at least given the complexity of the problem- I had to write part of it in CPS style and implement a lazy list system for backtracking)
Extreme terseness- Arc code should have as little visual noise as possible. While this library isn't as terse as perl for string matching, it doesn't require special syntax baked into the language. Nonetheless, when used for non-trivial matching tasks, I'm hoping it will still give perl a run for its money for readability as well as terseness. This is because it integrates elegantly into other arc features and the flow of the code is easy to understand.
Onion Free- This library has all the major "winning" regex features but is not meant to be fully compliant to any standard. The goal is, instead, to leave out any crufty historical features of regexes that were added to solve problems that can now be solved in better ways.

Not much focus was put towards performance in this system. From an "O notation" standpoint it should however, I believe, be equivalent to any backtracking regex implementation, such as in Perl 5. It seems likely, given Paul Graham's comments on the arc language, that arc lisp will soon move towards implementing strings as character lists. Because of this (and because of the ease of list manipulation in arc) this library converts any string into a character list before the match begins, requiring some extra cycles/memory.

Supported Regex Features

This library supports '^', '$', '.', '*', '+', '|', delimited repetition '{}', character classes '[]' with negation and character ranges. Shorthand classes '\d' '\s' and '\w' are supported. It also supports parentheses for grouping, but not for extraction. Extraction is handled by a different mechanism described below. It performs regexes in "multiline" regex mode. Single line mode I am treating as an onion for now. Most other esoteric regex functionality I believe can be duplicated by the new features described below.

Examples

To perform a simple match, you need to first declare the target string, using the w/target command. This commands binds the target string into a special regex buffer- You can kind of think of this buffer as being like "stdin". The body of w/target defines the lifespan of binding. Then, you can use re command to match a regular expression fragment against the target. The re commands simply returns the part of the target that matched the expression (using the standard "eager" regex evaluation behavior)

arc> (w/target "iiiix2yiiii" 
               (re "x.y"))
"x2y"

The w/target command, like any standard lisp progn form, simply returns the final child form value- In this case, the result of the expression match.

Things get more interesting when we use re several times inside of the body of the w/target command:

arc> (w/target "iiixi3ix2iii"
               (re "x")
               (re "\\d"))
"2"

What we did here is search for an 'x' followed by a digit. As you can see, the two expressions are matched as a unit, meaning that the earlier digit '3' is not matched, since it is not preceded by an 'x'. It also did not fail when an earlier 'x' was encountered. In order to do this, the program speculatively evaluated the second re, then rolled back when it failed against the location with the first 'x'. We can see this by putting print statements in the code:

arc> (w/target "iiixi3ix2iii"
               (re "x")
               (prn "I'm here!")
               (re "\\d")
               (prn "No, I'm here!"))
I'm here!
I'm here!
No, I'm here!

As you can see, it got past the first regex match twice, but only once fell through the end of the function. The first match led to a failure "downstream" and was therefore aborted early. As is usual in implementations of nondeterminism, you'll usually want to write the code inside of a w/target statement to be functional without any side effects, at least until you get comfortable with the strange control flow.

Since the code inside of the w/target body can be any arbitrary arc logic, you can use any standard method for extracting out your data. In this example, we just put the three matched fragments into a list:

arc> (w/target "axaaxxxxaaxxabtbbttttbbttb"
               (list (re "ax{2,3}a")
                     (re ".*")
                     (re "bt{2,3}b")))
("axxa" "btbbttttb" "bttb")

Here, we've assigned the res to some variables:

arc> (w/target "axaa<foo>xxx/bar/xaax"
               (with (a (re "<")
                      b (re "[^>]*")
                      c (re ">.*/")
                      d (re "[^/]*")
                      e (re "/"))
                     (prn "b is " b " and d is " d)))
b is foo and d is bar

If you want to match multiple times against the string, you can use the command draintarget, instead. This function works analogously to the arc drain command, building a list of results until the target is drained of data:

arc> (draintarget "iiiiiaXaiiiiiaYaiiii"
                  (re "a.a"))
("aXa" "aYa")

I'm hoping you will agree that the features described here will let you integrate regex matching elegantly into any arc code you may need to write, assuming you have the good fortune to need to write some arc code.

Fancy Examples

Finally, let me show you some more complex examples that show you how to get the most from this library. Here, we build a list of arc person objects using email addresses extracted from the target. (not a comprehensive email regex, of course) Notice also that, in this example, the re is in a completely different function. This illustrates that there is no sleight of hand in this library: You can "smear" the expression parsing across any arbitrary set of functions- Nothing would stop you from writing a library of custom parsers you can reuse (Note this would let you create something reminiscent to Haskell's Parsec, if this is interesting to you...)

arc> (= text "asdweotwerw 2weroq bob.smith@google.com sdfa
              jsad;fj lisa.north@yahoo.com weqporiuasf9")
     (def name ()
        (re "[a-z]+"))
     (draintarget text
           (w/table p
                    (re " ")
                    (= p!first (name))
                    (re "\\.")
                    (= p!last (name))
                    (re "@")
                    (= p!domain (name))
                    (re "\\.")
                    (= p!tld (name))
                    (re " ")))
(#hash((tld . "com") (domain . "google") (last . "smith") (first . "bob")) 
 #hash((tld . "com") (domain . "yahoo") (last . "north") (first . "lisa")))

Here we do something that may be impossible in a regex, possibly even in perl 5. I'm no perl expert though, so I can't say for sure: We are searching a string for words that are inverses of each other, also called "anadromes". We can do this, because we can use the results of earlier regex fragments to build future ones, dynamically:

arc> (def revstr (str)
          (string:rev:charlist str))
#<procedure: revstr>
arc> (w/target "zebra\nhorse\nstage\ndonkey\negats\ngod\nmouse"
               (withs (a (re "^[a-z]*$")
                       x (re ".*")
                       b (re (+ "^" revstr.a "$")))
                      (prn a " is an anadrome of " b)))
stage is an anadrome of egats

As a final example, here we parse numbers out of a string and skip over odd numbers. We can accomplish this by using an additional function called re-fail which explicitly declares that a failure has occurred and that a rollback needs to be initiated. Note also how the code mindbendingly checks for the ending bracket after checking for "odd"ness- This works just fine, but you could of course reverse the checks for greater efficiency.

arc> (draintarget "<250> <232> <251> <23> <66> <33>"
                  (re "<")
                  (let a (read:re "\\d+")
                    (unless odd.a 
                      (re-fail))
                    (re ">")
                    a))
(251 23 33)

The unit tests for my library, so far, are pretty limited. Please feel free to send me any failed test cases you may find- This is just a little library I've built quickly and for fun, so expect that there are probably still many bugs to shake out.

Download

re.arc

re-tests.arc

Attention Lispers! Keep an eye out for my Common Lisp textbook/comicbook "Land of Lisp", which will be published in early Spring by "No Starch Press"!