By the time we wrapped up this project, Intel had finally released Hyperscan as open source. Hyperscan implements these and other transformation tricks, not to mention the SIMD optimizations.
However, Hyperscan doesn't have as strong PCRE compatibility as what we ended up with--effectively 100%--and would have necessitated keeping libpcre as a failover. Hyperscan is substantially faster than Ragel for small-to-medium length expressions thanks to its SIMD optimizations, but with huge, unioned expressions (larger than what Hyperscan can handle) Ragel-generated code is on par.
Starting from scratch today, and presuming Hyperscan is still well maintained, it would be most practical to build a solution around Hyperscan--especially if you need capturing, as capturing expressions can't be unioned effectively. Ragel, however, makes a ton of sense for many tasks other than merely speeding up PCRE matching.
What makes Ragel unique is:
1) The way it integrates with the host language (C, C++, Java, etc). It allows executing host-language expressions at state machine transitions. It doesn't directly tie into the host language typing model (e.g. like Boost's template-based regular expressions), but for various reasons it's hardly a liability and often an asset.
2) It allows yielding and resumption of matching at any and all input stream boundaries, with runtime machine state being a simple integer. (The application keeps any additional state however it wants.) Most regular expression tools require a complete string to match against, while a Ragel machine can be iteratively applied to an infinite stream of data. That capability wasn't useful in this scenario, but it's extremely useful for concisely, performantly, and securely implementing network protocols, for example. In a low-level language like C, and especially with asynchronous I/O, this can drastically simplify the task of I/O and buffer management. You could parse (if you wanted) an entire HTTP stream in a single logical pass, without having to deal with line buffering, header continuations, etc. as separate concerns in your code, greatly simplifying networking code. In some sense it can be thought of as a complement, if not an alternative, to callback schemes, futures/promises, or green threading (a la Go).
Interesting. I should point out that Hyperscan is not abandonware; it is still maintained by Intel.
Streaming is not unique to Ragel.
You're not wrong about libpcre compatibility. We have very tight syntactic compatibility with libpcre (that is, we won't misidentify an unsupported construct and supply some erroneous semantics) but we make no secret of our inability to handle general lookarounds and backreferences.
I'm interested in how you went about handling backreferences in Ragel. They have been in our 'too hard' basket for years, although many instances of backrefs are tractable in our framework. It's always seemed easier to not do any of them rather than handle just a few.
Ragel certainly ties into the host language differently, and more tightly, than Hyperscan. We use it ourselves, to lex regexes on the way into Hyperscan, as it happens.
As I alluded to earlier, the 100% (possibly +/- edge cases) solution was finished primarily by a contractor.[1] So the following only reflects the state of things while my Lua parser, analyzer, and transformation code was still in use. Also, my grasp of this stuff is nowhere near as strong as yours or the contractor's. I'm a pretender ;)
Backreferences are implemented with prefilters, NFA machines, and what are effectively zero-width assertions. It's not sexy, and the rest is easy to sketch out from there as a conceptual matter. What's more interesting are the Ragel-specific details.
Before we tackled backreferences we had a problem with unacceptable state blow-up from unioned DFAs, repetitions, and semantic conditions, and especially from their interactions. This was the biggest barrier for most of the final 10%. To help solve it the contractor added an NFA construct to Ragel, along with the magic needed for the generated state machines to switch from DFA to NFA mode mid-expression. The NFA construct expands on the notion of embedded host-language code in transition actions and semantic conditions, allowing inline C code to track and control the matching behavior of each subexpression, as well as to handle bookkeeping, including state-stack management.
Most simple captures can be accomplished using Ragel's regular action features. But IIRC even captures that could be done from a DFA machine would sometimes blow up state too much, so I think the captures necessary for backreferences are always done in NFA mode. Similarly, the code is quick to fall back to NFA mode even for expressions without backreferences, as that simplified grouping some of the complex expressions into secondary unions, which allowed compiling their prefixes as a DFA--as opposed to immediately falling back to prefilters and a linear match of each expression that couldn't be added to the primary or secondary unions.
Matching of backreferences is implemented similarly to lookarounds. Lookarounds, at least in my early code, are implemented using semantic conditions. In my later code I think lookbehinds still had to be a fixed size, but it looks like (bugs notwithstanding) backreferences could be variable sized. The Ragel match is anchored on the empty string at the relevant position using the NFA construct, auxiliary code handles matching against the capture, and on a match the `p' index variable is updated to skip over the length of the match. The NFA construct saves and restores `p' as it trials each subexpression. Of course this approach doesn't work well in streaming mode. As with Ragel scanners, I think the host application could theoretically arrange for windowing, but I don't know if everything is in place to make that feasible. In our case we didn't need streaming.
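The backreference check described above can be sketched as a small C helper. The names here (`capture`, `backref_check`) are illustrative, not Ragel's actual generated interface: anchored at the current position as a zero-width test, auxiliary code compares the input against the earlier capture, and on success `p' is advanced past the matched length (on failure the NFA trial restores its saved `p' and abandons this branch).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical capture record: a span of the input seen earlier. */
typedef struct {
    const char *start;
    size_t len;
} capture;

/* Zero-width-style backreference check at offset `p'.
 * Returns the advanced offset on success, or -1 if this branch fails. */
long backref_check(const char *input, size_t input_len, size_t p,
                   const capture *cap)
{
    if (p + cap->len > input_len)
        return -1;                      /* not enough input left */
    if (memcmp(input + p, cap->start, cap->len) != 0)
        return -1;                      /* capture doesn't repeat here */
    return (long)(p + cap->len);        /* skip over the matched length */
}
```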
And at least in my code, anything with backreferences automatically gets a prefilter and is only run conditionally in a second phase. Presumably that's so we don't need to do captures for stuff that will never match, whether or not it involves an NFA. I'm sure that's how it's still implemented.
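That two-phase scheme looks roughly like the following sketch. The `strstr` literal prefilter and the function names are assumptions for illustration, not the project's actual code; the point is only that the expensive, capture-tracking matcher runs solely on inputs that pass a cheap first pass.

```c
#include <stdbool.h>
#include <string.h>

/* Signature of the expensive second-phase matcher (captures, NFA, etc). */
typedef bool (*expensive_match_fn)(const char *input);

/* Phase 1: a literal that any true match must contain; phase 2: the
 * full backreference-capable matcher, paid for only on candidates. */
bool match_with_prefilter(const char *input, const char *required_literal,
                          expensive_match_fn full_match)
{
    if (strstr(input, required_literal) == NULL)
        return false;       /* never do capture work on sure misses */
    return full_match(input);
}

/* Demo "expensive" matcher standing in for a backreference pattern:
 * matches strings whose first half repeats as the second half. */
bool demo_full_match(const char *input)
{
    size_t n = strlen(input);
    return n > 0 && n % 2 == 0 && memcmp(input, input + n / 2, n / 2) == 0;
}
```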
Regarding streaming, I totally forgot about Hyperscan's streaming mode. Streaming isn't something that (IME) can be readily bolted on after the fact, so it's a testament to its careful design and forethought.
But in C, especially in a complex networking application doing asynchronous I/O, callbacks are so extremely costly--they break code locality and compound the normal headaches with ownership and lifetime. Managing that complexity is of utmost importance in a language like C. Using Hyperscan in streaming mode, especially in combination with DPDK... I imagine the project failure rate is really high. I know of at least one skillful team that bombed precisely as I predicted, because of the excessive complexity and the resulting inability to stabilize the product. The callback/push/interrupt model is too difficult in C. In languages with lambdas, Hyperscan's streaming mode has no handicap. But especially in C, Ragel's idiosyncratic host-language interface, with its embedded action handling and minimal state, is just so incredibly important for reducing complexity and for sticking as closely as possible to a pull model for data consumption without sacrificing performance. If developers spend most of their time working on and fixing the I/O code and related interface boundaries (as is typical), there's much less time for implementing the functional features that matter. I don't doubt people have built a ton of amazing things with DPDK and Hyperscan's streaming mode, but the teams that can do that _correctly_ are few and far between. Honestly, I wouldn't want a solution like that protecting my data unless I knew the team had the requisite skill.
On second thought... I should reel that criticism in a little bit. ;) For just matching and cataloging "hits", rather than performing complex operations on them, it's much less of a problem. The costs will be a function of how much work is done on the other side of the callback. But it's just so much more difficult to balance efficiency and complexity, and to constrain complexity in general. I've been writing asynchronous I/O services since before libevent, and I've just become constitutionally averse to callbacks. I've never used libevent's buffering API because, before it even debuted, I had internalized how poorly suited that sort of interface is for composition. Callbacks are necessary--e.g. at the interface boundary of the event loop--but I try hard to avoid them when possible, including switching to languages other than C when possible. It's just a huge consideration for me, perhaps unreasonably. :)
Also, I understand that Hyperscan's callback interface is largely dictated by other considerations, like the perfectly sensible choice for a runtime library interface. In streaming mode a generic queuing and retrieval interface would be awkward and suboptimal for most cases, and there's no good substitute for pushing that implementation detail to the application. DPDK, OTOH, wasn't forced to adopt an asynchronous interrupt model, it was just the easiest.
[1] I don't like to mention names of people not participating in a thread, even though in this case I'm sure you're familiar and he probably would appreciate the plug. You may have conversed with him at some point because your SIMD work is awesome, Hyperscan in general is amazing, and when it was finally open sourced we had to take a really hard look at adopting it or, alternatively, whether we should pursue SIMD optimizations on our own. And IIRC at some point he may have chatted with someone on the Hyperscan team.
Thanks for the kind words about Hyperscan. If you'd like to talk more, please get in contact (via the list, via our support alias, or our emails if you have them already).
I'm curious about the design issues w.r.t callbacks. We certainly haven't had complaints from most of our users about this, but maybe there would have been better models. Note that callbacks are partly a design pattern for our convenience - we need those big, complex stack frames for many of our engines. It's nice to throw a match out of the middle of a big unrolled loop and know that we can come back to the middle of the loop - not into the tedious 'preconditioning and warmup' part of the loop.
An alternate design is to fill a buffer of matches, but that has its own issues, especially with timeliness. Some users want to stop matching quickly after the first match - a problem if we go off and scan another megabyte to get the second through Nth matches.
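The tradeoff between the two reporting styles can be made concrete with a toy sketch. Neither interface below is Hyperscan's real API; the names and the single-character "pattern" are stand-ins. The callback style can halt the scan the instant the first match is seen, while the fill-a-buffer style finishes its scanning work before the caller sees any hit--exactly the timeliness problem described above.

```c
#include <stddef.h>

/* Callback style: returning nonzero from the callback halts the scan,
 * so a stop-after-first-match caller pays for no extra scanning. */
typedef int (*on_match_fn)(size_t offset, void *ctx);

int scan_callback(const char *input, size_t n, char needle,
                  on_match_fn cb, void *ctx)
{
    for (size_t i = 0; i < n; i++)
        if (input[i] == needle && cb(i, ctx))
            return 1;               /* caller asked to stop early */
    return 0;
}

/* Demo callback that records the first offset and halts the scan. */
int stop_at_first(size_t offset, void *ctx)
{
    *(size_t *)ctx = offset;
    return 1;
}

/* Buffered style: all matching happens before the caller sees hit #1. */
size_t scan_buffered(const char *input, size_t n, char needle,
                     size_t *offsets, size_t cap)
{
    size_t count = 0;
    for (size_t i = 0; i < n && count < cap; i++)
        if (input[i] == needle)
            offsets[count++] = i;
    return count;
}
```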
In any case, we're always interested to hear about cases where we aren't the best solution, as this is always the most productive type of feedback to get. "Your library is great and you guys are really smart" is gratifying, but it doesn't lend itself well to continuous improvement.
> It allows yielding and resumption of matching at any and all input stream boundaries [whereas] Most regular expression tools require a complete string to match against
Yes! This is how I've always implemented from-scratch pattern matchers, even for toy systems. To be fair, about 75% of it comes down to it being the only way that feels natural. But on the other hand, even if it's something you don't need now, the alternative all-at-once approach is so inflexible that, should you ever come to need it, it's basically impossible to refactor an existing implementation to have this kind of orthogonally-cutting "re-entrancy". So you'll have to either throw out the whole implementation and start over, or try to ignore it and be forced to work around it as you bump against it time and time again, as punishment for not doing it right the first time.