We had a question about our code names, techniques, etc:
---------- Forwarded message ----------
From: Paul Wankadia < ... >
Date: Wed, Sep 28, 2016 at 8:34 PM
Subject: Hyperscan glossary
To: hyperscan@lists.01.org
Hi, Hyperscan folks.
Is there a glossary of the various codenames (analyses, engines et cetera) used within
Hyperscan? I've just encountered
src/grey.cpp<https://github.com/01org/hyperscan/blob/master/src/grey.c... and
I'm honestly bewildered. Bonus points for links to papers.
I'm also asking because
http://halcyon.usc.edu/~pk/prasannawebsite/papers/infocom2011edward.pdf sounds like
something that you might very well have implemented, but I can't tell either way. :)
Thanks!
OK. So this is a good question, and we don't have an 'implementer's guide'
to match our 'user's guide'. Honest bewilderment is really the mildest emotion
that should grip anyone looking at the grey box flags, although we are not as exuberant as
we used to be. For example, we no longer have a grey box flag entitled
"hamsterAccelerateSideways", and the subsystems codenamed
"Vomitsauce", "Infectoid" and "Puggsley" are nowhere to be
found in trunk.
We don't do the thing in that paper, but we have some similar techniques and have
considered something vaguely related.
A bit of background: we spent several years working quietly on this stuff (pretty much
from 2007 to late 2015). Up until October 2013 we were a startup company, single-mindedly
focused on getting to commercial viability. Thus the first few years were spent iterating
very rapidly on the code base and frantically implementing every idea we had. We certainly
weren't well off enough (nor inclined) to patent everything in sight, and we
didn't have time to write papers. Frankly, we've had more ideas than we have time
to implement them and haven't really spent that much time staying up with the
literature, which in any case is often not concerned with practical regex implementation
on the scale or speed at which we operate. This is OK; many of the corner cases we deal
with would make terrible academic papers. Also, some of our techniques might be best
described as "cures for self-inflicted injuries"; if you are smarter than we
are, you might be able to avoid the injury in the first place. Hyperscan was largely
developed under continuous commercial pressure; I doubt a clean sheet design would do
things exactly the same way.
The other problem with the literature (for our purposes) is that we moved to a model
where we decomposed regular expressions using literals quite early, and we've always
had mechanisms to run many different engine types. As soon as you do that, you're no
longer worrying nearly as hard about how to torture DFAs into taking hundreds of
patterns, or to handle dot-star constructs or counters, etc. These are still interesting
topics, but we tend to want to have simpler, faster engines, rather than building the One
True All Purpose Regex Mechanism. Of course, if someone smarter than us comes along and
builds that mechanism and does better with it, we'll be impressed.
Our design decisions are also dictated by a number of hard requirements, including
fixed-size stream states, a requirement to do the run-time in C (not C++), and no dynamic
allocation permitted at run-time aside from per-stream stream states and per-context
scratch space.
So - many of the ideas in Hyperscan are sui generis and unpublished. As they are open
source, the world is welcome to them. We would be very happy if at least some people
outside of Intel put the effort in to contribute to Hyperscan; part of the reason that
it's complex is that it is a mature project that works on a lot of cases. If you
don't know how or where to start with a contribution to Hyperscan, contact us, and
we'll help. Really. We have a big list of projects, a good topic for another time.
Failing contributions, we would at least like to be benchmarked against. It doesn't
have to be good news. Bad news ("your library stinks and this is why") has
traditionally been more productive for us.
We'd also like to hear about applications, especially outside our traditional network
security areas.
Anyhow, in the interest of getting started, it makes sense to do a data dump of code
names, techniques, etc. It's expected that you've read the user guide. This is
intended for people already working in the area, and may as well come with a giant
"You Are Not Expected To Understand This" comment banner. Knowing what these
names mean might at least make reading the source slightly easier. Unless you implement
regular expressions using automata of some kind, I doubt this will be very illuminating.
I will explain a few of our code naming schemes briefly (presidents, generals, flowers,
etc.), although you may be none the wiser for having seen them, given that we tend to
follow them capriciously.
Yes, it would be better if all this came with lots of nice diagrams, examples, references,
and so on, but the effort to properly document this is enormous. We may be able to
publish some internal documentation over time now that we're open source but it should
be reviewed to see how much of it still describes the current state of affairs.
* NFA Graph: we generally use Glushkov NFAs as both an internal representation for
regexes (we compile away the textual regexes pretty early) as well as an implementation
strategy. Glushkov NFAs are nice because they allow independent computation of
'reachability' (which states can I get to on a given character) and
'successor' (which states can I get to from the current set of states that I'm
in) properties and the intersection of the two is your new state. The Glushkov NFA
construction dates from 1961, and IMO is an improvement on its predecessors as well as its
successors.
o We have a lot of graph analyses designed to simplify and reduce NFA graphs or discover
useful things about them. Probably beyond the scope of this already-enormous document.
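To make the reachability/successor split concrete, here is a minimal bit-parallel sketch for a hand-built Glushkov NFA of /ab*c/ (anchored match); the regex, the position numbering and the bitmask layout are illustrative choices for the example, not Hyperscan's actual representation.

```python
# Hand-built Glushkov NFA for /ab*c/ (anchored). Positions 1..3 are the
# literal characters of the regex: 1='a', 2='b', 3='c'.
reach = {'a': 0b001, 'b': 0b010, 'c': 0b100}  # positions labelled with each char
first = 0b001                                  # positions reachable from the start
follow = {0: 0b110, 1: 0b110, 2: 0b000}        # bit i -> follow set of position i+1
final = 0b100                                  # accepting positions

def successor(states):
    # Union of the follow sets of all currently-live positions.
    succ, i = 0, 0
    while states:
        if states & 1:
            succ |= follow[i]
        states >>= 1
        i += 1
    return succ

def matches(text):
    states = 0
    for i, ch in enumerate(text):
        # New state = (where we could go) AND (positions matching this char).
        succ = first if i == 0 else successor(states)
        states = succ & reach.get(ch, 0)
    return bool(states & final)
```

The point is that the two properties are computed independently - successor from the current state set, reachability from the input character alone - and simply intersected, which is what makes the construction pleasant to implement bit-parallel.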
* Rose: our overall orchestration layer. This is a big piece of glue code that runs
the literal matchers, regex engines (which we occasionally refer to, generically, as NFA
engines, even when they aren't really NFAs - this is a reference to the fact that they
typically simulate an NFA), etc. Violet is a new part of Rose for picking better literals
as factors of expressions. Generally orchestration layer things are named after flowers.
o Rose generally splits up large NFA graphs with (mostly) literals separating them. We
still have a limited notion of role in Rose, which is essentially a position between
literals, but the special-purpose Rose interpreter (not a general purpose interpreted
language) has subsumed most of the data structures we used to have here. It is best to
think of Rose as a big graph composed of literals and regex engines between the literals,
although the ability to merge weakly related regex engines for efficiency as well as use
literals in multiple places makes this a pretty awkward metaphor.
o Engines have engine types: so abstractly we call them NFA engines but in practice they
could be many things: an NFA, a DFA of various kinds, or some other specialized engines.
These implement regular expression portions that we don't want to (or can't) split
any further.
o We score literals in our NFA graph (longer literals have a better, i.e. lower, score)
and treat the problem as a 'network flow' (netflow) problem - what's the best
set of literals that split our start states from accept states? Well, run a network flow
through the graph (from start states as sources to accept states as sinks) and find out
your min-cut! I am more pleased with this trick than I should be, given some obvious
shortcomings. Someone more skilled in abstract algebra could likely help with these
issues.
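To make the netflow trick concrete, here is a toy max-flow/min-cut over a hypothetical four-node graph in which each edge is a candidate literal and its capacity is its score (long literal = low score); the graph, the literals and the scores are all made up for the example, and the real compiler works on the NFA graph itself.

```python
from collections import defaultdict, deque

def max_flow_min_cut(edge_list, src, sink):
    # edge_list entries are (u, v, capacity, literal). Edmonds-Karp max flow;
    # the min cut is then the set of edges leaving the residual-reachable side.
    cap = defaultdict(int)
    adj = defaultdict(set)
    for u, v, c, _lit in edge_list:
        cap[(u, v)] += c
        adj[u].add(v)
        adj[v].add(u)
    flow = 0
    while True:
        # BFS for an augmenting path in the residual graph.
        parent = {src: None}
        q = deque([src])
        while q and sink not in parent:
            u = q.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    q.append(v)
        if sink not in parent:
            break
        path = []
        v = sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[e] for e in path)
        for u, v in path:
            cap[(u, v)] -= push
            cap[(v, u)] += push
        flow += push
    # Everything still reachable from src in the residual graph is the
    # 'start' side of the cut; saturated edges crossing it are our literals.
    seen = {src}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in seen and cap[(u, v)] > 0:
                seen.add(v)
                q.append(v)
    cut = {lit for u, v, c, lit in edge_list if u in seen and v not in seen}
    return flow, cut

edges = [("S", "A", 10, "x"),       # short literal: high score (bad)
         ("S", "B", 2, "foobar"),   # long literal: low score (good)
         ("A", "T", 2, "barbaz"),
         ("B", "T", 10, "y")]
flow, cut = max_flow_min_cut(edges, "S", "T")  # cut == {"foobar", "barbaz"}
```

Note how the min cut naturally prefers the two long, cheap literals even though every path from start to accept must cross some literal.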
o Depending on how the split happens, you get different engine kinds. A prefix occurs if
we have a literal only after the regex portion, e.g. /<R>foo/ for a complex regex R.
An infix happens when there are literals before and after: /foo<R>bar/. Suffixes are
/foo<R>/, and "outfixes" are when things fail: /<R>/ when <R>
has nothing we can split with (outfixes still work just fine, they just don't have as
many optimizations available). We have a lot of different optimizations based on whether
an engine is 'guarded' on its left or right. For example, an engine guarded on its
left can be ignored until the literal happens. An engine guarded on its right doesn't
need to be run with 'accept states' constantly checked, as we can simply run it
to the point where the literal occurs, and check what state it's in.
+ Prefixes and outfixes are effectively 'on' immediately, as they are not guarded
to the left by any literal. Infixes and suffixes are only triggered when the literal to
their left appears.
+ Suffixes and outfixes are 'output exposed'; they can raise matches that are
user-visible. Infixes and prefixes are not output exposed.
+ A 'transient prefix' is a limited-width prefix that we only need to check when
the literal happens. Otherwise, prefixes (and outfixes, of course) need to be kept live in
our streaming mode, as we need to know the state that they would reach before we throw
away old input in our streaming writes. For example, if you have a regex /\s\d+\s+foobar/,
and don't see a 'foobar' until your 20th streaming write, 10KB into the
stream, you can't go back and see whether or not the \s\d+\s+ portion of the regex
matched or not. So you need to run prefixes busily (usually).
+ A 'miracle' is something that happens with a prefix or actively running infix
when we look at the input either at a stream boundary (end of stream write) or just before
the literal that is to the right of the prefix/infix. If the input would reset the engine
to a known state (especially 'dead') we don't need to run the engine over the
whole input. So if your prefix is /\s\d+\s/ and the last few characters of your buffer
aren't all things that match [\s\d] then you know what your state is going to be
without looking at the whole input (this refers to the Then A Miracle Occurs math
in-joke).
+ We merge engines of the same 'kind' (i.e. you can merge prefixes, infixes, etc.
but not a prefix with a suffix). This can help with performance and stream state.
+ A top is a distinct start state corresponding to a trigger (and thus is relevant to
infixes and suffixes). If an engine has multi-top capability it means that the engine can
represent several different NFA graphs starting from different points.
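Returning to 'miracles' for a moment: at heart the check is a backwards scan. A minimal sketch, assuming a prefix like /\s\d+\s/ where any byte outside [\s\d] is known to kill every engine state (the function names here are invented for the example):

```python
# 'Miracle' sketch: scan backwards from the end of the buffer for a byte that
# forces the engine into a known (dead) state; the engine then only needs to
# be run over the tail after that byte, not over the whole buffer.
def miracle_scan_start(buf, can_stay_alive):
    # can_stay_alive(ch): True if ch can keep any state of the prefix alive.
    for i in range(len(buf) - 1, -1, -1):
        if not can_stay_alive(buf[i]):
            return i + 1   # state at the boundary depends only on buf[i+1:]
    return 0               # no miracle: the whole buffer matters

# For a prefix like /\s\d+\s/, only [\s\d] keeps anything alive.
alive = lambda ch: ch.isspace() or ch.isdigit()
```

On "hello world 42 " the backward scan stops at the 'd' of "world", so only the last four bytes ever need to go through the engine.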
o The Rose interpreter does lots of things, including orchestrating all the other engine
executions. It can also do a limited amount of 'lookaround' checking, which allows
us to check whether an engine needs to run at all based on a local examination of our
context. This is more general than 'miracles' in that we can look both forward and
backward in our input. We have some better (SIMD) lookarounds on the way.
* FDR: both our literal matching framework and a particular literal matcher.
Sorry. Our current and future literal matchers will be named after Presidents of the
United States of America, although we will avoid using recent figures for fear of igniting
partisan passions. So unless you are still holding a partisan grudge against a
pre-Eisenhower president, rest easy.
o FDR as a particular literal matcher is a strided (can skip characters if literals are
long enough and not too numerous) SIMD-based bucketed (8 buckets at once) shift-or matcher
over wide 'super-characters' (currently 9-15 bits) - this one is aimed at larger
scale literal matching, where we have between about 70 and about 20,000 literals. We have
plans to adapt to larger scale literal match tasks, as FDR isn't great when the task
gets really big.
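A greatly simplified, byte-at-a-time sketch of the bucketed shift-or idea - no super-characters, no striding, no SIMD, and only one literal per bucket, all of which real FDR does differently:

```python
# Bucketed shift-or: 8 buckets share one wide state word, one 8-bit stripe
# per bucket. A 0 bit means 'this position of this bucket is still alive'.
NBUCKETS, W = 8, 8
ALL = (1 << (NBUCKETS * W)) - 1
LOW = sum(1 << (b * W) for b in range(NBUCKETS))   # bit 0 of each stripe

def build(literals):               # up to 8 byte-string literals, length <= 8
    masks = [ALL] * 256            # per-byte masks; 1 bits kill positions
    accept = []
    for b, lit in enumerate(literals):
        for j, byte in enumerate(lit):
            masks[byte] &= ~(1 << (b * W + j))
        accept.append((1 << (b * W + len(lit) - 1), lit))
    return masks, accept

def scan(masks, accept, data):
    state, hits = ALL, []
    for i, byte in enumerate(data):
        # Shift every stripe, re-arm bit 0 of each stripe, apply the mask.
        state = ((state << 1) & ALL & ~LOW) | masks[byte]
        for bit, lit in accept:
            if not state & bit:    # accept bit cleared: literal ends here
                hits.append((i - len(lit) + 1, lit))
    return hits

masks, accept = build([b"foo", b"bar", b"foobar"])
# scan(masks, accept, data) reports (start offset, literal) pairs.
```

The attraction of the scheme is that the inner loop is a shift, two ANDs and a table load regardless of how many literals share a bucket.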
o Teddy is an FDR-framework-based SIMD shift-or literal matcher that uses PSHUFB. We like
PSHUFB, and frequently use it either to reorganize input data or as a mechanism to do a
quick 4-bit lookup for 16/32/... bytes at once. Effectively Teddy looks for 4-bit nibbles
of our input strings at various locations and we have a bunch of variants depending on
number of nibbles looked at (2-8, corresponding to 1-4 bytes) and how we make use of
AVX2.
o Noodle is a non-FDR-framework literal matcher for fast matching of single strings
using a simple PCMPEQB based method.
o FDR framework literal matchers typically call the same, not-very-good confirm code,
share facilities for handling very long literals, floods of a single character, etc. The
whole framework isn't great and is being rewritten.
o hwlm is an almost entirely obsolete framework for running FDR framework literal
matchers and noodle, and also has some facility to do SIMD acceleration up to the start of
the first possible literal match. Believe it or not, hwlm stands for "hamster wheel
literal matcher". No-one understands what the grey box flag
"hamsterAccelerateSideways" was meant to do.
o HLM was an early literal matcher based on hashing; it was essentially a multiple hash
table based matcher (like doing several Rabin-Karp matchers at once). Early versions of
HLM were called Vomitsauce, after the article on "systems" papers by Jonathan
Shewchuk. At other times, thanks to Jonathan, we also had subsystems called Puggsley and
Infectoid. See "Three Sins of Authors in Computer Science and Math" at
https://www.cs.cmu.edu/~jrs/sins.html
* LimEx: this is our NFA engine. The "Lim" portion refers to
"Limited" transitions (transitions from one NFA state n to a higher numbered
state n+k for a small k - generally 0,1,2,...). These are implemented by shift-and.
"Ex" is the exception mechanism that allows other transitions (backward or large
forward), accept state handling, bounded repeat handling, etc. If we have any state with
an 'exception' we have to process the ones that are switched on, one by one.
o Squashing is the process of turning off redundant NFA states to reduce overhead. In
theory NFA states in a perfect bit-parallel implementation should be 'free' to
leave on, but in practice (due to acceleration - see below - and the 'Ex' -
exception mechanism) they aren't. So we squash NFA states that are redundant with some
other state. For example, if we have a .* state somewhere, other NFA states that only lead
to it and no other externally visible state can be squashed once the .* state is switched
on (assuming DOTALL .*) as they won't do anything interesting.
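The shift-plus-exceptions split can be sketched with a hand-built example for /a(b|c)*d/ (anchored), states bit0='a' through bit3='d'; the masks are worked out by hand for this one regex, and only forward shifts of 0-2 are treated as 'limited' here:

```python
# LimEx-style sketch: transitions that fit a small forward shift are handled
# bit-parallel; everything else is an 'exception' processed state by state.
reach = {'a': 0b0001, 'b': 0b0010, 'c': 0b0100, 'd': 0b1000}
first = 0b0001
final = 0b1000
shift_mask = {0: 0b0110,   # b->b, c->c (self loops)
              1: 0b0111,   # a->b, b->c, c->d
              2: 0b0011}   # a->c, b->d
exceptions = {0: 0b1000,   # a->d jumps too far for the shift path
              2: 0b0010}   # c->b goes backward

def step(states, ch):
    succ = 0
    for s, m in shift_mask.items():          # the cheap bit-parallel part
        succ |= (states & m) << s
    for bit, extra in exceptions.items():    # exceptions, one by one
        if states >> bit & 1:
            succ |= extra
    return succ & reach.get(ch, 0)

def matches(text):
    states = 0
    for i, ch in enumerate(text):
        states = (first & reach.get(ch, 0)) if i == 0 else step(states, ch)
    return bool(states & final)
```

Most characters are handled entirely by the shifted ANDs; the two exceptional transitions cost a per-state check each time their source state is live, which is why it pays to keep exceptions rare.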
* DFA engines: these are named for generals; American Civil War generals were the
starting point, but a few other nationalities and time periods have wandered in over
time.
o McClellan is our main production DFA engine. This isn't that unusual for a DFA -
we have an 8-bit-wide and a 16-bit-wide variant. We also have a compressed state scheme
(Sherman) for states that are not expected to be frequently accessed and are quite similar
to some other state.
o Gough is our "Start of Match" tracking DFA engine. There is some interesting
work in tracking potential 'starts of match' around the DFA; essentially
there's a little optimizing compiler in there. We had a less interesting variant of
this earlier called Haig which was more indiscriminate about tracking potential starts and
used PSHUFB to track them; this was obsoleted by Gough.
o Sheng is a very fast PSHUFB-based engine that can process states at 1 cycle per byte,
as long as your DFAs are small (no more than 16 states). This turned out to be an
independent invention of the same technique done by some other researchers, but it's
still pretty neat.
o We have a few technologies waiting in the wings for multiple simultaneous DFAs, etc.
but nothing else in trunk now.
* Most engines (NFA/DFA) can accelerate. This means that when we reach a particular
state or 'accelerable' state set we can run a cheaper SIMD check of our input and
skip input without changing our state.
o Acceleration schemes are generally SIMD code and include vermicelli (a
single-character check for a particular character with PCMPEQB) and shufti (a check for
character classes using 2 PSHUFBs per character) or truffle (a fully general check for
character classes using 3 PSHUFBs). We can also have 'double' variants where we
use 2 vermicelli checks or 2 shufti checks for adjacent characters or character classes -
this is slower in terms of maximal possible speed, but we tend to 'fall out' of
acceleration less frequently.
o Offset acceleration looks deeper into our NFA graph to find a more effective
acceleration scheme. For example, suppose we are searching for /[\w][xX]<R>/ (again,
<R> is a complex regex) - it is cheaper to look for [xX] than \w and less likely to
match, so we can search for that and adjust how far we move our current input to suit the
fact that we have 'offset' our acceleration.
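The shufti check mentioned above - classifying a byte via two 16-entry nibble tables, which is exactly the lookup PSHUFB performs 16 or 32 bytes at a time - can be sketched scalar-style. The bucket-assignment strategy here (one bucket per distinct low-nibble set) is a simplification of what the real compiler does:

```python
# Scalar shufti sketch: a byte is in the class iff its low-nibble table entry
# and high-nibble table entry share a bucket bit. Each bucket covers a
# (high-nibble set) x (low-nibble set) rectangle of the 256-byte space.
def build_shufti(byte_class):
    by_hi = {}
    for b in byte_class:
        by_hi.setdefault(b >> 4, set()).add(b & 0xF)
    buckets = {}          # frozenset(low nibbles) -> set of high nibbles
    for hi, los in by_hi.items():
        buckets.setdefault(frozenset(los), set()).add(hi)
    assert len(buckets) <= 8, "class needs more than 8 buckets"
    lo_tab, hi_tab = [0] * 16, [0] * 16
    for bit, (los, his) in enumerate(buckets.items()):
        for lo in los:
            lo_tab[lo] |= 1 << bit
        for hi in his:
            hi_tab[hi] |= 1 << bit
    return lo_tab, hi_tab

def in_class(lo_tab, hi_tab, b):
    # One AND of two nibble lookups - the scalar version of 2 PSHUFBs.
    return bool(lo_tab[b & 0xF] & hi_tab[b >> 4])
```

A class whose rectangle cover needs more than 8 buckets can't be expressed this way, which is where the fully general truffle check comes in.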
* LBR is 'large bounded repeat'. We do a lot of things here. If you give us
a pattern like /a.{20000}b/s we get it right. It's not pretty, but we handle this case
and a lot of the easier subcases. The problems here are both implementation overhead and
state space (remember, we stream and have to save out worst-case stream state). We have
optimizations for a lot of cases, including X{0,n}, X{n,infinity}, X{m,n} where n-m is
fairly big compared to n, and cases like /abcde.{2000}xyz/s - the latter state doesn't
require 2000 bits as the string "abcde" can't start at every position. We
have some nice combinatorics for the latter one based on the number of ways you can tile a
strip of length N with different dominos of length k.
o We also have a SIMD accelerated LBR engine which handles bare LBRs in the Rose graph
(i.e. between literals).
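The 'domino' remark can be made concrete: if the literal (length k, assumed not to overlap itself) can only have live start offsets at least k apart, the number of distinct live-offset sets over an n-position window obeys f(n) = f(n-1) + f(n-k) - the last position is either empty or occupied with the previous k-1 forced empty - which grows far more slowly than 2^n, so the repeat state fits in many fewer than n bits. A small self-checking sketch:

```python
# Count subsets of n positions whose members are pairwise at least k apart.
# Recurrence: f(n) = f(n-1) + f(n-k), with f(m) = 1 for m <= 0.
def count_sparse_sets(n, k):
    f = [1] * (n + 1)
    for m in range(1, n + 1):
        f[m] = f[m - 1] + (f[m - k] if m >= k else 1)
    return f[n]

# Brute-force check over all bitmasks, for small n only.
def brute_force(n, k):
    total = 0
    for bits in range(1 << n):
        ones = [i for i in range(n) if bits >> i & 1]
        if all(b - a >= k for a, b in zip(ones, ones[1:])):
            total += 1
    return total
```

For k=1 the recurrence degenerates to 2^n (every subset is legal) and for k=2 it is the Fibonacci count of strings with no two adjacent ones, which is a decent sanity check on the formula.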
* We have some ugly corner case handlers: We had something called "Puff The
Magic Engine" to handle a particular class of trailing bounded repeats. We also have
something called 'vacuous patterns' - this is something that matches on no data,
the most obvious being something like /.*/s which matches everywhere (101 times in a
100-character buffer). These engines got rolled into a multiple pattern matcher called
MegaPuffVac. Really. This is the sort of thing people don't bother with when they
build nice elegant academic systems. They probably also don't call things names like
MegaPuffVac.
* "Anchored acyclic" matchers are essentially start-anchored DFAs that
behave more like literals (i.e. they drive Rose) than engines. This has some minor
advantages - for example, an 'anchored acyclic' part can be split off from an NFA
engine even if it is not itself a workable literal. So we could trim off something like
/^\s\s\d/ from the start of a complex cyclic regex as if it's a literal in the context
of Rose.
* An SEP is a 'Short Exhaustible Passthrough' (apologies to the late
lamented Douglas Adams). This is generally a short literal in single-match mode. It's
nicer to be able to turn off looking for these when we find the first one - generally
otherwise the matches from these constructs will impose overheads through our entire
system and be intercepted only just before we raise the match.
* We have a 'small write' side path for handling small writes when possible. The
big, rather elaborate Rose subsystem is a bit ponderous when you have 10 input bytes. So
in block mode, if we can generate a simple DFA, we do that. Once you
start a small write, you can't change your mind, so we don't use small write
engines for streaming mode, for fear that our efficient handling of the first 10 byte
write will look stupid if someone subsequently follows up with 1500 bytes.
* If we can establish that several engines can't possibly be running
simultaneously, we can put them in an engine that 'contains' another set of
engines. Our container engines are named for beaches in Sydney; "Tamarama" is
the name of our 'exclusivity engine'. This mainly saves us stream state as we can
alias the state that these engines use up. There are more 'container' engines
coming; these expose the engine API to the rest of the system and make use of the same API
in their sub-engines. It's not as elegantly recursive as it sounds, largely because
construction is hard.
* A few of our internal codenames obfuscate or riff on public facing names.
Hyperscan is not a name we chose or are particularly happy with, so for a long time the
engine was called UE2 ("Ultimate Engine The Second", which still persists in
various places; in this case, apologies to the late lamented Iain M. Banks). SINGLEMATCH
mode is called "highlander mode" internally ('there can be only one').
* SoM: "Start of Match": we do accurate, streamable start of match
tracking through our system. This is an option and occasionally an expensive one, and some
patterns we support in non-SoM mode we don't support in SoM mode. Anyone who thinks
SoM is cheap or easy is either smarter than we are, not doing things in streaming mode,
trimming down the definition of SoM (possibly defensibly), or misunderstands the problem.
There is a certain amount of weird structure allowing SoM information to flow through our
system; this is an area that needs work.
* mmbit: A nice pyramidal multi-level bit structure that allows us to represent
hundreds or thousands of bits in a way that allows fast access, iteration, sparse
iteration over a selected subset, etc. with very low initialization cost. This is
necessitated by people doing things like asking us to match 13,000 regular expressions
with 50K-odd .* constructs included - over what might be very short inputs, potentially in
streaming mode. For example, we can initialize one of these 50Kbit structures with a
single 64-bit write, set a small handful of bits, and navigate a partially initialized
structure with our sparse iterators without ever depending on invalid data... Bonus
tidbit: "mm" stands for "multimeat", an obsoleted structure of ours
from back when our NFA implementation was named "meatyoar". Honestly.
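A two-level toy version of the idea (the real mmbit generalizes to more levels and runs over raw, possibly uninitialized memory; the class below stands in for that with a block array that is deliberately never cleared):

```python
# mmbit-style two-level bitmap sketch: a summary word marks which 64-bit
# blocks are live, so 'clear all' is a single write and iteration skips dead
# blocks without ever reading uninitialized (stale) block contents.
class MultiBit:
    def __init__(self, nbits):
        assert nbits <= 64 * 64
        self.summary = 0                          # the only trusted state
        self.blocks = [0] * ((nbits + 63) // 64)  # valid only where summary=1

    def clear_all(self):
        self.summary = 0               # O(1); block contents become garbage

    def set(self, i):
        blk = i // 64
        if not self.summary >> blk & 1:   # first touch: initialize lazily
            self.blocks[blk] = 0
            self.summary |= 1 << blk
        self.blocks[blk] |= 1 << (i % 64)

    def test(self, i):
        blk = i // 64
        return bool(self.summary >> blk & 1 and
                    self.blocks[blk] >> (i % 64) & 1)

    def iterate(self):
        s = self.summary
        while s:                               # walk only the live blocks
            blk = (s & -s).bit_length() - 1
            s &= s - 1
            w = self.blocks[blk]
            while w:
                bit = (w & -w).bit_length() - 1
                w &= w - 1
                yield blk * 64 + bit
```

After clear_all(), stale block words are still sitting in memory, but since the summary no longer points at them they are never consulted - which is the trick that makes initializing a 50K-bit structure cost one write.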
If you made it through that dense array of codenames, in-jokes, implementation details,
etc., congratulations - no doubt to all 3 of you still reading...
Geoff.