Hello,
Hyperscan 5.2.0 brought with it the ability to compile pure literals,
e.g. hs_compile_lit_multi(). Other than the (arguable) convenience of
not having to encode some characters for PCRE, what is the advantage of
using the hs_compile_lit function(s) over just the hs_compile ones? Is
a pattern database of literals, compiled as such, optimized in such a
way as to perform better than if they had been compiled as regular
expressions?
Thank you.
-David Wharton
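
For context, the encoding step mentioned in the question can be sketched as follows. This is a minimal illustration and not part of Hyperscan: escape_literal is a hypothetical helper that rewrites every byte of a literal as \xHH, so the result can be passed to a regex compile call without any byte being read as a metacharacter. With the literal API this step is skipped, since the literal is passed as raw bytes together with its length.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not a Hyperscan function): encode each byte of a
 * literal as \xHH so that no byte is interpreted as a regex metacharacter.
 * Returns a malloc'd NUL-terminated string, or NULL if allocation fails.
 * The caller frees the result. */
char *escape_literal(const unsigned char *lit, size_t len) {
    assert(lit != NULL || len == 0);

    /* Each input byte becomes four output characters, plus a final NUL. */
    char *out = malloc(len * 4 + 1);
    if (out == NULL) {
        return NULL;
    }

    for (size_t i = 0; i < len; i++) {
        /* Writes "\xHH" and a NUL; the NUL is overwritten by the next byte. */
        snprintf(out + i * 4, 5, "\\x%02x", (unsigned)lit[i]);
    }
    out[len * 4] = '\0';
    return out;
}
```

For example, escape_literal((const unsigned char *)"a.b", 3) yields the string \x61\x2e\x62, in which the dot can no longer act as a wildcard.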
Hi,
This is related to Hyperscan's internal design.
Hyperscan fully decomposes regular expressions into different components, and therefore
needs sophisticated global management of the matching behavior of each
component:
- The Hyperscan compile stage performs a thorough analysis of the regular expressions
and generates a set of execution sequences made up of "instructions"
(see src/rose/rose_program.h for all the defined instructions). These execution
sequences are stored in the database when compilation finishes. More specifically,
different instruction sequences serve different goals, and each such sequence is
called a "program". Different programs may use different types of
instructions.
- At runtime, the precompiled programs in the database guide all of the matching
work. The current implementation uses one huge switch-case inside an infinite loop to
handle every possible instruction from every program (see the last, long function in
src/rose/program_runtime.c).
However, this runtime implementation targets the general case of a regular expression
set. For a pure literal set, we realized that not all of the instructions are
literal-related, as many of them serve other purposes such as triggering regex
engines. When a pure literal set is matched through such a huge switch-case, many branches
are useless and performance may suffer. So this is another reason for introducing a
new, cleaner switch-case implementation with fewer branches and a new API for pure literal
sets, besides the fact that users have asked for a way to match pure literal
sets.
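
The idea can be illustrated with a toy sketch. This is not the code in src/rose/program_runtime.c: the instruction names, the toy_ctx struct and both runner functions below are made up for illustration. It only shows the shape of the argument: a general runner whose switch covers every instruction kind, next to a literal-only runner whose switch covers just the literal-related ones.

```c
#include <assert.h>
#include <stddef.h>

/* Toy instruction set. The names are hypothetical and are not the ones
 * defined in rose_program.h. */
enum toy_op {
    OP_CHECK_LITERAL,   /* literal-related: confirm a literal match     */
    OP_REPORT,          /* literal-related: raise a match to the caller */
    OP_TRIGGER_ENGINE,  /* regex-only: hand off to a regex sub-engine   */
    OP_CHECK_BOUNDS,    /* regex-only: check match offset bounds        */
    OP_END              /* terminates a program                         */
};

/* Counters standing in for real matching state. */
struct toy_ctx {
    int reports;
    int engine_triggers;
};

/* General runner: one switch handles every instruction kind, inside a loop
 * that only exits on OP_END (or on an unknown opcode). */
int run_general(const enum toy_op *prog, struct toy_ctx *ctx) {
    assert(prog != NULL && ctx != NULL);
    for (size_t pc = 0;; pc++) {
        switch (prog[pc]) {
        case OP_CHECK_LITERAL:  break;
        case OP_REPORT:         ctx->reports++; break;
        case OP_TRIGGER_ENGINE: ctx->engine_triggers++; break;
        case OP_CHECK_BOUNDS:   break;
        case OP_END:            return 0;
        default:                return -1;
        }
    }
}

/* Literal-only runner: the switch keeps only the literal-related cases.
 * Any regex-only instruction is rejected with -1. */
int run_literal_only(const enum toy_op *prog, struct toy_ctx *ctx) {
    assert(prog != NULL && ctx != NULL);
    for (size_t pc = 0;; pc++) {
        switch (prog[pc]) {
        case OP_CHECK_LITERAL: break;
        case OP_REPORT:        ctx->reports++; break;
        case OP_END:           return 0;
        default:               return -1;
        }
    }
}
```

In this sketch a program such as { OP_CHECK_LITERAL, OP_REPORT, OP_END } gives the same result under both runners, while the literal-only runner has fewer cases to dispatch over. Whether and how much that helps in practice depends on the real instruction set and workload, so it is worth benchmarking both APIs on your own patterns.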
Thanks,
Yang
-----Original Message-----
From: David Wharton <hyperscan(a)davidwharton.us>
Sent: Wednesday, November 11, 2020 9:32 AM
To: hyperscan(a)lists.01.org
Subject: [Hyperscan] Performance of pure literal pattern database vs regular expressions