Geoff,
Thanks, this makes sense.
John
On Thu, Dec 20, 2018 at 2:46 AM Geoff Langdale <geoff.langdale(a)gmail.com>
wrote:
Unfortunately there is no consistent guarantee that HS_FLAG_PREFILTER
will operate at the level of the parse tree (which would allow relatively
easy round-tripping of the patterns back to the point they could be
concisely printed out). Back-references and arbitrary lookarounds are
snipped out in this fashion. However, very large bounded repeat expressions
are dealt with only at the NFA graph level and the NFA graph -> regex
transformation is ... non-trivial.
The former kind of construct (back-references and lookarounds) is in
principle recoverable and could be printed out. The current codebase does
not do this, but it would be easy to adapt.
In fact, the bulk of unsupported constructs are just quietly excised or
replaced by something quite weak. Your example is a good one: it would
simply get converted to /<html>.*</html>/. It would not get converted to
/<html>.*strong.*</html>/, as this imposes an ordering that is not required
by the original pattern (which can also match input of the form
<html> ... </html> ... strong, where "strong" appears after the closing tag).
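
The ordering point can be checked with a small sketch. This uses Python's re
module purely as a stand-in for the regex semantics (it is not Hyperscan, and
it says nothing about what Hyperscan generates internally). The sample inputs
and the helper name `hits` are made up for illustration.

```python
import re

# The original pattern with lookaheads, and the two candidate approximations.
original = re.compile(r"<html>(?=.*strong)(?=.*great).*</html>")
weak = re.compile(r"<html>.*</html>")
ordered = re.compile(r"<html>.*strong.*</html>")


def hits(rx, text):
    """Return True if rx matches anywhere in text."""
    return rx.search(text) is not None


# "strong" and "great" appear only after the closing tag.
after_close = "<html> ... </html> ... strong great"
# Neither word appears at all.
plain = "<html> plain </html>"

# The original matches: the lookaheads may scan past </html>.
print(hits(original, after_close))
# The weak approximation also matches, so no true match is lost.
print(hits(weak, after_close))
# The ordered variant misses it, so it would produce a false negative.
print(hits(ordered, after_close))
# The weak approximation fires on input the original rejects (a false positive).
print(hits(weak, plain), hits(original, plain))
```

A prefilter is only allowed to err on the side of false positives, which is
why the weaker pattern is acceptable and the ordered one is not.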
Regards,
Geoff.
On Thu, Dec 20, 2018 at 2:43 PM John Searles <jsearles(a)gmail.com> wrote:
> Geoff,
> Thank you for the quick response. I think I understand, and I may have
> mis-phrased my question. Given a pattern compiled with HS_FLAG_PREFILTER, is
> there a way to get out the equivalent broader pattern that is being looked
> for by Hyperscan? E.g. the documentation gives a potential example -- is
> there any way to get the actual before and after patterns? From the
> documentation: 'For example, the pattern /(\w+) again \1/ contains the
> back-reference \1. In prefiltering mode, this pattern might be approximated
> by having its back-reference replaced with its referent, forming /\w+ again
> \w+/.' Understanding that the example may or may not be how HS actually
> pre-filters, is it possible, for a given compiled set of patterns, to
> print out what the pre-filter ends up being? The goal is to identify whether a
> given prefilter pattern will end up matching very frequently based on human
> visual inspection. Something like the pattern
> `<html>(?=.*strong)(?=.*great).*</html>` may be converted to
> `<html>.*</html>` which would match very often given the right corpus of
> data. Knowing which pre-filter pattern is generated would be
> helpful in this case. Maybe `<html>.*strong.*</html>` is used, and that's
> significantly less frequent than most alternatives.
>
> Thank you,
> John Searles
>
>
> On Wed, Dec 19, 2018 at 8:31 PM Geoff Langdale <geoff.langdale(a)gmail.com>
> wrote:
>
>> HS_FLAG_PREFILTER doesn't do what you're imagining it does (which is
>> quite a reasonable use case; it's just that Hyperscan never was set up to
>> do it). HS_FLAG_PREFILTER just tells Hyperscan "it's OK to achieve pattern
>> support or better performance by compromising on pattern matching accuracy,
>> as long as the compromise can only produce false positives". Currently
>> Hyperscan does *not* do any altered match behaviour to improve
>> performance under conditions where Hyperscan would otherwise support the
>> pattern - HS_FLAG_PREFILTER is only there to allow unsupported constructs
>> like backreferences and arbitrary lookarounds to be supported in some sense
>> by Hyperscan.
>>
>> The idea of extracting prefilter strings from regexes isn't bad and we
>> had talked at one stage about providing an analysis call to do it. There's
>> no current way to do it, and much of the logic of how strings are chosen is
>> a bit opaque. For example, you might find that strings that are in patterns
>> that have fairly strong anchored prefixes (e.g. 'foobar' in
>> /^abc\d+foobar/s) may not necessarily be put into our floating literal
>> table. So the set of strings that Hyperscan chooses is an artifact of what
>> can be matched quickly and what Hyperscan needs (e.g. do we take two
>> strings from a pattern or one) as opposed to a 'pure' analysis of the
>> pattern. Does that make sense?
>>
>> Regards,
>> Geoff.
>>
>>
>> On Thu, Dec 20, 2018 at 12:16 PM John Searles <jsearles(a)gmail.com>
>> wrote:
>>
>>> I am looking to see if there is a way to get the prefilter string that
>>> Hyperscan is using for a given regex when HS_FLAG_PREFILTER is set.
>>> I tried looking, but could not find a way to do it.
>>>
>>> Thanks,
>>> John Searles
>>>
>>