Be Careful What You Match For, You Might Not Get It

So I ran into a really interesting issue in Java regular expression parsing while trying to work on an issue for a customer.

OpenNMS has the ability to listen for syslog messages, and turn them into OpenNMS events. To configure it, you specify a mapping of substring or regular expressions to UEIs (OpenNMS’s internal event identifiers).

The customer saw a huge drop in performance from 1.8.0 to 1.8.1. Basically the only change to the syslog daemon was a change to use Matcher.find() instead of Matcher.matches(). The problem was that they were making regular expressions like this:

foo0: .*load test (\\S+) on ((pts\\/\\d+)|(tty\\d+))

…which weren’t matching. So they changed it to put .* at the front, so matches() would get it:

.*foo0: .*load test (\\S+) on ((pts\\/\\d+)|(tty\\d+))

Upon upgrading to 1.8.1, they saw orders of magnitude slowdown. The reason is that when you haven’t specified an anchor, find has to figure out the “right” starting point for the match. In doing so, it spins a LOT, compared to matches() and its implicit anchors. It’s very expensive to scan all the way through the string, attempting to re-apply the regex, if it turns out there is no match. We . . . → Read More: Be Careful What You Match For, You Might Not Get It

Share on Facebook