Be Careful What You Match For, You Might Not Get It

So I ran into a really interesting issue in Java regular expression parsing while trying to work on an issue for a customer.

OpenNMS has the ability to listen for syslog messages, and turn them into OpenNMS events. To configure it, you specify a mapping of substring or regular expressions to UEIs (OpenNMS's internal event identifiers).

The customer saw a huge drop in performance from 1.8.0 to 1.8.1. Basically the only change to the syslog daemon was a change to use Matcher.find() instead of Matcher.matches(). The problem was that they were making regular expressions like this:

foo0: .*load test (\\S+) on ((pts\\/\\d+)|(tty\\d+))

...which weren't matching. So they changed it to put .* at the front, so matches() would get it:

.*foo0: .*load test (\\S+) on ((pts\\/\\d+)|(tty\\d+))

Upon upgrading to 1.8.1, they saw orders of magnitude slowdown. The reason is that when you haven't specified an anchor, find has to figure out the "right" starting point for the match. In doing so, it spins a LOT, compared to matches() and its implicit anchors. It's very expensive to scan all the way through the string, attempting to re-apply the regex, if it turns out there is no match. We figured this out this morning after I put together some benchmarks to show the differences:

---
regex = \s(19|20)\d\d([-/.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])(\s+)(\S+)(\s)(\S.+)
input = <6>main: 2010-08-19 localhost foo23: load test 23 on tty1

matches = false: total time: 167, number per second: 5988023.9521
find = true: total time: 1264, number per second: 791139.2405
matches (.* at beginning and end) = true: total time: 2598, number per second: 384911.4704
find (.* at beginning and end) = true: total time: 2572, number per second: 388802.4883
matches (^.* at beginning, .*$ at end) = true: total time: 2918, number per second: 342700.4798
find (^.* at beginning, .*$ at end) = true: total time: 2648, number per second: 377643.5045

---
regex = \s(19|20)\d\d([-/.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])(\s+)(\S+)(\s)(\S.+)
input = <6>main: 2010-08-01 localhost foo23: load test 23 on tty1

matches = false: total time: 128, number per second: 7812500.0000
find = true: total time: 1199, number per second: 834028.3570
matches (.* at beginning and end) = true: total time: 2570, number per second: 389105.0584
find (.* at beginning and end) = true: total time: 2554, number per second: 391542.6782
matches (^.* at beginning, .*$ at end) = true: total time: 2630, number per second: 380228.1369
find (^.* at beginning, .*$ at end) = true: total time: 2595, number per second: 385356.4547

---
regex = foo0: .*load test (\S+) on ((pts\/\d+)|(tty\d+))
input = <6>main: 2010-08-19 localhost foo23: load test 23 on tty1

matches = false: total time: 87, number per second: 11494252.8736
find = false: total time: 193, number per second: 5181347.1503
matches (.* at beginning and end) = false: total time: 1242, number per second: 805152.9791
find (.* at beginning and end) = false: total time: 28631, number per second: 34927.1768
matches (^.* at beginning, .*$ at end) = false: total time: 1241, number per second: 805801.7728
find (^.* at beginning, .*$ at end) = false: total time: 1242, number per second: 805152.9791

---
regex = foo23: .*load test (\S+) on ((pts\/\d+)|(tty\d+))
input = <6>main: 2010-08-19 localhost foo23: load test 23 on tty1

matches = false: total time: 85, number per second: 11764705.8824
find = true: total time: 873, number per second: 1145475.3723
matches (.* at beginning and end) = true: total time: 1812, number per second: 551876.3797
find (.* at beginning and end) = true: total time: 1879, number per second: 532197.9776
matches (^.* at beginning, .*$ at end) = true: total time: 1874, number per second: 533617.9296
find (^.* at beginning, .*$ at end) = true: total time: 1865, number per second: 536193.0295

---
regex = 1997
input = <6>main: 2010-08-19 localhost foo23: load test 23 on tty1

matches = false: total time: 80, number per second: 12500000.0000
find = false: total time: 215, number per second: 4651162.7907
matches (.* at beginning and end) = false: total time: 1339, number per second: 746825.9895
find (.* at beginning and end) = false: total time: 37722, number per second: 26509.7291
matches (^.* at beginning, .*$ at end) = false: total time: 1350, number per second: 740740.7407
find (^.* at beginning, .*$ at end) = false: total time: 1351, number per second: 740192.4500

The moral of the story is, if you're using Matcher.find(), use no anchors and no .*, but in all cases, you'll get the most deterministic behavior from always anchoring your regular expressions properly.

Share on Facebook

1 comment to Be Careful What You Match For, You Might Not Get It