So I ran into a really interesting issue in Java regular expression parsing while trying to work on an issue for a customer.

OpenNMS has the ability to listen for syslog messages, and turn them into OpenNMS events. To configure it, you specify a mapping of substring or regular expressions to UEIs (OpenNMS's internal event identifiers).

The customer saw a huge drop in performance from 1.8.0 to 1.8.1. Basically the only change to the syslog daemon was a change to use Matcher.find() instead of Matcher.matches(). The problem was that they were making regular expressions like this:

foo0: .*load test (\\S+) on ((pts\\/\\d+)|(tty\\d+))

...which weren't matching. So they changed it to put .* at the front, so matches() would get it:

.*foo0: .*load test (\\S+) on ((pts\\/\\d+)|(tty\\d+))

Upon upgrading to 1.8.1, they saw orders of magnitude slowdown. The reason is that when you haven't specified an anchor, find has to figure out the "right" starting point for the match. In doing so, it spins a LOT, compared to matches() and its implicit anchors. It's very expensive to scan all the way through the string, attempting to re-apply the regex, if it turns out there is no match. We figured this out this morning after I put together some benchmarks to show the differences:

---

regex = \s(19|20)\d\d([-/.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])(\s+)(\S+)(\s)(\S.+)

input = <6>main: 2010-08-19 localhost foo23: load test 23 on tty1matches = false: total time: 167, number per second: 5988023.9521

find = true: total time: 1264, number per second: 791139.2405

matches (.* at beginning and end) = true: total time: 2598, number per second: 384911.4704

find (.* at beginning and end) = true: total time: 2572, number per second: 388802.4883

matches (^.* at beginning, .*$ at end) = true: total time: 2918, number per second: 342700.4798

find (^.* at beginning, .*$ at end) = true: total time: 2648, number per second: 377643.5045---

regex = \s(19|20)\d\d([-/.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])(\s+)(\S+)(\s)(\S.+)

input = <6>main: 2010-08-01 localhost foo23: load test 23 on tty1matches = false: total time: 128, number per second: 7812500.0000

find = true: total time: 1199, number per second: 834028.3570

matches (.* at beginning and end) = true: total time: 2570, number per second: 389105.0584

find (.* at beginning and end) = true: total time: 2554, number per second: 391542.6782

matches (^.* at beginning, .*$ at end) = true: total time: 2630, number per second: 380228.1369

find (^.* at beginning, .*$ at end) = true: total time: 2595, number per second: 385356.4547---

regex = foo0: .*load test (\S+) on ((pts\/\d+)|(tty\d+))

input = <6>main: 2010-08-19 localhost foo23: load test 23 on tty1matches = false: total time: 87, number per second: 11494252.8736

find = false: total time: 193, number per second: 5181347.1503

matches (.* at beginning and end) = false: total time: 1242, number per second: 805152.9791

find (.* at beginning and end) = false: total time: 28631, number per second: 34927.1768

matches (^.* at beginning, .*$ at end) = false: total time: 1241, number per second: 805801.7728

find (^.* at beginning, .*$ at end) = false: total time: 1242, number per second: 805152.9791---

regex = foo23: .*load test (\S+) on ((pts\/\d+)|(tty\d+))

input = <6>main: 2010-08-19 localhost foo23: load test 23 on tty1matches = false: total time: 85, number per second: 11764705.8824

find = true: total time: 873, number per second: 1145475.3723

matches (.* at beginning and end) = true: total time: 1812, number per second: 551876.3797

find (.* at beginning and end) = true: total time: 1879, number per second: 532197.9776

matches (^.* at beginning, .*$ at end) = true: total time: 1874, number per second: 533617.9296

find (^.* at beginning, .*$ at end) = true: total time: 1865, number per second: 536193.0295---

regex = 1997

input = <6>main: 2010-08-19 localhost foo23: load test 23 on tty1matches = false: total time: 80, number per second: 12500000.0000

find = false: total time: 215, number per second: 4651162.7907

matches (.* at beginning and end) = false: total time: 1339, number per second: 746825.9895

find (.* at beginning and end) = false: total time: 37722, number per second: 26509.7291

matches (^.* at beginning, .*$ at end) = false: total time: 1350, number per second: 740740.7407

find (^.* at beginning, .*$ at end) = false: total time: 1351, number per second: 740192.4500

The moral of the story is, if you're using Matcher.find(), use no anchors and no .*, but in all cases, you'll get the most deterministic behavior from always anchoring your regular expressions properly.

Share on Facebook
[…] This post was mentioned on Twitter by Benjamin Reed, planettrilug. planettrilug said: New planet post: Ben Reed: Be Careful What You Match For, You Might Not Get It: So I ran into a really interesting… http://bit.ly/99hFJ2 […]