Tuesday 18 January 2011

Buggy JavaScript Regex and parsing speeds in Firefox, Chrome, Opera and IE6

I was working on a little section of a small javascript application I'm writing when I came across this issue and I've decided it's worthy of documenting in the public domain.

The problem: This Javascript fragment of a regular expression replace statement requires exponential run-time to parse a string the longer it gets:

input.replace(/([^\s]+)*&radic;\{(.*?)\}/gi, '<sup>$1</sup>&radic;<span class="ol">$2</span>')

It replaces an example string - "&radic;{123}" - with this - "<sup></sup>&radic;<span class="ol">123</span>".

But with each additional character added inside the curly braces the time taken to replace it increases exponentially. I have 4 browsers installed on my system - Firefox 3.6.13, Google Chrome 5, Opera 11 and Internet Explorer 6. I haven't upgraded IE6 to the latest version because I use it to test how websites look for people using business/government computers whose technology department haven't upgraded the browsers in 10 years! (Shame on you!)

The results are surprising to say the least. The table and charts below show how long it took (in milliseconds) for each browser to parse a string with 1 to 20 characters inside the curly braces and output it to the screen.


 
Str Length Time FF Time CH Time OP Time IE6
1 5 0 0 0
2 5 1 0 0
3 5 0 0 0
4 6 1 0 0
5 8 1 0 0
6 12 1 0 0
7 19 1 0 0
8 33 3 0 0
9 60 5 0 0
10 115 8 0 0
11 224 17 0 0
12 448 35 0 15
13 892 71 0 0
14 1773 143 0 0
15 3544 286 0 15
16 7113 576 0 0
17 14167 1146 0 31
18 28356 2287 0 47
19 58691 4593 0 94
20 113196 9159 0 188




The results are very surprising. At 13 characters long, the browser suddenly starts becoming very slow and a pattern becomes apparent - that the time taken to execute the regex statement is taking exponentially longer.

IE6 only really starts doing this at 17+ characters but is, surprisingly, faster that both Firefox and Chrome! IE6 faster than Chrome! OMG! Oddly, IE6 spontaneously spikes to 15ms every now and again before reaching 17 characters whether 1 or 16 characters are entered.

Another thing to note is that Chrome is more than 10x faster at doing this than Firefox. The JagerMonkey needs to go on a JagerDiet.

What is even more surprising than that is that execution time in Opera is very close to 0ms and does not rise at all! In Opera this operation isn't exponential! I tested it with over 100 characters and performance in Opera is amazing going above no more than 1ms. (Much respect to the Opera Dev team and their Carakan engine!)

When I made this mistake I was really surprised why it took so long but quickly realised my mistake. I can't really explain what's going on under the hood of the regex engine (because I don't really know in that much detail) but I identified the bug in my regex and solved the problem, which is the single asterisk (*) highlighted:

input.replace(/([^\s]+)*&radic;\{(.*?)\}/gi, '<sup>$1</sup>&radic;<span class="ol">$2</span>')

and which should be replaced with a question mark to solve the problem.

input.replace(/([^\s]+)?&radic;\{(.*?)\}/gi, '<sup>$1</sup>&radic;<span class="ol">$2</span>')

What the regex code is doing, and don't blame me if I'm wrong - I'm no expert - is that when you use the * operator it checks to see if the pattern before it exists and it does this by trying to match the whole length of the string with the pattern [^\s] (which means any character but empty space) and then it reads the string one character at a time until it reaches the end and then it back tracks the length of the string to the beginning. It's a greedy operator which tries to find the longest possible matching string. After that, the rest of the regex is interpreted accordingly.

When you replace * with a ? it checks to see if there's anything before &radic; and if there isn't it simply moves on to carry out the rest of the regex. It only checks one character minimum before it moves on while using * checks the length of the string with each character it moves along. That's why using * becomes slow the longer the string gets.

I think the reason Opera is not affected by this coding bug is because their internal regex engine identifies the bug and realizes that it's stupid for the greedy operator to check the whole string when the next regex pattern after it matches instantly so internally it treats the * operator like a ? operator, thus reducing execution time. They have very clever people at Opera. Mozilla and Google should really think about incorporating this performance increasing technique into their own engines.

2 comments:

  1. Read the O'Rielly Mastering Regex Book. It explains in detail the issues with exponential parsing times. There are three main types of regex engine: Non-deterministic Finite Automata (NFA), Deterministic Finite Automata (DFA), and Posix. Nobody uses Posix, most implementations are NFA which can run exponentially for certain expressions. DFA is not exponential but doesnt support certain features such as capture groups.

    ReplyDelete
  2. Thanks for commenting. That sounds very interesting and I will take a look through that book once I've gone through a ton of other stuff I've got on my reading list. Ta.

    ReplyDelete