Ahmad's Blog: Buggy JavaScript Regex and parsing speeds in Firefox, Chrome, Opera and IE6

I was working on a little section of a small javascript application I'm writing when I came across this issue and I've decided it's worthy of documenting in the public domain.

The problem: This Javascript fragment of a regular expression replace statement requires exponential run-time to parse a string the longer it gets:

input.replace(/([^\s]+)*&radic;\{(.*?)\}/gi, '<sup>$1</sup>&radic;<span class="ol">$2</span>')

It replaces an example string - "√{123}" - with this - "<sup></sup>√<span class="ol">123</span>".

But with each additional character added inside the curly braces the time taken to replace it increases exponentially. I have 4 browsers installed on my system - Firefox 3.6.13, Google Chrome 5, Opera 11 and Internet Explorer 6. I haven't upgraded IE6 to the latest version because I use it to test how websites look for people using business/government computers whose technology department haven't upgraded the browsers in 10 years! (Shame on you!)

The results are surprising to say the least. The table and charts below show how long it took (in milliseconds) for each browser to parse a string with 1 to 20 characters inside the curly braces and output it to the screen.

Str Length	Time FF	Time CH	Time OP	Time IE6
1	5	0	0	0
2	5	1	0	0
3	5	0	0	0
4	6	1	0	0
5	8	1	0	0
6	12	1	0	0
7	19	1	0	0
8	33	3	0	0
9	60	5	0	0
10	115	8	0	0
11	224	17	0	0
12	448	35	0	15
13	892	71	0	0
14	1773	143	0	0
15	3544	286	0	15
16	7113	576	0	0
17	14167	1146	0	31
18	28356	2287	0	47
19	58691	4593	0	94
20	113196	9159	0	188

The results are very surprising. At 13 characters long, the browser suddenly starts becoming very slow and a pattern becomes apparent - that the time taken to execute the regex statement is taking exponentially longer.

IE6 only really starts doing this at 17+ characters but is, surprisingly, faster that both Firefox and Chrome! IE6 faster than Chrome! OMG! Oddly, IE6 spontaneously spikes to 15ms every now and again before reaching 17 characters whether 1 or 16 characters are entered.

Another thing to note is that Chrome is more than 10x faster at doing this than Firefox. The JagerMonkey needs to go on a JagerDiet.

What is even more surprising than that is that execution time in Opera is very close to 0ms and does not rise at all! In Opera this operation isn't exponential! I tested it with over 100 characters and performance in Opera is amazing going above no more than 1ms. (Much respect to the Opera Dev team and their Carakan engine!)

When I made this mistake I was really surprised why it took so long but quickly realised my mistake. I can't really explain what's going on under the hood of the regex engine (because I don't really know in that much detail) but I identified the bug in my regex and solved the problem, which is the single asterisk (*) highlighted:

input.replace(/([^\s]+)*&radic;\{(.*?)\}/gi, '<sup>$1</sup>&radic;<span class="ol">$2</span>')

and which should be replaced with a question mark to solve the problem.

input.replace(/([^\s]+)?&radic;\{(.*?)\}/gi, '<sup>$1</sup>&radic;<span class="ol">$2</span>')

What the regex code is doing, and don't blame me if I'm wrong - I'm no expert - is that when you use the * operator it checks to see if the pattern before it exists and it does this by trying to match the whole length of the string with the pattern [^\s] (which means any character but empty space) and then it reads the string one character at a time until it reaches the end and then it back tracks the length of the string to the beginning. It's a greedy operator which tries to find the longest possible matching string. After that, the rest of the regex is interpreted accordingly.

When you replace * with a ? it checks to see if there's anything before √ and if there isn't it simply moves on to carry out the rest of the regex. It only checks one character minimum before it moves on while using * checks the length of the string with each character it moves along. That's why using * becomes slow the longer the string gets.

I think the reason Opera is not affected by this coding bug is because their internal regex engine identifies the bug and realizes that it's stupid for the greedy operator to check the whole string when the next regex pattern after it matches instantly so internally it treats the * operator like a ? operator, thus reducing execution time. They have very clever people at Opera. Mozilla and Google should really think about incorporating this performance increasing technique into their own engines.

2 comments:

Anonymous24 May 2011 at 03:42
Read the O'Rielly Mastering Regex Book. It explains in detail the issues with exponential parsing times. There are three main types of regex engine: Non-deterministic Finite Automata (NFA), Deterministic Finite Automata (DFA), and Posix. Nobody uses Posix, most implementations are NFA which can run exponentially for certain expressions. DFA is not exponential but doesnt support certain features such as capture groups.
A. Retha1 June 2011 at 00:16
Thanks for commenting. That sounds very interesting and I will take a look through that book once I've gone through a ton of other stuff I've got on my reading list. Ta.

Ahmad's Blog

Tuesday, 18 January 2011

Buggy JavaScript Regex and parsing speeds in Firefox, Chrome, Opera and IE6

2 comments:

About Me