Tuesday 18 January 2011

Buggy JavaScript Regex and parsing speeds in Firefox, Chrome, Opera and IE6

I was working on a little section of a small javascript application I'm writing when I came across this issue and I've decided it's worthy of documenting in the public domain.

The problem: This Javascript fragment of a regular expression replace statement requires exponential run-time to parse a string the longer it gets:

input.replace(/([^\s]+)*&radic;\{(.*?)\}/gi, '<sup>$1</sup>&radic;<span class="ol">$2</span>')

It replaces an example string - "&radic;{123}" - with this - "<sup></sup>&radic;<span class="ol">123</span>".

But with each additional character added inside the curly braces the time taken to replace it increases exponentially. I have 4 browsers installed on my system - Firefox 3.6.13, Google Chrome 5, Opera 11 and Internet Explorer 6. I haven't upgraded IE6 to the latest version because I use it to test how websites look for people using business/government computers whose technology department haven't upgraded the browsers in 10 years! (Shame on you!)

The results are surprising to say the least. The table and charts below show how long it took (in milliseconds) for each browser to parse a string with 1 to 20 characters inside the curly braces and output it to the screen.


 
Str Length Time FF Time CH Time OP Time IE6
1 5 0 0 0
2 5 1 0 0
3 5 0 0 0
4 6 1 0 0
5 8 1 0 0
6 12 1 0 0
7 19 1 0 0
8 33 3 0 0
9 60 5 0 0
10 115 8 0 0
11 224 17 0 0
12 448 35 0 15
13 892 71 0 0
14 1773 143 0 0
15 3544 286 0 15
16 7113 576 0 0
17 14167 1146 0 31
18 28356 2287 0 47
19 58691 4593 0 94
20 113196 9159 0 188




The results are very surprising. At 13 characters long, the browser suddenly starts becoming very slow and a pattern becomes apparent - that the time taken to execute the regex statement is taking exponentially longer.

IE6 only really starts doing this at 17+ characters but is, surprisingly, faster that both Firefox and Chrome! IE6 faster than Chrome! OMG! Oddly, IE6 spontaneously spikes to 15ms every now and again before reaching 17 characters whether 1 or 16 characters are entered.

Another thing to note is that Chrome is more than 10x faster at doing this than Firefox. The JagerMonkey needs to go on a JagerDiet.

What is even more surprising than that is that execution time in Opera is very close to 0ms and does not rise at all! In Opera this operation isn't exponential! I tested it with over 100 characters and performance in Opera is amazing going above no more than 1ms. (Much respect to the Opera Dev team and their Carakan engine!)

When I made this mistake I was really surprised why it took so long but quickly realised my mistake. I can't really explain what's going on under the hood of the regex engine (because I don't really know in that much detail) but I identified the bug in my regex and solved the problem, which is the single asterisk (*) highlighted:

input.replace(/([^\s]+)*&radic;\{(.*?)\}/gi, '<sup>$1</sup>&radic;<span class="ol">$2</span>')

and which should be replaced with a question mark to solve the problem.

input.replace(/([^\s]+)?&radic;\{(.*?)\}/gi, '<sup>$1</sup>&radic;<span class="ol">$2</span>')

What the regex code is doing, and don't blame me if I'm wrong - I'm no expert - is that when you use the * operator it checks to see if the pattern before it exists and it does this by trying to match the whole length of the string with the pattern [^\s] (which means any character but empty space) and then it reads the string one character at a time until it reaches the end and then it back tracks the length of the string to the beginning. It's a greedy operator which tries to find the longest possible matching string. After that, the rest of the regex is interpreted accordingly.

When you replace * with a ? it checks to see if there's anything before &radic; and if there isn't it simply moves on to carry out the rest of the regex. It only checks one character minimum before it moves on while using * checks the length of the string with each character it moves along. That's why using * becomes slow the longer the string gets.

I think the reason Opera is not affected by this coding bug is because their internal regex engine identifies the bug and realizes that it's stupid for the greedy operator to check the whole string when the next regex pattern after it matches instantly so internally it treats the * operator like a ? operator, thus reducing execution time. They have very clever people at Opera. Mozilla and Google should really think about incorporating this performance increasing technique into their own engines.

Saturday 8 January 2011

Dolly the sheep lives

...on through her clones.


Dolly the sheep has been reborn. Four clones have been made by the scientist behind the original research.
The quads, which have been nicknamed ‘the Dollies’, are exact genetic copies of their predecessor, who was put down seven years ago.

Wednesday 5 January 2011

SAS Macro functions and returning values and a bit about macro variable type casting

I wrote an article about this subject before but I was quite new to SAS and it was quite confusing so I'm writing it again better and clearer this time.

The SAS macro programming language is a strange language lacking data types we usually associate with strongly or even weakly typed programming languages. SAS macro variables are intrinsically treated as strings. The equivalent command for casting a type from one to another is an input() command. For example, if I have a string that contains the month and year (MMMYY), then I can convert it to a SAS date format like this:

%let mmmyy = JAN11;
%let datum = %sysfunc(input(&mmmyy., MONYY.)); 
*the second argument defines the format of the data (&mmmyy.) that you want to cast into SAS format. &datum. now holds the SAS date equivalent of 01/01/2011;

Datum now holds the SAS date format of the first of January 2011. SAS macro variable values like the integer date stored in datum are intrinsically stored as strings, so you can't just do anything with them. In order to treat them as an actual integer and do any mathematical operations on them you need to use eval() like so:

%let yesteryear = %eval(&datum. - 1); *holds the SAS date for 31/12/2010;

One problem often encountered with SAS macro variables is when working with dates. Often, people define a date like so:

%let bankholiday = "03JAN11"d;

This is acceptable, but if you try to do any mathematical operations using that variable outside of a data step it will throw an error message and literally show that you tried to carry out an operation on the string ' "03JAN11"d '. The best ways to convert such a date to the SAS date format is to use either intnx() or define the date using mdy():

%let bankholiday = %sysfunc(mdy(01, 03, 11));
or
%let bankholiday = %sysfunc((intnx(day, "03JAN11"d, 0)));

I've yet to figure out how to detect the type of the macro variable before I do anything with it, so I advise you to inform other programmers about the type of the variable your code works with before you let someone else use your code.

Now. When it comes to macro functions, SAS also proves to be strange when compared to other languages. What is a very natural and normal way of working in a procedural/Object Oriented language appears to be missing in SAS. I've yet to see an official documented example that shows a macro returning a value, but it is possible. I doubt I'm the first person to discover or use this method.

What is odd about macro functions that return values is that you should NEVER use single line comments inside the body of the macro otherwise it throws an error! Always use the multi-line comments like /* this comment */. And there is no "return" command preceding the variable to make it return but there is a %return statement and all this does is stop the execution of the rest of the code and jump out of the macro. The way to return a value is to just write the the name of the variable holding the value you wish to return, without a semicolon on the end.

Anyway. Here's an example bit of code that lets you ask if a date is a weekend or not. If it is a weekend it returns 1 else it returns 0.

/**
* %isWeekend(datum)
* Tells you if a date is a weekend or not by returning 1 (true) or 0 (false). If you leave it empty it will check todays date.
* Usage:
%let datum = %sysfunc(mdy(01,01,2011));
%put %isWeekend(&datum.); *shows 1 (true);
*
* @return    boolean    If the date you supplied is a weekend it will return 1, otherwise 0.
* @param    date    date    The date you want to check. If you leave this empty it will use todays date
* @date 20110105
* @author Ahmad Retha
**/

%macro isWeekend(date);
    %if date=  %then %do;
        %let date = %sysfunc(today());
    %end;

    %let wd = %sysfunc(weekday(&date.)) ; /*find the weekday of the date given. Sunday=1, Saturday=7*/

    %let iwe = 0; /*set the variable we wish to retune to 0 (false) initially*/
    %if &wd.=1 or &wd.=7 %then %do;
        %let iwe = 1;
    %end;

    /* return iwe (1 or 0) */
    &iwe.
%mend;


Allow me to explain what this code does. The first part of the code, the %if statement, checks to see if a date argument was supplied and if it isn't it uses today's date. This behaviour of setting an empty parameter to a default value makes our macro function more robust and easy to use - it is a recommended practice.

The next section assigns the weekday, Sunday, Monday, Tuesday through to Saturday to the variable &wd. as a number from 1 to 7, as there are 7 days in a week. In SAS, the week starts on Sunday, 1, and ends on Saturday, 7.

Next we create a macro variable, iwe, which holds the value we wish to to return. We want to return 1 (true) if the date is a weekend, or 0 (false) if it's not. Initially we set the value to 0 (false) as most days of the week return false.

The next part checks if the weekday is Sunday (1) or Saturday(7) and if the date's day is a weekend it sets iwe to 1 (true). If it's not a weekend then iwe already holds 0 (false).

Finally, we return the value held in iwe (either 1 or 0) by just writing the variable &iwe. by itself - note that you should NOT put a semicolon on the end otherwise it will throw an error. There is no "return" command - just put the variable name on a line of its own.

That's it. Easy right?

You can write many really useful macro functions that you will use often and stick them into a file and make a library of useful functions to %include and use in your projects. For example, for my job I wrote a macro called %isWeekend(date), like the one above (though as I'm writing this from home I couldn't just copy and paste and I wrote the above on the fly from memory), another macro called %getNext(day) which returns the date of the next weekday you give it, %getLast(day) which is similar but looks backwards, %isHoliday(date) and %getLastWorkday(date). I stuck those into a SAS file, a library, and now include it into my other scripts when I need the functionality. This approach promotes code re-use and makes coding quicker and easier and as there is less duplicate code it is easier to maintain. Naturally, all my code is highly documented and I recommend you comment up your code as well.

I hope this little tutorial proves useful.

- Ahmad Retha