Monday 18 October 2010

Unique value row count in SAS

You may have come across the need to number the unique values in your dataset and store that number on each row of your dataset/table - something like this:

id, name
1, Alan
2, Brad
2, Brad
3, David
3, David
3, David
4, George
5, Joe
6, Steven
7, Zed
7, Zed

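The solution involves using the LAG() function. LAG is like a look-back function: each time it is called it returns the value it was given on the previous call, so when it runs once for every row it fetches the previous row's value. So, if I were at observation (row) 2 and did LAG(name), it would return the value of "Alan". The RETAIN statement (it is a statement, not a function) gives a variable an initial value and keeps that variable's value from one row to the next instead of resetting it to missing, so I can use it as a running counter. Note that this approach only works when identical names sit next to each other, i.e. the data is sorted or grouped by name.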

Anyway, here's the code:

data pupils;
    input name $;
    datalines;
Alan
Brad
Brad
David
David
David
George
Joe
Steven
Zed
Zed
;
run;


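data store;
    set pupils;

    /* name from the previous row (blank on the first row) */
    prevname = lag(name);

    format id BEST12.;
    /* keep id from row to row, starting at 0 */
    retain id 0;
    /* a change of name means a new unique value */
    if name ne prevname then id = id + 1;

    drop prevname;
run;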
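
For comparison, the same numbering can be sketched with BY-group processing instead of LAG(). This is an alternative, not the method used above. The dataset names pupils_sorted and store_by are made up for illustration, and the sort step is only there so that equal names are guaranteed to be adjacent.

```sas
/* Hypothetical names: pupils_sorted and store_by are illustrative only. */
proc sort data=pupils out=pupils_sorted;
    by name;
run;

data store_by;
    set pupils_sorted;
    by name;

    /* first.name is 1 on the first row of each group of equal names.   */
    /* The sum statement (id + 1) starts id at 0 and retains it across  */
    /* rows, so no separate RETAIN statement is needed.                 */
    if first.name then id + 1;
run;
```

The BY statement makes the sorted-input requirement explicit: SAS stops with an error if the data is not in BY order, whereas the LAG() version would silently hand out a new id every time a name reappears later in the data.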

Sunday 10 October 2010

Biggest Genome ever? It's plant vs amoeba

A few days ago Science magazine ran an article reporting that a plant (Paris japonica) had been found to have the largest genome ever measured, at 149 billion base pairs...

Except, I remembered there's an amoeba (Polychaos dubium - what a bloody weird name by the way!) with an even bigger genome of 670 billion base pairs.

So it got me wondering - how could Science magazine get it so wrong?! Well, this comment on the Science magazine site gave the reason why they would reject dubium:


"While this and some other Amoeba have been reported to have such very large genomes, some caveats regarding their reliability are perhaps in order.
The measurement for Amoeba dubia and other protozoa which have been reported to have very large genomes were made in the 1960s using a rough biochemical approach which is now considered to be an unreliable method for accurate genome size determinations. The method uses whole cells rather than isolated nuclei and thus will include not only DNA from the mitochondria but also any DNA in engulfed food organisms. Also some of the species are multinucleate.
The accuracy of the genome size estimates are also called into question given that a related species, Amoeba proteus, which was reported to have a genome size of 300 pg was more recently shown to be an order of magnitude smaller (34 - 43 pg DNA per cell). Like the situation for dinoflagellates (see below), the genomes of these Amoeba are clearly large but to know just how big requires their genome sizes to be estimated using modern best practice techniques available today. Only then will it be possible to know just how their genomes compare in size to those of Paris japonica." {Quoted from Ilia}
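
Most of the reasons given in the quote can be set aside; the clincher is the Amoeba proteus re-measurement, which came out an order of magnitude smaller than first reported. If we assume the same experimental conditions and the same tenfold overestimate, dubium's 670 billion base pairs (roughly 685 pg, at about 0.978 billion base pairs per pg) shrink to about 69 pg of DNA. That is well below the roughly 152 pg that 149 billion base pairs works out to for Paris japonica.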

Game over, the plant wins.