Friday, 1 December 2017
Getting DNA out of a FASTA file by position or chromosome id
For extracting a chromosome from a genome file - like the 900MB GRCh37.gz file provided by the 1000 Genomes Project, which contains all the human chromosomes in one file - I created a program that writes individual chromosomes out to a FASTA file. You don't even need to decompress the original .gz file first - it handles it as is. Check it out here: https://github.com/webmasterar/extractChromosome
For extracting a string of bases from a chromosome stored in a FASTA file just by providing the start and end positions, I created a little Python helper function called getRef() and you can access it here: https://gist.github.com/webmasterar/3a60155d4ddc8595b17fa2c62893dbb0
It is easy to use and takes three arguments: getRef(Chr_File, Start_Pos, End_Pos).
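As a rough illustration only (not the code in the gist itself), here is a minimal sketch of what a helper with that signature might do, assuming the FASTA file holds a single sequence and that Start_Pos and End_Pos are 1-based, inclusive coordinates:

# Minimal sketch - the real getRef() lives in the gist linked above.
# Assumptions: single-sequence FASTA file, 1-based inclusive coordinates.
def get_ref(chr_file, start_pos, end_pos):
    parts = []
    with open(chr_file) as f:
        for line in f:
            if not line.startswith('>'):   # skip the FASTA header line
                parts.append(line.strip())
    seq = ''.join(parts)
    return seq[start_pos - 1:end_pos]      # convert to a 0-based slice

# e.g. print(get_ref('chr21.fa', 1000000, 1000049))  # 50 bases from chr21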
Saturday, 6 August 2016
Pattern matching with degenerate characters and don't-cares using bit-vector encoding
In pattern matching we have a text T of length n and a pattern X of length m ≤ n. In exact pattern matching we want to find X in T exactly as it is. In approximate pattern matching we want to find X in T but it does not need to match exactly. We define a threshold k, the maximum number of edit operations that may be applied to X to make it match a part of T. If more than k changes are required to make X match at a given position of T, we conclude that X does not occur there and move on to look at the next position.
Exact pattern matching is easily solved using the naive method in worst-case O(nm) time. Of course, better algorithms exist to reduce this complexity. The KMP algorithm is famous but it is not the best one to use in practice. Other algorithms include Shift-And/Shift-Or, Horspool, BNDM and BOM, of which the last two are among the most effective. They achieve their efficiency using a combination of text skipping - moving past parts of the text that cannot possibly contain the pattern - and bit-parallelism. Bit-parallelism takes advantage of the processor's ability to quickly recall items from registers and cache and manipulate them in parallel using simple shift operations and bitwise logic. Exact pattern matching can also be sped up using indexing methods such as Aho-Corasick automata or the suffix tree/array approach, which speed up multiple pattern look-ups.
Approximate pattern matching traditionally relies on Dynamic Programming approaches which have an O(nm) time complexity. Improvements often rely on bit-parallelism, on filtering out parts of the text to avoid unnecessary processing, and on various clever tricks that exploit a small k.
In this blog post I will show how to simulate approximate pattern matching using the naive exact pattern matching algorithm. One presumes it is possible to use better exact pattern-matching algorithms to achieve faster results, but the goal of this post is not to make the ultimate algorithm. Instead, it is to explain how to use bit-parallelism to achieve pattern matching for degenerate and don't-care characters.
I am using the DNA/RNA alphabet for this example. DNA consists of the characters A, C, G and T. RNA replaces T with U, making its alphabet A, C, G, U. The characters represent nucleotide bases; for example, A stands for Adenine. We introduce a few extra degenerate characters and a don't-care character (N) as defined in the IUPAC alphabet standard:
A.................Adenine
C.................Cytosine
G.................Guanine
T.................Thymine
U.................Uracil
R.................A or G
Y.................C or T
S.................G or C
W.................A or T
K.................G or T
M.................A or C
B.................C or G or T
D.................A or G or T
H.................A or C or T
V.................A or C or G
N.................any base
I call a character like R or B degenerate, meaning it can represent more than one alphabet character. I call N a don't-care character because it can represent any base. Note that there are 4 core letters (A, C, G and T/U), assuming that T is equivalent to U, and 15 characters altogether. So why do we have this special IUPAC standard? Well, one reason is that there are enzymes that recognise a segment of DNA with some flexibility. The restriction enzyme EcoRII's specificity subunit recognises and cuts DNA when it finds the site CCWGG, where W can be A or T. It is useful for biologists to know where it makes its cuts in the genome. Another reason the IUPAC alphabet is useful is to describe DNA sequences where the exact identity of a base is unclear or may contain errors from when it was first sequenced.
We can represent every IUPAC character using a binary representation different from the ASCII standard. Through this binary representation and clever use of bit masks it is possible to make R match either A or G, and N match any base.
First I create an array containing 122 elements whose indices match the ASCII values of the letters above. For example, 'A' is index 65, 'B' is index 66 and so on. I could have made this a smaller array covering only the range A-Y, but I do not like subtracting an offset all the time in my code. Most of the positions will hold the value 0, but for A, I define its value to be 1, which is 0001 in binary. In fact, I define the 4 primary nucleotide bases as follows:
NUC['A'] = 1; //00000001
NUC['C'] = 2; //00000010
NUC['G'] = 4; //00000100
NUC['T'] = 8; //00001000
NUC['U'] = 8; //00001000
Notice that the position of the binary 1 makes each character distinct. For each distinct core character a new bit is needed; with an effective alphabet size of 4, everything fits into the low nibble of a byte.
Now we design the degenerate characters by making them bit-masks of the characters they can represent.
NUC['R'] = 5; //00000101 - a/g
NUC['Y'] = 10; //00001010 - c/t
NUC['S'] = 6; //00000110 - g/c
NUC['W'] = 9; //00001001 - a/t
NUC['K'] = 12; //00001100 - g/t
NUC['M'] = 3; //00000011 - a/c
NUC['B'] = 14; //00001110 - c/g/t
NUC['D'] = 13; //00001101 - a/g/t
NUC['H'] = 11; //00001011 - a/c/t
NUC['V'] = 7; //00000111 - a/c/g
NUC['N'] = 15; //00001111 - a/c/g/t
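Since the search snippets below are written in Python, here is the same table written as a Python dictionary (the values are exactly those defined above; a dict lookup simply stands in for the 122-element ASCII-indexed array):

# Bit-codes for the IUPAC nucleotide alphabet (same values as above)
NUC = {
    'A': 1, 'C': 2, 'G': 4, 'T': 8, 'U': 8,              # core bases
    'R': 5, 'Y': 10, 'S': 6, 'W': 9, 'K': 12, 'M': 3,    # two-base codes
    'B': 14, 'D': 13, 'H': 11, 'V': 7,                   # three-base codes
    'N': 15,                                             # don't care
}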
Before we can use this, I need to describe how to find an exact pattern in a text using the regular method. Let's say we have a text T = "GACCAGGAG" and a pattern X = "CCWGG" and we need to find X in T. But because the naive method cannot interpret the degenerate W character in X, it would not be sensible to use X as is. In fact, you would be forced to do two searches, once using CCAGG and once using CCTGG. This is the naive search method:
i = 0
while i <= n - m:
    j = 0
    while j < m and T[i+j] == X[j]:
        j += 1
    if j == m:
        print "Match found at position: ", i
    i += 1
We use the variable i to keep track of the position in T and j the position in X. The outer while loop moves the pointer i through T one position at a time. For every position in T, the inner while loop tries to move through X: it keeps advancing as long as the characters of T starting at position i match the corresponding characters of X. If j reaches the length of the pattern, we record a match. If there is a mismatch before j == m, the inner loop ends and we move to the next position in T to keep looking for the pattern.
OK. If that is sufficiently clear, we can modify the above exact pattern matching strategy to match degenerate characters and don't cares based on the following principles.
Each core character (A, C, G and T/U) in the NUC array we defined above has a unique binary code, with its 1 bit in a different position. We use bitwise AND (&) to check whether one character matches another.
NUC['A']: 00000001
& NUC['A']: 00000001
--------------------
00000001
NUC['A']: 00000001
& NUC['C']: 00000010
--------------------
00000000
So when A matches A, the resulting character is A and when it doesn't, it gives a different character (0). With degenerate characters we do:
NUC['A']: 00000001
& NUC['W']: 00001001
--------------------
00000001
NUC['T']: 00001000
& NUC['W']: 00001001
--------------------
00001000
NUC['C']: 00000010
& NUC['W']: 00001001
--------------------
00000000
As you can see, W matches either A or T to give back the character it was tested against, and when compared with C, a character it does not represent, the result is zero. The don't-care character N will match A, C, G or T/U because its bit pattern is 00001111. So we modify the condition of the inner loop of the naive algorithm: a character from X, whether it is a core base or a degenerate character, matches a character from T if ANDing the two gives back the character from T.
For ease, I create an array P to hold the binary representation of the characters in X as defined in the NUC array.
P = []
for j in range(m):
    P.append(NUC[X[j]])
Then I can easily use it in the modified naive method, which this time supports degenerate and don't care characters:
i = 0
while i <= n - m:
    j = 0
    while j < m and (NUC[T[i+j]] & P[j]) == NUC[T[i+j]]:
        j += 1
    if j == m:
        print "Match found at position: ", i
    i += 1
So there it is. I searched with X = "CCWGG" and it found CCAGG in T. If I had CCTGG in T it would have found it as well. Also, if I didn't care about the value of one or two of the letters, then I could have used N in the pattern and it would have found the pattern I was looking for.
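To tie it all together, here is a small self-contained Python sketch of the whole procedure, using the example text and pattern from above (an illustration of the technique, not an optimised implementation):

# Degenerate pattern matching with the naive method and IUPAC bit-codes
NUC = {'A': 1, 'C': 2, 'G': 4, 'T': 8, 'U': 8,
       'R': 5, 'Y': 10, 'S': 6, 'W': 9, 'K': 12, 'M': 3,
       'B': 14, 'D': 13, 'H': 11, 'V': 7, 'N': 15}

def search(T, X):
    n, m = len(T), len(X)
    P = [NUC[c] for c in X]                # bit-codes of the pattern
    matches = []
    for i in range(n - m + 1):
        j = 0
        # a text character matches if ANDing it with the pattern
        # character gives the text character back
        while j < m and (NUC[T[i + j]] & P[j]) == NUC[T[i + j]]:
            j += 1
        if j == m:
            matches.append(i)
    return matches

print(search("GACCAGGAG", "CCWGG"))        # -> [2]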
The naive method is limited to O(nm) worst-case time, but if this technique is incorporated into a better exact string matching algorithm, it will be able to search much more quickly, with the added benefit of finding patterns over a degenerate character set that supports don't cares.
Wednesday, 10 March 2010
Ancient DNA
http://scienceblogs.com/notrocketscience/2010/03/dna_from_the_largest_bird_ever_sequenced_from_fossil_eggshel.php
http://www.newscientist.com/article/mg14119104.600-fact-fiction-and-fossil-dna-analysis-of-ancient-dna-should-give-clues-about-the-origin-of-species-and-how-they-evolved-over-time-but-only-if-the-dna-really-is-ancient.html?full=true
http://findarticles.com/p/articles/mi_m1200/is_n21_v146/ai_15951710/
http://www.dailymail.co.uk/sciencetech/article-1026340/Jurassic-Park-comes-true-How-scientists-bringing-dinosaurs-life-help-humble-chicken.html
Monday, 1 February 2010
I, virus: Why you're only half human
WHEN, in 2001, the human genome was sequenced for the first time, we were confronted by several surprises. One was the sheer lack of genes: where we had anticipated perhaps 100,000 there were actually as few as 20,000. A bigger surprise came from analysis of the genetic sequences, which revealed that these genes made up a mere 1.5 per cent of the genome. This is dwarfed by DNA deriving from viruses, which amounts to roughly 9 per cent.
On top of that, huge chunks of the genome are made up of mysterious virus-like entities called retrotransposons, pieces of selfish DNA that appear to serve no function other than to make copies of themselves. These account for no less than 34 per cent of our genome.
All in all, the virus-like components of the human genome amount to almost half of our DNA. This would once have been dismissed as mere "junk DNA", but we now know that some of it plays a critical role in our biology. As to the origins and function of the rest, we simply do not know.
The human genome therefore presents us with a paradox. How does this viral DNA come to be there? What role has it played in our evolution, and what is it doing to our physiology? To answer these questions we need to deconstruct the origins of the human genome - a story more fantastic than anything we previously imagined, with viruses playing a bigger part than you might care to believe.
Around 15 years ago, when I was researching my book Virus X, I came to the conclusion there was more to viruses than meets the eye. Viruses are often associated with plagues - epidemics accompanied by great mortality, such as smallpox, flu and AIDS. I proposed that plague viruses also interact with their hosts in a more subtle way, through symbiosis, with important implications for the evolution of their hosts. Today we have growing evidence that this is true (New Scientist, 30 August 2008, p 38), and overwhelming evidence that viruses have significantly changed human evolution.
Symbiosis was defined by botanist Anton de Bary in 1878 as the living together of dissimilar organisms. The partners are known as symbionts and the sum of the partnership as the holobiont. Types of symbiotic relationships include parasitism, where one partner benefits at the expense of the other, commensalism, where one partner profits without harming the other, and mutualism, in which both partners benefit.
Symbiotic relationships have evolutionary implications for the holobiont. Although selection still operates on the symbionts at an individual level since they reproduce independently, it also operates at partnership level. This is most clearly seen in the pollination mutualisms involving hummingbirds and flowers, where the structure of flower and bill have co-evolved to accommodate each other and make a perfect fit. When symbiosis results in such evolutionary change it is known as symbiogenesis.
Viruses as partners
Symbiosis works at many different levels of biological organisation. At one end of the spectrum is the simple exchange of metabolites. Mycorrhizal partnerships between plant roots and fungi, which supply the plant with minerals and the fungus with sugars, are a good example. At the other end are behavioural symbioses typified by cleaning stations where marine predators line up to have their mouths cleared of parasites and debris by fish and shrimps.
Symbiosis can also operate at the genetic level, with partners sharing genes. A good example is the solar-powered sea slug Elysia chlorotica, which extracts chloroplasts from the alga it eats and transfers them to cells in its gut where they supply the slug with nutrients. The slug's genome also contains genes transferred from the alga, without which the chloroplasts could not function. The slug genome can therefore be seen as a holobiont of slug genes and algal genes.
This concept of genetic symbiosis is crucial to answering our question about the origin of the human genome, because it also applies to viruses and their hosts. Viruses are obligate parasites. They can only reproduce within the cells of their host, so their life cycle involves forming an intimate partnership. Thus, according to de Bary's definition, virus-host interactions are symbiotic.
For many viruses, such as influenza, this relationship is parasitic and temporary. But some cause persistent infections, with the virus never leaving the host. Such a long-term association changes the nature of the symbiosis, making the evolution of mutualism likely. This process often follows a recognisable progression I have termed "aggressive symbiosis".
An example of aggressive symbiosis is the myxomatosis epidemic in rabbits in Australia in the 1950s. The European rabbit was introduced into Australia in 1859 as a source of food. Lacking natural predators, the population exploded, leading to widespread destruction of agricultural grassland. In 1950, rabbits infected with myxoma virus were deliberately released into the wild. Within three months, 99.8 per cent of the rabbits of south-east Australia were dead.
Although the myxomatosis epidemic was not planned as an evolutionary experiment, it had evolutionary consequences. The myxoma virus's natural host is the Brazilian rabbit, in which it is a persistent partner causing no more than minor skin blemishes. The same is now true of rabbits in Australia. Over the course of the epidemic the virus selected for rabbits with a minority genetic variant capable of surviving infection. Plague culling was followed by co-evolution, and today rabbit and virus coexist in a largely non-pathogenic mutualism.
Now imagine a plague virus attacking an early human population in Africa. The epidemic would have followed a similar trajectory, with plague culling followed by a period in which survivors and virus co-evolved. There is evidence that this happened repeatedly during our evolution, though when, and through what infectious agents, is unknown (Proceedings of the National Academy of Sciences, vol 99, p 11748).
Even today viral diseases are changing the course of human evolution. Although the plague culling effect is mitigated by medical intervention in the AIDS pandemic, we nevertheless observe selection pressure on humans and virus alike. For example, the human gene HLA-B plays an important role in the response to HIV-1 infection, and different variants are strongly associated with the rate of AIDS progression. It is therefore likely that different HLA-B alleles impose selection pressure on HIV-1, while HLA-B gene frequencies in the population are likely to be influenced by HIV (Nature, vol 432, p 769). This is symbiogenesis in action.
How does that move us closer to understanding the composition of the human genome? HIV-1 is a retrovirus, a class of RNA virus that converts its RNA genome into DNA before implanting it into host chromosomes. This process, known as endogenisation, converts an infectious virus into a non-infectious endogenous retrovirus (ERV). In humans, ERVs are called HERVs.
Germline invaders
Endogenisation allows retroviruses to take genetic symbiosis to a new level. Usually it is an extension of the normal infectious process, when a retrovirus infects a blood cell, such as a lymphocyte. But if the virus happens to get incorporated in a chromosome in the host's germ line (sperm or egg), it can become part of the genome of future generations.
Such germ-line endogenisation has happened repeatedly in our own lineage - it is the source of all that viral DNA in our genome. The human genome contains thousands of HERVs from between 30 and 50 different families, believed to be the legacy of epidemics throughout our evolutionary history. We might pause to consider that we are the descendants of the survivors of a harrowing, if brutally creative, series of viral epidemics.
Endogenisation is happening right now in a retroviral epidemic that is spreading among koalas in Australia. The retrovirus, KoRv, appeared about 100 years ago and has already spread through 75 per cent of the koala's range, culling animals on a large scale and simultaneously invading the germ line of the survivors.
Retroviruses don't have a monopoly on endogenisation. Earlier this month researchers reported finding genes from a bornavirus in the genomes of several mammals, including humans, the first time a virus not in the retrovirus class has been identified in an animal genome. The virus appears to have entered the germ line of a mammalian ancestor around 40 million years ago (Nature, vol 463, p 84). Many more such discoveries are anticipated, perhaps explaining the origin of some of that mysterious half of the genome.
The ability of viruses to unite, genome-to-genome, with their hosts has clear evolutionary significance. For the host, it means new material for evolution. If a virus happens to introduce a useful gene, natural selection will act on it and, like a beneficial new mutation, it may spread through the population.
Could a viral gene really be useful to a mammal? Don't bet against it. Retroviruses have undergone a long co-evolutionary relationship with their hosts, during which they have evolved the ability to manipulate host defences for their own ends. So we might expect the genes of viruses infecting humans to be compatible with human biology.
This is also true of their regulatory DNA. A virus integrating itself into the germ line brings not just its own genes, but also regulatory regions that control those genes. Viral genomes are bookended by regions known as long terminal repeats (LTRs), which contain an array of sequences capable of controlling not just viral genes but host ones as well. Many LTRs contain attachment sites for host hormones, for example, which probably evolved to allow the virus to manipulate host defences.
Retroviruses will often endogenise repeatedly throughout the host genome, leading to a gradual accumulation of anything up to 1000 ERVs. Each integration offers the potential of symbiogenetic evolution.
Once an ERV is established in the genome, natural selection will act on it, weeding out viral genes or regulatory sequences that impair survival of the host, ignoring those that have no effect, and positively selecting the rare ones that enhance survival.
Most ERV integrations will be negative or have no effect. The human genome is littered with the decayed remnants of such integrations, often reduced to fragments, or even solitary LTRs. This may explain the origin of retrotransposons. These come in two types: long and short interspersed repetitive elements (LINEs and SINEs), and it now appears likely that they are heavily degraded fragments of ancient viruses.
As for positive selection, this can be readily confirmed by looking for viral genes or regulatory sequences that have been conserved and become an integral part of the human genome. We now know of many such sequences.
The first to be discovered is the remnant of a retrovirus that invaded the primate genome a little less than 40 million years ago and gave rise to what is known as the W family of ERVs. The human genome has roughly 650 such integrations. One of these, on chromosome 7, contains a gene called syncytin-1, which codes for a protein originally used in the virus's envelope but now critical to the functioning of the human placenta. Expression of syncytin-1 is controlled by two LTRs, one derived from the original virus and another from a different retrovirus called MaLR. Thus we have a quintessential viral genetic unit fulfilling a vitally important role in human biology.
Virus genes
There are many more examples. Another gene producing a protein vital to the construction of the placenta, syncytin-2, is also derived from a virus, and at least six other viral genes contribute to normal placental function, although their precise roles are poorly understood.
There is also tentative evidence that HERVs play a significant role in embryonic development. The developing human embryo expresses genes and control sequences from two classes of HERV in large amounts, though their functions are not known (Virology, vol 297, p 220). What is more, disrupting the action of LINE retrotransposons by administration of the drug nevirapine causes an irreversible arrest in development in mouse embryos, suggesting that LINEs are somehow critical to early development in mammals (Systems Biology in Reproductive Medicine, vol 54, p 11).
It also appears that HERVs play important roles in normal cellular physiology. Analysis of gene expression in the brain suggests that many different families of HERV participate in normal brain function. Syncytin-1 and syncytin-2, for example, are extensively expressed in the adult brain, though their functions there have yet to be explored.
Other research groups have found that 25 per cent of human regulatory sequences contain viral elements, prompting suggestions that HERVs make a major contribution to gene regulation (Trends in Genetics, vol 19, p 68). In support of that, HERV LTRs have been shown to be involved in the transcription of important proteins. For example, the beta-globin gene, which codes for one of the protein components of haemoglobin, is partly under the control of an LTR derived from a retrovirus.
The answer to our paradox is now clear: the human genome has evolved as a holobiontic union of vertebrate and virus. It is hardly surprising that researchers who have made these discoveries are now calling for a full-scale project to assess the contribution of viruses to our biology (BMC Genomics, vol 9, p 354).
It is also probable that this "virolution" is continuing today. HIV belongs to a group of retroviruses called the lentiviruses. Until recently virologists thought that lentiviruses did not endogenise, but now we know that they have entered the germ lines of rabbits and the grey mouse lemur. That suggests that HIV-1 might have the potential to enter the human germ line (Proceedings of the National Academy of Sciences, vol 104, p 6261 and vol 105, p 20362), perhaps taking our evolution in new and unexpected directions. It's a plague to us - but it could be vital to the biology of our descendants.
http://www.newscientist.com/article/mg20527451.200-i-virus-why-youre-only-half-human.html?full=true
Thursday, 17 December 2009
How Long are your Telomeres?
One can think of the telomere as the aglet on the end of a shoelace, the protective cap on the end of a length of DNA - it's what stops chromosome ends linking up end to end. The region of DNA at either end of a chromosome is actually a length of G-rich DNA with the pattern TTAGGG repeating over and over again for a length of 5-15kb, finishing off with a long G tail of 50-300b. A protein complex, shelterin, binds to this section of DNA and helps the length of DNA form a T-loop, which is a bit like a knot at the end of a shoelace.
It is generally well known in the scientific community that the length of the telomere is related to the lifespan of a cell or multicellular organism - it regulates how many times cells can replicate before they stop dividing, go into senescence and eventually die. Generally a human cell will go through 40-60 cycles or thereabouts (the Hayflick limit). The length of the telomeres reduces with successive rounds of the cell cycle, and once they reach a critical length the cell stops dividing.
This year's (2009) Nobel Prize in Physiology or Medicine was awarded to Elizabeth H. Blackburn, Carol W. Greider and Jack W. Szostak for discovering how chromosomes are protected by telomeres and the enzyme telomerase. One of their key findings was that telomeres are not shortened with every cell cycle.
It is not known exactly to what extent the length of telomeres or the rate of erosion affects people's life expectancy. The general rule is that if you start off with long telomeres and/or telomere shortening is slow, then your life expectancy will be longer. But life expectancy and longevity have multiple factors governing them.
Telomerase is the enzyme that keeps the telomere long. Some species even produce enough of it to lengthen their telomeres as the cell cycle turns, so they live longer lives, but they still die eventually. Telomerase and long telomeres may play a larger role in slowing the ageing process than in extending life.
Because telomere erosion occurs consistently during the cell cycle, it can be used to estimate age in species where the age of an animal is otherwise difficult to determine. Between species, telomere lengths and rates of erosion are quite different. Compared to many animals, Homo sapiens has a really long life considering that human telomeres are only 8-12kb in length or less at birth and shorten by 2-4kb over a lifetime. The rate of telomere erosion plays a role, but looking at the graph below it is not so clear cut that long telomeres extend life, especially when compared with our closest living relative, the chimpanzee (Pan). And one also wonders about trees - some trees are thousands of years old and still going! (see the B.E. Flanary paper below)

The role of telomeres is to maintain the chromosome package. In bacteria mutations are not a big deal, but in higher organisms they can cause serious harm and even lead to cancer. The Hayflick limit governed by telomere length is seen as a safety mechanism which limits the number of cycles a cell goes through in order to make sure the quality of the DNA does not deteriorate, because with age comes an increased risk of mutation and cancer development.
Recommended papers:
C.M. Vleck et al., 2003, The natural history of telomeres: tools for aging animals and exploring the aging process.
Y. Deng et al., 2008, Telomere dysfunction and tumour suppression: the senescence connection
P.M. Lansdorp, 2009, Telomeres and disease
B.E. Flanary, 2005, Analysis of telomere length and telomerase activity in tree species of various life-spans, and with age in the bristlecone pine Pinus longaeva
T.J. Vulliami, 2009, Premature Aging
C. Auriche et al., 2007, Budding yeast with human telomeres: A puzzling structure
Wednesday, 7 October 2009
Avoid Boring People - Avoid James Watson
I read this book, James Watson's autobiography, a few days ago. It was not very good, and it didn't particularly raise my respect for James Watson as a person or a scientist. The format of the book is the best thing about it - each chapter represents a stage in his life and at the end of each chapter a list of lessons and advice with a short explanation is given. Some of the advice is pretty good - if you plan to look through this book then just read the lessons that are relevant to you - no point in reading the whole thing...
James Watson's autobiography makes it blatantly clear that he did not struggle very hard to get where he is. He and his parents were middle class and he did not have any emotional or money troubles. He didn't have any family problems. He was intelligent, did well in school and met all the required grades to get into the colleges and universities, and he never had any problems funding any of this. He never failed at anything he did and was never forced to take any unsavoury career routes or make any large compromises. He also barely mentions anything about politics or the state of the nation and its effects on him.
He was always in education or in employment and he always had something to do and everyone around him supported him. The universities he worked for paid him well and funded all his research without debate (and this was before he discovered the structure of DNA). His choice of research was in the best field of its time (and now) - genetics. He gives advice concerning this along the lines of: Pick a research subject on the cutting edge of science. He says that picking a subject that has been studied extensively already or has little to contribute to humanity is not beneficial to an active scientist's career. I agree with him and can think of quite a few fields in science where funding should be cut off. But the point he was making is that if you plan to be a highly successful researcher, discover lots of new things and perhaps get recognition and/or a nobel prize, look at a subject where you explore uncharted territory that is beneficial to man (and could make you money).
As for his personal life, what can I say except that he never divulges anything about personal relationships with women or even his close friends (if he even had really close friends). He never mentions a single disagreement or serious debate he had with any person either personal or professional. He met with some of the most eminent scientists like Renato Dulbecco, Linus Pauling, Rosalind Franklin and a few others I've forgotten the names of, but he never mentions their personalities, quirks or how he interacted with them. Going back to his dealing with women... what women? He did not form any deep relationships with individuals and I think he remained a virgin till marriage or something - heck, he doesn't mention anything concerning emotions or attractions. He never had "women trouble" and he married one of his assistants because "when she wasn't there he missed her presence". He doesn't mention the courtship period or his anxieties or emotional strife (if any). Lazy & Boring.
The genius behind the discovery of the DNA double helix structure was their use of templates, initially paper cutouts, to fit together and see if the distances between the atoms in the molecules were acceptable. The fact that they used cutouts and arranged them in different ways is how they got to the answer. Sometimes visualisation is the key to discovery - imagine a large table of numerical values. As a table it might not look like there is a pattern in the data, but once you plot it on a chart and perhaps apply a function to it, you can discover a pattern and even model it. You would find it very hard to spot a trend just dealing with a list of numbers.
Nothing else is particularly interesting after that. He doesn't make any massive blunders, nor does he have any serious problems. Yeah, he won a Nobel Prize, which he was happy to get, but a Nobel comes quite a few years after the discovery and the scientist has moved on to a different challenge by then. He was happy to receive it but not ecstatic.
In summary, it's not a good book if you're looking to find out about his personality or personal life - he keeps his cards very close to his chest. He doesn't insult or mock anyone or bring up arguments or debates - either he is very cold emotionally or he wants to play it very safe. He doesn't describe himself, other people in character, quirks or appearance to make it interesting. The most controversial thing he ever did happened later on in his life a long time after the publication of this book when he said some things people saw as very racist (see here). If you do plan to look this book up there is no point in reading it fully - look at the photos and read the points at the end of each chapter instead. James Watson decided to call his autobiography "Avoid Boring People" and I agree - avoid James Watson.
I'm currently reading "Next" by Michael Crichton, a fiction book concerning genetics and its misuse. So far so good - I might write a post about this later. I might look at a Linus Pauling book next.
Thursday, 29 January 2009
Understanding Microarrays
Part 1
We'll use the experiment mentioned in the above multimedia example... Yeast cells can grow with or without oxygen. But in order to survive in or adapt to these conditions they have to create new proteins and also stop the production of other proteins that are not so useful in that condition.
If we place some cells in one condition (with oxygen) and then extract the mRNA from them, we can tell which proteins are being made. Cells grown with oxygen are our control condition.
If we extract the mRNA from the cells in the other condition (sans oxygen), then we know what proteins are being expressed in that specific state. Cells grown without oxygen are the experimental condition.
If we compare the proteins being made (or not being made) in the two conditions we can discover what transcription has started or stopped in the anaerobic state.
We use DNA microarrays to find this out. The microarray chip (a glass slide) has mRNA probes for every gene in the yeast genome attached to it. Making a microarray - attaching a probe for each gene to the surface of the slide - is a long process which we won't go into. Often, a ready-to-use microarray chip is available to buy from a company like Affymetrix.
Part 2
In order to find out which genes have been transcribed, we have to hybridise the mRNA of the samples from each condition to the microarray, and we have to label them in a way that identifies which mRNA came from which sample.
Since the microarray mRNA probes and the mRNA strands from the cells are the same sense, they cannot hybridise to each other, so we have to make complementary DNA (cDNA) strands from the mRNA of each sample.
The enzyme reverse transcriptase converts the mRNA to cDNA. The cDNA is made with fluorescently labelled nucleotides, so under the correct light the cDNA will glow. The cDNA has the complementary sequence of the mRNA, so if the mRNA were as shown below, the cDNA would be:
CUUUUUAUCCCCCGGGC - mRNA
GAAAAATAGGGGGCCCG - cDNA
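Just to spell out the base-pairing rule behind that conversion (an illustration of the complementarity only, not of the chemistry), here is a tiny Python sketch:

# Pair each mRNA base with its cDNA complement: A-T, U-A, G-C, C-G
PAIR = {'A': 'T', 'U': 'A', 'G': 'C', 'C': 'G'}

def cdna_from_mrna(mrna):
    return ''.join(PAIR[base] for base in mrna)

print(cdna_from_mrna('CUUUUUAUCCCCCGGGC'))   # -> GAAAAATAGGGGGCCCG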
Sample 1, the control, grown in aerobic conditions, is labelled green. The second sample, the experimental condition, grown anaerobically, is labelled with a red fluorescent dye. The mRNA is then digested away using RNase so we end up with pure cDNA.
The red and green cDNA is complementary to the mRNA probes on the microarray, so when the cDNA from both samples is applied to the slide, the strands quickly bind to their complementary probes. Anything that didn't attach is washed off.
Part 3
The microarray is scanned using a machine with two lasers that induce fluorescence from the red- and green-labelled strands. Pictures for each colour are stored on the computer and processed to measure the intensity of the fluorescence - the greater the intensity, the more cDNA is attached to the probes, which tells us that a particular gene is highly expressed. If the intensity is really weak, we can tell that that particular gene is barely expressed. The pictures/data can be combined to compare both conditions:
If a gene was expressed only in the control cells then a spot on the microarray would glow green.
If a gene was expressed only in the experimental cells (anaerobic) then a spot on the microarray would glow red.
If the gene was expressed in both conditions the colors green and red would mix to form a yellowy shade.
Genes that aren't expressed in either condition show as black since no light is emitted.
Since each gene of the yeast genome is a spot on the microarray, we know what gene each color-spot represents on the microarray so we can easily find out which genes are induced or repressed after the scan.
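As a rough illustration of how those two intensity channels are turned into per-gene calls (real analyses add normalisation and statistics on top of this), here is a toy Python sketch with made-up intensity values for hypothetical genes:

import math

# Made-up red/green spot intensities for a few hypothetical genes
spots = {
    'geneA': (5200.0, 310.0),   # red >> green: induced without oxygen
    'geneB': (290.0, 4800.0),   # green >> red: repressed without oxygen
    'geneC': (2100.0, 1900.0),  # similar: expressed in both (yellow spot)
    'geneD': (40.0, 55.0),      # both near background: not expressed (black)
}

for gene, (red, green) in spots.items():
    ratio = math.log(red / green, 2)   # log2 ratio: + induced, - repressed
    print(gene, round(ratio, 2))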
Simple, sort of :). Data retrieved from microarray analysis is usually processed using programs like R with the Bioconductor packages. There's a lot of statistics behind the analysis of the data, so most people who deal with this stuff are specialists.
Thursday, 1 January 2009
Primer Summary
Tuesday, 16 December 2008
Primer 2: Genomes
Prokaryotic-celled species such as bacteria have a single long strand of DNA floating freely in the cytoplasm of the cell. Viruses have an RNA (or DNA) genome packaged in their capsid (head).
Eukaryotic-celled species such as mammals have numerous chromosomes with DNA wound onto them. Not all the DNA is useful - in humans only about 5% of the 3 billion bases codes for proteins! The coding parts are called exons. The rest of the DNA is often called "junk DNA", but some people think it still plays a role in the cell; they just aren't exactly sure what. Non-coding parts are called introns.
The first genome ever to be sequenced was that of a virus - the bacteriophage MS2, a virus that infects bacteria. It had an RNA genome of 3569 bases. Viruses are actually not living organisms; they are structures made of proteins with some genetic material stored in the head. When a virus manages to latch on to a cell, it creates a pore in the cell membrane and inserts its genome, which the host cell unwittingly takes up and replicates to create more viruses. Viruses tend to cause cell death, and the cell can burst or bud off packages to release the virus copies. No one is really sure where viruses came from or how they came to be.
The first genome of a living organism to be sequenced was that of the bacterium Haemophilus influenzae in 1995, done using the shotgun method as a proof of concept. It had 1.83 million bases (Mb) on one chromosome!
The shotgun method involves cutting up the whole genome using a restriction enzyme (an enzyme that cuts DNA when it finds a specific base sequence), then sequencing the pieces and sticking them back together like a puzzle. The reading of the bases and the puzzle solving is done by computers. For example, say we cut the genome and get lots of strands that overlap each other - we line them up like so:
ATCCCGGATGCTCTG
-----------ATGCTCTGAAAAAATTCCCCCC
The hyphens represent spaces, to fit the overlap, in order to align the sequences. The first fragment shares some of the code from the second fragment so we assume they come from the same place and so in reality the original DNA strand contains the code ATCCCGGATGCTCTGAAAAAATTCCCCCC. The computer does this lots of times with lots of fragments and eventually everything lines up and it publishes the final sequence.
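Here is a toy Python sketch of that overlap-and-merge step, using the two fragments above (real assemblers also have to cope with sequencing errors, reverse complements and millions of reads):

def merge_if_overlapping(left, right, min_overlap=5):
    # Merge two reads if a suffix of `left` matches a prefix of `right`,
    # trying the longest possible overlap first.
    for length in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return left + right[length:]
    return None   # no sufficiently long overlap found

print(merge_if_overlapping('ATCCCGGATGCTCTG', 'ATGCTCTGAAAAAATTCCCCCC'))
# -> ATCCCGGATGCTCTGAAAAAATTCCCCCC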
cf. Eric D. Green (2001), Strategies for the systematic sequencing of complex genomes, Nature Reviews Genetics, Vol 2, p 573 (here)
What's amazing is that the process is repeated over and over to fill any inconsistencies and gaps in the data and to validate the original sequencing. Some parts of the human genome were sequenced up to 12 times. When a draft sequence is published it is sometimes possible to find an N base in the middle of a sequence - this signifies that the particular base was either unclear or the data was not collected successfully. It is a long and laborious endeavour but well worth it, as it has led to the genomics revolution which we are currently at the start of - you will soon see so many discoveries thanks to the availability of the genomes of humans and model organisms such as mice, zebrafish, nematode worms, Drosophila flies and the Arabidopsis plant.
The first draft of the human genome was published in late 2000 and the final version was published in 2003. It was discovered that the human genome had between 25,000 and 30,000 genes, which was much fewer than anyone had predicted, since even simpler organisms like the rice plant have about 30,000 genes (but a smaller genome). Finally, take a look at the table of species that have been sequenced and look at how huge the genome of the single-celled Amoeba is!!! (here)
A good resource, some slides, although slightly inaccurate, can be seen here.