Ahmad's Blog

Thursday, 29 January 2009

Understanding Microarrays

Ever tried to understand DNA Microarrays and how they work? Well, here is a brilliant multimedia animation to explain the process:

http://gcat.davidson.edu/Pirelli/index.htm

Microarrays are used to measure the gene expression of cells in different conditions. When a cell becomes cancerous, for example, some genes are induced (transcription increases), other genes are repressed (transcription decreases) and for other genes nothing changes. Cells respond to different conditions, some environmental (e.g. sunburn) and some chemically induced (e.g. taking heroin). Medicines and their effects on gene expression are prime candidates for microarray analysis, so researchers can find out why people respond differently to the same drug - which genes are up-regulated or down-regulated in the presence of the drug.

Part 1

We'll use the experiment from the multimedia example above. Yeast cells can grow with or without oxygen, but in order to survive in or adapt to each condition they have to make new proteins and stop producing other proteins that are not so useful in that condition.

If we place some cells in one condition (with oxygen) and then extract the mRNA from them we can tell which proteins are being made. Cells in oxygen are our control condition.
If we extract the mRNA from the cells in the other condition (without oxygen) then we know which proteins are being expressed in that specific state. Cells without oxygen are the experimental condition.

If we compare the proteins being made (or not being made) in the two conditions we can discover which genes have been switched on or off in the anaerobic state.

We use DNA microarrays to find this out. The microarray chip (a glass slide) has DNA probes attached to it, one spot for every gene in the yeast genome. To make a microarray, a DNA probe carrying each gene's sequence is fixed to the surface of the slide at a known position. This is a long process which we won't go into. Often, a ready-to-use microarray chip is available to buy from a company like Affymetrix.

Part 2

In order to find out which genes have been transcribed we have to bind a copy of each sample's mRNA to the microarray, labelled in a way that identifies which sample it came from.

The probes on the microarray carry the same sequence as the mRNA strands from the cells, so the two cannot pair up with each other. Instead we make complementary DNA (cDNA) strands from the mRNA of each sample.

The enzyme reverse transcriptase converts the mRNA to cDNA. The cDNA is made with fluorescently-labelled nucleotides so under the correct light the cDNA will glow. The cDNA has the complementary sequence of the mRNA, so if the mRNA was as shown below, the cDNA would be:

CUUUUUAUCCCCCGGGC - mRNA
GAAAAATAGGGGGCCCG - cDNA
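The pairing above can be checked with a few lines of Python. This is only a minimal sketch: the function name mrna_to_cdna is made up for this post, and it pairs the bases position by position exactly as the two lines above are written, ignoring strand direction.

```python
# Base-pairing rules when RNA is copied into DNA (reverse transcription).
RNA_TO_CDNA = {"A": "T", "U": "A", "C": "G", "G": "C"}


def mrna_to_cdna(mrna: str) -> str:
    """Return the cDNA base that pairs with each mRNA base, in the same order."""
    return "".join(RNA_TO_CDNA[base] for base in mrna.upper())


print(mrna_to_cdna("CUUUUUAUCCCCCGGGC"))  # GAAAAATAGGGGGCCCG
```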

Sample 1, the control, in aerobic conditions, is labelled green. The second sample, the experimental condition, anaerobic, is labelled with a red fluorescent dye. The mRNA is broken down using RNase so we end up with pure cDNA.

The red and green cDNA is complementary to the probes on the microarray, so when the cDNA from both samples is squirted onto the microarray slide it binds to its complementary strands. Anything that didn't attach is washed off.

Part 3

The microarray is scanned using a machine that has two lasers that induce fluorescence from the red- and green-labelled strands. Pictures for each color are stored on the computer and processed to measure the intensity of the fluorescence - the greater the intensity, the more cDNA is attached to the probes, and this tells us that a particular gene is highly expressed. If the intensity is really weak we can tell that that particular gene is barely expressed. The pictures/data can be combined to compare both conditions:

If a gene was expressed only in the control cells then a spot on the microarray would glow green.
If a gene was expressed only in the experimental cells (anaerobic) then a spot on the microarray would glow red.
If the gene was expressed in both conditions the colors green and red would mix to form a yellowy shade.
Genes that aren't expressed in either condition show as black since no light is emitted.
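The four cases above boil down to a simple rule on the two intensities measured for a spot. Here is a rough sketch of that rule; the function name spot_colour and the threshold of 100 are assumptions for illustration only, since real scanners and software use their own scales and background corrections.

```python
def spot_colour(green: float, red: float, threshold: float = 100.0) -> str:
    """Name the colour a spot would appear, given its two channel intensities.

    `threshold` is a hypothetical cut-off for "this channel is lit up".
    """
    green_on = green >= threshold  # expressed in the control (aerobic) cells
    red_on = red >= threshold      # expressed in the experimental (anaerobic) cells
    if green_on and red_on:
        return "yellow"
    if green_on:
        return "green"
    if red_on:
        return "red"
    return "black"


print(spot_colour(green=800, red=20))   # green
print(spot_colour(green=15, red=950))   # red
print(spot_colour(green=700, red=650))  # yellow
print(spot_colour(green=10, red=5))     # black
```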

Since each gene of the yeast genome is a spot on the microarray, we know what gene each color-spot represents on the microarray so we can easily find out which genes are induced or repressed after the scan.
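In practice the comparison is usually made on a log scale rather than by eye. The sketch below is one common way to do it, not the method from the animation: it takes the log2 of the red/green ratio for each spot and calls a gene induced or repressed when the change is at least two-fold. The gene names, intensities, pseudocount and cutoff are all made-up values for illustration.

```python
import math


def log2_ratio(red: float, green: float, pseudocount: float = 1.0) -> float:
    """log2(red / green), with a small pseudocount so empty spots don't divide by zero."""
    return math.log2((red + pseudocount) / (green + pseudocount))


def call_gene(red: float, green: float, cutoff: float = 1.0) -> str:
    """Label a gene by how its expression changed in the experimental (red) sample."""
    ratio = log2_ratio(red, green)
    if ratio >= cutoff:
        return "induced"
    if ratio <= -cutoff:
        return "repressed"
    return "unchanged"


# Hypothetical spots: gene name -> (green intensity, red intensity)
spots = {"GENE_A": (99.0, 899.0), "GENE_B": (899.0, 99.0), "GENE_C": (500.0, 500.0)}

for gene, (green, red) in spots.items():
    print(gene, call_gene(red=red, green=green))
```

With these numbers GENE_A comes out as induced, GENE_B as repressed and GENE_C as unchanged.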

Simple, sort of :) . Data retrieved from microarray analysis is usually processed using statistical software such as R with the Bioconductor packages. There's a lot of statistics behind the analysis of the data so most people who deal with this stuff are specialists.

Friday, 9 January 2009

Shaking the foundations of the central dogma (again)

The first shock that came to the creator of the central dogma (Francis Crick) was the discovery of retroviruses, which had a little trick of reverse transcribing their RNA genome into DNA. This was a shock because it was believed that the pathway from DNA to protein occurred in one direction and could not be reversed.

Now a new discovery has shaken the foundations yet again. It has always been believed that three nucleotide bases, a triplet codon, code for one type of amino acid... until now! In the marine ciliate Euplotes crassus (a single-celled eukaryote) it was discovered that one triplet (UGA) can code for either cysteine or selenocysteine. How the cell decides which amino acid is inserted is not clear but it is something amazing.

Read more here: http://scienceblogs.com/notrocketscience/2009/01/one_codon_two_amino_acids_the_genetic_code_has_a_shift_key.php

Edit (30-01-2009): Apparently the two-amino-acids-for-one-triplet-codon thing is not a new phenomenon. I just found this in Concepts of Genetics by Klug et al., Ch. 14.6, p361 -

"Only one codon, AUG, codes for methionine, and it is sometimes called the initiator codon. However, when AUG appears internally in mRNA, rather than at an initiating position, unformylated methionine is inserted into the polypeptide chain (instead of the initiator type of methionine, N-formylmethionine (fmet)). Rarely, another codon (in Bacteria), GUG, specifies methionine during initiation, though it is not clear why this happens."

Furthermore, in mitochondrial RNA (transcribed from mitochondrial DNA, mtDNA) there is another special behavior:

"In human mitochondria, AUA, which normally specifies isoleucine, directs the internal insertion of methionine. In yeast mitochondria, threonine is inserted instead of leucine when CUA is encountered in mRNA..."

In fact, there turned out to be many such examples over time. See table 14.5 on page 362.
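One way to picture these exceptions is as a standard lookup table with a few entries overridden. The sketch below builds the standard codon table and then applies only the two changes quoted above (AUA in human mitochondria, CUA in yeast mitochondria). The override tables are deliberately incomplete, and the test message is made up.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G order.
# One letter per amino acid; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Only the reassignments mentioned in the quotes above (not the full tables).
HUMAN_MITO = {**STANDARD, "AUA": "M"}  # isoleucine -> methionine
YEAST_MITO = {**STANDARD, "CUA": "T"}  # leucine -> threonine


def translate(mrna: str, table: dict) -> str:
    """Read the message three bases at a time until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = table[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


message = "AUGAUACUAUAA"  # made-up message: AUG AUA CUA UAA
print(translate(message, STANDARD))    # MIL
print(translate(message, HUMAN_MITO))  # MML
print(translate(message, YEAST_MITO))  # MIT
```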

It just goes to prove that whatever rule exists there is almost always something that breaks it. It also brings home the complexity and heterogeneity found in biological systems.

Thursday, 1 January 2009

Primer Summary

If you bothered to read the primers I wrote before, I salute you. However, I found a better resource that will present the subject much better and it has nice pictures with large text, animations and more! Someone clearly had a lot of time on their hands. It covers Classical Genetics, Molecules of Genetics and Genetic Organization and Control. Here it is: DNA From The Beginning.

Sunday, 28 December 2008

Software: Amino Acid Learning Aid (AALA)

I quickly wrote a little Java program that will help people learn their Amino Acids and it's called Amino Acid Learning Aid (AALA). It's pretty crude but it should be helpful. Remember, everything I write is Public Domain which means you can do what you want with it and I don't own it.

Pretty much everything you need to know about Amino Acids is laid out on this wiki page, which is also where I got the pictures: http://en.wikipedia.org/wiki/List_of_standard_amino_acids

Knowing the Amino Acids and their codes is to biologists what knowing the Periodic Table is to chemists but far more important ;p.
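To give a flavour of what "the codes" are, here is a small lookup of the one-letter and three-letter codes for the 20 standard amino acids. It is a Python sketch for illustration only, not the code inside AALA (which is Java), and the helper name describe is made up.

```python
# One-letter code -> (three-letter code, full name) for the 20 standard amino acids.
AMINO_ACID_CODES = {
    "A": ("Ala", "Alanine"),       "R": ("Arg", "Arginine"),
    "N": ("Asn", "Asparagine"),    "D": ("Asp", "Aspartic acid"),
    "C": ("Cys", "Cysteine"),      "E": ("Glu", "Glutamic acid"),
    "Q": ("Gln", "Glutamine"),     "G": ("Gly", "Glycine"),
    "H": ("His", "Histidine"),     "I": ("Ile", "Isoleucine"),
    "L": ("Leu", "Leucine"),       "K": ("Lys", "Lysine"),
    "M": ("Met", "Methionine"),    "F": ("Phe", "Phenylalanine"),
    "P": ("Pro", "Proline"),       "S": ("Ser", "Serine"),
    "T": ("Thr", "Threonine"),     "W": ("Trp", "Tryptophan"),
    "Y": ("Tyr", "Tyrosine"),      "V": ("Val", "Valine"),
}


def describe(one_letter: str) -> str:
    """Spell out an amino acid from its one-letter code."""
    code = one_letter.upper()
    three_letter, name = AMINO_ACID_CODES[code]
    return f"{code} = {three_letter} = {name}"


print(describe("w"))  # W = Trp = Tryptophan
```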

And to start the program just run the executable JAR (AALA.jar) in the dist folder. Download it here:

http://www.electronicocean.com/bioinformatics/aminoacids.zip

Thursday, 25 December 2008

My PC's Setup

First, I'll describe the hardware of my local desktop system:

AMD 64 dual core 6000+ (3GHz)
1.5GB DDR2 RAM
250GB SATA HDD + 20GB IDE slave HDD
ATI Radeon X1200 onboard Gfx card
Windows XP Pro on master

Software on local desktop:

WAMP with Apache 2, PHP 5 and MySQL 5 - see how it is set up properly here.
Netbeans 6.5 with Java 1.6.0_11. Get it here.
Linux Ubuntu 8.04 running on Virtual Machine (Sun Virtualbox)

I also have a webserver at my disposal (ish) but that's my business ;)

Anyway, the reason I'm running Linux on a VM rather than on a separate disk or partition is that the installer tries to install on the IDE drive and ignores the SATA master. I use Linux for my Perl scripting since Perl is installed on Linux by default - until I have a good reason to install Perl on Windows I will continue my Perl hacking on Linux.

Saturday, 20 December 2008

Primer 3 - The Central Dogma

In molecular biology, the way DNA is read and proteins are made is called "The Central Dogma". The usual pathway is a two-step procedure:

DNA --> RNA --> Protein

Step 1: Transcription

So you start off with DNA floating inside the nucleus in the middle of the cell. An enzyme called RNA polymerase comes to the double-stranded DNA, unwinds it and breaks the weak hydrogen bonds between the bases of each strand. It basically unzips it. This process is similar to DNA replication, except that a different enzyme is involved there - DNA polymerase - which binds free-floating complementary DNA nucleotides (A/T/C/G) to each exposed strand, using it as a template. By convention the top strand of the double helix is written in the 5' - 3' direction, left to right. By replication the cell doubles its DNA before it divides so that the new cell gets a copy of the entire genome.

RNA polymerase partially unwinds and unzips the DNA and binds complementary RNA nucleotides (A/U/C/G) to the template strand. The enzyme binds at the start of a gene, travels along the strand and detaches when it reaches the end of the gene. A single RNA strand is forged as the enzyme moves along, and when the enzyme detaches, the free RNA strand, now called messenger RNA (mRNA), floats out of the nucleus. Outside the nucleus it eventually finds its way to a protein complex called a ribosome, where step 2, translation, occurs.
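As a toy model of transcription, the sketch below pairs an RNA base with each base of the DNA strand being read. The function name transcribe and the example strand are made up, and strand direction (5'/3') is ignored to keep things simple.

```python
# Which RNA base pairs with each DNA base on the strand being read.
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template: str) -> str:
    """Build the RNA strand one base at a time, position by position."""
    return "".join(DNA_TO_RNA[base] for base in template.upper())


print(transcribe("TACGGCATT"))  # AUGCCGUAA
```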

Step 2: Translation

So the mRNA strand, which is floating in the cytoplasm of the cell, reaches a ribosome. The ribosome is shaped a bit like a grasping hand and the RNA strand goes through it three bases at a time. Special molecules in the cytoplasm called aminoacylated transfer RNA (tRNA) come to the ribosome. Each tRNA molecule has a triplet of bases (the anticodon) at one end and an amino acid attached at the other end. The anticodon triplet on the tRNA binds to the RNA strand in a complementary fashion; the ribosome has room for up to three tRNA molecules at a time, and the amino acids on the other ends of the tRNAs join up together. This process continues for the length of the RNA strand, and the amino acids continue to join up to form a long strand - a polypeptide (protein) strand.
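Here is a rough sketch of what the ribosome is doing, using the standard codon table: read three bases, work out the matching anticodon, add the amino acid, and stop at a stop codon. The function name translate_steps and the example message are made up, and the anticodon is written position by position against the codon.

```python
from itertools import product

# Standard genetic code in U, C, A, G order; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# RNA-to-RNA pairing between the codon (mRNA) and the anticodon (tRNA).
PAIR = {"A": "U", "U": "A", "C": "G", "G": "C"}


def translate_steps(mrna: str) -> list:
    """Return (codon, anticodon, amino acid) for each step until a stop codon."""
    steps = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break  # stop codon: the polypeptide is released
        anticodon = "".join(PAIR[base] for base in codon)
        steps.append((codon, anticodon, amino_acid))
    return steps


for codon, anticodon, amino_acid in translate_steps("AUGCCGUAA"):
    print(codon, anticodon, amino_acid)
```

For this message the steps are AUG/UAC giving M (methionine) and CCG/GGC giving P (proline), and then UAA stops the chain.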

Triplet codons

Each triplet of bases (called a codon) of the DNA/RNA strand represents an amino acid. For example, AUG codes for the amino acid methionine. AUG (ATG in DNA) is the start codon. There are 64 possible combinations that can be made from a triplet of bases since there are 4 different bases (4 * 4 * 4 = 64). Yet there are only 20 types of amino acid, so some triplets code for the same amino acid: UCU, UCC, UCA and UCG all code for the amino acid serine.
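The counting can be checked by building the whole standard codon table and grouping the codons by amino acid. This is just a sketch for checking the numbers; the variable names are my own.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code in U, C, A, G order; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Group the codons by the amino acid they code for.
codons_for = defaultdict(list)
for codon, amino_acid in CODON_TABLE.items():
    codons_for[amino_acid].append(codon)

print(len(CODON_TABLE))                # 64 codons in total
print(len(set(AMINO_ACIDS) - {"*"}))   # 20 amino acids
print(codons_for["M"])                 # ['AUG'] - methionine has a single codon
print(sorted(codons_for["S"]))         # every codon for serine
```

In the standard table serine turns out to have six codons: the four starting with UC, plus AGU and AGC.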

Promoter Sequences

Consider the start codon ATG in DNA. There is a 1/64 probability of finding it by chance alone at any position along the DNA strand. It can't be that every time RNA polymerase finds an ATG codon it starts transcription. No, in order to identify which occurrence of the ATG codon signifies the beginning of a gene a special sign needs to be present - this is the "Promoter Sequence". Only when a promoter sequence is found upstream of the start codon will the transcription of that length of DNA occur. In prokaryotes the promoter has two short elements about 10 and 35 bases upstream of the point where transcription starts, while in eukaryotes it is far more complex, but in approximately 20% of cases a TATA code is found in it. There are various promoters and there can be very many bases upstream of a gene before a promoter is found.
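To make the idea concrete, here is a toy search that finds every ATG in a piece of DNA and keeps only those with a promoter-like box shortly upstream. It is a big simplification: the box TATAAT is the classic consensus for the prokaryotic -10 element, but the 20-base window, the function names and the example sequence are all assumptions made up for this sketch, and real promoter finding is far more involved.

```python
def find_start_codons(dna: str) -> list:
    """Positions of every ATG in the sequence, in any reading frame."""
    return [i for i in range(len(dna) - 2) if dna[i:i + 3] == "ATG"]


def has_upstream_box(dna: str, start: int, box: str = "TATAAT", window: int = 20) -> bool:
    """True if `box` occurs within `window` bases before position `start`."""
    upstream = dna[max(0, start - window):start]
    return box in upstream


# Made-up sequence: one ATG with nothing before it, one ATG after a TATAAT box.
dna = "CCATGCC" + "GGGGGGGG" + "TATAAT" + "CCCCCCC" + "ATGAAATAA"

starts = find_start_codons(dna)
promoted = [pos for pos in starts if has_upstream_box(dna, pos)]
print(starts)    # [2, 28]
print(promoted)  # [28]
```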

Tuesday, 16 December 2008

Primer 2: Genomes

A genome is the collective genetic material, the DNA or RNA, of an organism. It is all the nucleotide bases found in DNA/RNA - A(T/U)CG - in the order found in the organism. All organisms on earth use DNA/RNA and it can be used between species - it's all compatible - you could stick mouse DNA into a fly's genome or even a human's genome and it would do something at least. Almost all cells in the human body have the full set of chromosomes in them - the whole genome.

Prokaryotic-celled species such as bacteria have a single long strand of DNA floating freely in the cytoplasm of the cell. Viruses have an RNA or DNA genome inside their protein shell (the capsid, or head).

Eukaryotic-celled species such as mammals have numerous chromosomes with DNA wound up in them. Not all the DNA is useful - in humans only a few percent of the 3 billion bases codes for proteins! The coding parts of genes are called exons. The rest of the DNA is often called "junk DNA", but some people think it still plays a role in the cell, though they aren't exactly sure what. Non-coding stretches that sit inside genes are called introns.

The first genome ever to be sequenced was that of a bacteriophage, a virus that infects bacteria. It had an RNA genome of 3569 bases. Viruses are actually not living organisms but instead are structures made of proteins in which some genetic material is stored in the head. When the virus manages to latch on to a cell, it creates a pore in the cell membrane and inserts its genome, which the host cell unwittingly takes and replicates to create more viruses. Viruses tend to cause cell death, and the cell can explode or bud off packages to release the virus copies. No-one is really sure where viruses came from or how they came to be.

The first genome of a living organism to be sequenced was that of the bacterium Haemophilus influenzae in 1995, done using the shotgun method, as a proof of concept. It had 1.83 million bases (Mb) on one chromosome!

The shotgun method involves breaking up many copies of the whole genome into random, overlapping pieces (for example by physically shearing the DNA; restriction enzymes, which cut DNA when they find a specific base sequence, can also be used), then reading the pieces and sticking them back together like a puzzle. The reading of the bases and the puzzle solving is done by computers. For example, say we cut the genome and we get lots of strands that overlap each other - we line them up like so:

ATCCCGGATGCTCTG
-------ATGCTCTGAAAAAATTCCCCCC

The hyphens represent spaces, to fit the overlap, in order to align the sequences. The first fragment shares some of the code from the second fragment so we assume they come from the same place and so in reality the original DNA strand contains the code ATCCCGGATGCTCTGAAAAAATTCCCCCC. The computer does this lots of times with lots of fragments and eventually everything lines up and it publishes the final sequence.
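The overlap step can be sketched in a few lines: slide the second fragment along until its start matches the end of the first, then glue them together. This is only a toy version of what real assemblers do (they cope with read errors, repeats and millions of fragments), and the function name merge_fragments and the minimum overlap of 3 are assumptions.

```python
def merge_fragments(left: str, right: str, min_overlap: int = 3):
    """Join two fragments where the end of `left` matches the start of `right`.

    Tries the longest possible overlap first. Returns None if the fragments
    do not overlap by at least `min_overlap` bases.
    """
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return None


print(merge_fragments("ATCCCGGATGCTCTG", "ATGCTCTGAAAAAATTCCCCCC"))
# ATCCCGGATGCTCTGAAAAAATTCCCCCC
```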

c.f. Eric D. Green (2001); STRATEGIES FOR THE SYSTEMATIC SEQUENCING OF COMPLEX GENOMES; Nature Reviews: Genetics, Aug 2001 Vol2. p573 (here)

What's amazing is that the process is repeated over and over to fill any inconsistencies and gaps in the data and to validate the original sequencing. Some parts of the human genome were sequenced up to 12 times. When a draft sequence is published it is sometimes possible to find an N (or X) in place of a nucleotide base in the middle of a sequence - this signifies that the particular base was either unclear or the data was not collected successfully. It is a long and laborious endeavour but well worth it, as it has led to the genomics revolution which we are currently at the start of - you will soon see so many discoveries thanks to the availability of the genomes of humans and model organisms such as mice, zebrafish, nematode worms, Drosophila flies and the Arabidopsis plant.

The first draft of the human genome was announced in 2000 and published in early 2001, and the finished sequence followed in 2003. It was discovered that the human genome had between 25000 and 30000 genes, which was far fewer than anyone had predicted, since even simpler organisms like the rice plant have about 30000 genes (but a smaller genome). Finally, take a look at the table of species that have been sequenced and look at how huge the genome of the single-celled Amoeba is!!! (here)

A good resource, some slides, although slightly inaccurate, can be seen here.

Monday, 15 December 2008

Primer 1 - Cell Biology and Macromolecules

So, from memory (mostly), these are the things you need to know:

Macromolecule - a whole load of the same type of small molecule (a monomer) joined up together to form one huge molecule - lots of things in biology are this way.

Cells - The basic building-block of an organism (a living creature). Your body contains between 50 and 100 Trillion cells (src). Some organisms are a single cell - like bacteria and the amoeba. Cells start off from unspecialised cells called "stem" cells and then they differentiate into specialised cells that have a specific function like White Blood Cells or Liver cells or whatever.

Proteins - The basic building block of cells and most of what is inside your body. Proteins are long chains of small molecules called Amino Acids, all connected up. Proteins can be globular - they curl up into a ball-like lump - or they can be structural - they have an organised shape which is essential to their function. Many proteins are enzymes and quite a number of them only function in pairs (dimers), triplets (trimers) or quadruplets (tetramers).

Enzymes - Proteins that have a special function: to reduce the energy required for a reaction to occur. They have an active site, which is like a gap on their surface, which docks onto a target that more or less fits its shape. When on the target they reduce the strength of a bond and then some other molecule, usually water, shoots in and breaks the bond. Now the target molecule is in two halves. The breaking molecule hit the target by chance alone - there was nothing, no special force, to draw it in to carry out the reaction. Enzymes are pH and heat sensitive.

Amino Acids - AAs are special molecules which are the building blocks of proteins. There are 20 different types of Amino Acids - you MUST learn how to draw the basic structure of an amino acid to be a biologist or a bioinformatician and it is highly recommended you learn them all off by heart - I'll help you do this in a later post. Amino Acids are called 'acids' due to their carboxylic acid group - CO₂H. But they are really amphoteric - they have both an acidic group and a basic (alkaline) amino group - NH₂. In the presence of a particular enzyme the amino group of one amino acid hooks up to the acid group of another to form a peptide bond (CO-NH) in a condensation reaction (H₂O is released). Lots of these reactions form a chain - a protein.

DNA - DeoxyriboNucleic Acid. Although you don't need to know how to draw its chemical structure it's fun to learn this off by heart. DNA is found in the nucleus of a cell. It carries all your genetic information (your genes) - the blueprint of how to make proteins - how to build a whole organism. Almost all cells in your body have a full set of DNA. DNA has a very specific shape - a double helix. If you look at a picture of DNA (here) there is a small gap between the backbones and a bigger gap. These are called the minor and major grooves. DNA is wound into bundles called Chromosomes inside the nucleus. DNA has two strands that complement each other.

Chromosomes - Humans have 23 unique chromosomes. Different animals have varying numbers of chromosomes. A chromosome is made of proteins called histones with the DNA thread wound around them like string on a reel. Most human cells have 23 pairs of chromosomes (46) - the two chromosomes in a pair carry the same genes, one copy from each parent. The first 22 chromosomes of the 23 individual chromosomes are numbered 1 to 22 while the other two are called X and/or Y. X and Y look like the letters X and Y hence their name and they are "sex chromosomes" - if you have an XX pair in your body you are female and if you have an XY pair you are male. The rest of the chromosomes are called "autosomes". The XY chromosomes form an odd couple - an unusual pairing which will be talked about in a future post.

Nucleotides - these are the building blocks of DNA. They are made up of a phosphate group, a pentose sugar and a base. The backbone is formed by the phosphate and the sugar. The important part is the base. There are 4 types of nucleotide bases: Adenine, Thymine, Cytosine, Guanine (A, T, C, G). A and G are Purines and T and C are Pyrimidines (don't ask). The bases point inwards from the backbone and they pair up with the other side specifically A-T and C-G. If you have one side having the code ATTCCGGA the other side will have TAAGGCCT.

RNA - RiboNucleic Acid. A lot like DNA except that its sugar has one extra oxygen (DNA is the "deoxy" version) and it is usually a single strand rather than a double helix. RNA forms hydrogen bonds with the DNA bases in a complementary fashion except that instead of Thymine it has another base called Uracil - so A pairs up with U. DNA is used as a template in a copying process called transcription, which produces RNA carrying the code, now called mRNA (messenger RNA). mRNA is used to make proteins in a process called translation which involves another type of RNA called tRNA (transfer RNA). More about this in another post.
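The base-pairing rule from the Nucleotides entry above is easy to play with in code. This small sketch (function names are my own) builds the other side of a DNA strand position by position and checks that every pair joins one purine with one pyrimidine.

```python
# DNA base-pairing rules: A pairs with T, C pairs with G.
DNA_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}
PURINES = {"A", "G"}  # T and C are the pyrimidines


def complement(strand: str) -> str:
    """Return the base on the opposite strand for each position."""
    return "".join(DNA_PAIR[base] for base in strand.upper())


def is_purine(base: str) -> bool:
    return base.upper() in PURINES


print(complement("ATTCCGGA"))  # TAAGGCCT

# Every pair is one purine plus one pyrimidine.
print(all(is_purine(a) != is_purine(b) for a, b in DNA_PAIR.items()))  # True
```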

For more information read a textbook or wikipedia or just google it :)

Sunday, 14 December 2008

Bioinfornetics - Hello World

Hi! I'm Ahmad, nice to meet you. I live in the UK and have a Masters degree in Bioinformatics & Computational Biology and an undergraduate degree in Medical Sciences. It has been 3 months since I graduated (Sep 2008) yet I'm finding it really difficult finding a job. So while I'm waiting I will try to exercise my skills and keep my knowledge fresh and explore different parts of genetics using bioinformatics. Recently my interest in genetics has been ignited and I'm reading into it and learning lots of new things.

The cool thing is that you can see what I do, how I do it and the source code (usually Java, PHP and Perl), which I'm making public domain so you can copy, modify and do anything you want with it. If you find anything interesting please comment. I'm always seeking better ways to do things so if you have an easier or better way or you want to make a correction then please tell me about it.

Oh, and if you want to contact me my email is a_retha [AT] ymail.com