Sunday 28 December 2008

Software: Amino Acid Learning Aid (AALA)

I quickly wrote a little Java program that will help people learn their Amino Acids and it's called Amino Acid Learning Aid (AALA). It's pretty crude but it should be helpful. Remember, everything I write is Public Domain which means you can do what you want with it and I don't own it.

Pretty much everything you need to know about Amino Acids is laid out on this wiki page, which is also where I got the pictures: http://en.wikipedia.org/wiki/List_of_standard_amino_acids

Knowing the Amino Acids and their codes is to biologists what knowing the Periodic Table is to chemists but far more important ;p.
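To give an idea of what "the codes" means, here is a minimal Python sketch of a lookup table for the 20 standard amino acids with their three-letter and one-letter codes. It is only an illustration and not the AALA source; the names AMINO_ACIDS, one_letter and name_from_one_letter are made up for this example.

```python
# The 20 standard amino acids: name -> (three-letter code, one-letter code).
# Illustrative lookup only; this is not the AALA program itself.
AMINO_ACIDS = {
    "Alanine": ("Ala", "A"),
    "Arginine": ("Arg", "R"),
    "Asparagine": ("Asn", "N"),
    "Aspartic acid": ("Asp", "D"),
    "Cysteine": ("Cys", "C"),
    "Glutamic acid": ("Glu", "E"),
    "Glutamine": ("Gln", "Q"),
    "Glycine": ("Gly", "G"),
    "Histidine": ("His", "H"),
    "Isoleucine": ("Ile", "I"),
    "Leucine": ("Leu", "L"),
    "Lysine": ("Lys", "K"),
    "Methionine": ("Met", "M"),
    "Phenylalanine": ("Phe", "F"),
    "Proline": ("Pro", "P"),
    "Serine": ("Ser", "S"),
    "Threonine": ("Thr", "T"),
    "Tryptophan": ("Trp", "W"),
    "Tyrosine": ("Tyr", "Y"),
    "Valine": ("Val", "V"),
}


def one_letter(name):
    """Return the one-letter code for an amino acid name."""
    return AMINO_ACIDS[name][1]


def name_from_one_letter(letter):
    """Return the amino acid name for a one-letter code, or None if unknown."""
    for name, (_, code) in AMINO_ACIDS.items():
        if code == letter:
            return name
    return None
```

A quiz program could pick a random name from the table and ask for its code.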


And to start the program just run the executable JAR (AALA.jar) in the dist folder. Download it here:
http://www.electronicocean.com/bioinformatics/aminoacids.zip

Thursday 25 December 2008

My PC's Setup

First, I'll describe the hardware of my local desktop system:
  • AMD 64 dual core 6000+ (3GHz)
  • 1.5GB DDR2 RAM
  • 250GB SATA HDD + 20GB IDE slave HDD
  • ATI Radeon X1200 onboard Gfx card
  • Windows XP Pro on master
Software on local desktop:
  • WAMP with Apache 2, PHP 5 and MySQL 5 - see how it is set up properly here.
  • Netbeans 6.5 with Java 1.6.0_11. Get it here.
  • Linux Ubuntu 8.04 running on Virtual Machine (Sun Virtualbox)
I also have a webserver at my disposal (ish) but that's my business ;)

Anyway, the reason I'm running Linux on a VM rather than on a separate disk or partition is that the installer tries to install on the IDE drive and ignores the SATA master. I use Linux for my Perl scripting since Perl is installed on Linux by default - until I have a good reason to install Perl on Windows I will continue my Perl hacking on Linux.

Saturday 20 December 2008

Primer 3 - The Central Dogma

In molecular biology, the process by which DNA is read and proteins are made is called "The Central Dogma". The usual pathway is a two-step procedure:

DNA --> RNA --> Protein

Step 1: Transcription

So you start off with DNA sitting inside the nucleus in the middle of the cell. An enzyme called RNA polymerase comes to the double-stranded DNA, unwinds it and breaks the weak hydrogen bonds between the bases of each strand. It basically unzips it. This is similar to DNA replication, except that a different enzyme is involved there - DNA polymerase - which binds free-floating complementary DNA nucleotides (A/T/C/G) to each unzipped strand, using it as a template. By this process the cell doubles its DNA before cell division so that the new cell gets a copy of the entire genome.

RNA polymerase only partially unwinds and unzips the DNA, and binds complementary RNA nucleotides (A/U/C/G) to the template (antisense) strand. The RNA it makes therefore carries the same code as the other strand, the sense (coding) strand, which by convention is written on top in the 5' - 3', left to right direction, except with U in place of T. The enzyme binds at a promoter sequence just upstream of the gene, travels along the strand and detaches when it reaches a termination signal (start and stop codons are read later, during translation). A single RNA strand is built as the enzyme moves along, and when the enzyme detaches, the free RNA strand, now called messenger RNA (mRNA), moves out of the nucleus. Outside the nucleus it eventually finds its way to a complex of protein and RNA called a ribosome, where step 2, translation, occurs.
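Transcription can be sketched in a few lines of Python. This is a toy model only (plain strings, no promoters or terminators), and the function names transcribe_template and transcribe_coding are made up for this example. It shows that pairing RNA bases against the template strand gives the same result as copying the coding strand with U in place of T.

```python
# DNA base on the template strand -> RNA base that pairs with it.
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe_template(template_3to5):
    """Build mRNA (5'->3') by pairing RNA bases against a template strand
    that is written in the 3'->5' direction."""
    return "".join(DNA_TO_RNA[base] for base in template_3to5)


def transcribe_coding(coding_5to3):
    """Shortcut: the mRNA has the same code as the coding (sense) strand,
    with U in place of T."""
    return coding_5to3.replace("T", "U")


coding = "ATGGCC"    # sense strand, 5'->3'
template = "TACCGG"  # its complementary strand, written 3'->5'
print(transcribe_template(template))  # AUGGCC
print(transcribe_coding(coding))      # AUGGCC
```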

Step 2: Translation

So the mRNA strand, now floating in the cytoplasm of the cell, reaches a ribosome. The ribosome is shaped a bit like a grasping hand and the mRNA strand goes through it three bases at a time. Special molecules called aminoacyl transfer RNAs (tRNAs), floating free in the cytoplasm, come to the ribosome. Each tRNA has a triplet of bases (the anticodon) at one end and an amino acid attached at the other end. The anticodon on the tRNA binds to the matching codon on the mRNA in a complementary fashion. The ribosome has three docking sites for tRNAs, and the amino acids on the docked tRNAs are joined together. This continues for the length of the mRNA, so the amino acids keep joining up to form a long chain - a polypeptide (protein).

Triplet codons

Each triplet of bases (called a codon) of the DNA/RNA strand represents an amino acid. For example, AUG codes for the amino acid Methionine. AUG (ATG in DNA) is also the start codon. There are 64 possible combinations that can be made from a triplet of bases since there are 4 different bases (4 * 4 * 4 = 64). Yet there are only 20 types of amino acids (plus three stop codons, which code for no amino acid), so some triplets must code for the same amino acid: UCU, UCC, UCA and UCG all code for the amino acid Serine.
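The 64-codon arithmetic and the redundancy of the code can be checked with a short Python sketch. It builds the standard genetic code from a compact 64-letter string (one-letter amino acid codes, "*" for stop, with codons ordered over the bases U, C, A, G) and translates a toy mRNA. The names CODON_TABLE and translate are just for this example, and the translation is simplified: it reads from the first base and stops at the first stop codon.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every triplet (UUU, UUC, UUA, ...) to its amino acid.
CODON_TABLE = {
    "".join(triplet): aa for triplet, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(mrna):
    """Translate an mRNA string codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


serine_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "S")
print(len(CODON_TABLE))        # 64
print(serine_codons)           # ['AGC', 'AGU', 'UCA', 'UCC', 'UCG', 'UCU']
print(translate("AUGUCUUAA"))  # MS
```

Serine turns out to have six codons in total: the four starting with UC, plus AGU and AGC.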

Promoter Sequences

Consider the start codon ATG in DNA. By chance alone, any given triplet along the DNA strand has a 1/64 probability of being ATG. It can't be that transcription starts every time RNA polymerase comes across an ATG. No, in order to identify which occurrence of ATG signifies the beginning of a gene a special sign needs to be present - this is the "Promoter Sequence". Only when a promoter sequence is found upstream of the start codon will the transcription of that length of DNA occur. In prokaryotes the promoter has two key elements about 10 and 35 bases upstream (-10 and -35) of the point where transcription starts, while in eukaryotes it is far more complex, but in approximately 20% of cases a TATA sequence is found in it. There are various promoters and there can be very many bases upstream of a gene before a promoter is found.
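To see why ATG alone cannot mark a gene, here is a small Python sketch that finds every ATG in a toy DNA string and estimates how many would turn up by chance. It assumes all four bases are equally likely, which real genomes only approximate; find_codon and expected_by_chance are names made up for this example.

```python
def find_codon(dna, codon="ATG"):
    """Return the 0-based start positions of every occurrence of codon,
    scanning all overlapping windows."""
    size = len(codon)
    return [i for i in range(len(dna) - size + 1) if dna[i:i + size] == codon]


def expected_by_chance(length, codon_len=3):
    """Expected number of matches in a random sequence of the given length,
    assuming each base is equally likely (a simplifying assumption)."""
    windows = max(length - codon_len + 1, 0)
    return windows / 4 ** codon_len


print(find_codon("CCATGAAATGT"))  # [2, 7]
print(expected_by_chance(66))     # 1.0
```

On these assumptions a random stretch of a million bases would contain over 15000 ATGs, far more than there are genes in it.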

Tuesday 16 December 2008

Primer 2: Genomes

A genome is the collective genetic material, the DNA or RNA, of an organism. It is all the nucleotide bases found in the DNA/RNA - A, T/U, C and G - in the order found in the organism. All organisms on earth use DNA/RNA, and it can be used between species - it's all compatible - so you could stick mouse DNA into a fly's genome or even a human's genome and it would at least do something. Almost all cells in the human body have the full set of chromosomes in them - the whole genome.

Prokaryotic-celled species such as bacteria have a single long strand of DNA floating freely in the cytoplasm of the cell. Viruses have an RNA or DNA genome packed inside their protein shell (the capsid, or head).

Eukaryotic-celled species such as mammals have numerous chromosomes with DNA wound up in them. Not all the DNA is useful - in humans only about 1-2% of the 3 billion bases code for proteins! The coding parts of a gene are called exons. The rest of the DNA is often called "junk DNA", but some people think it still plays a role in the cell, though they aren't exactly sure what. The non-coding stretches that sit inside genes, between the exons, are called introns.

The first genome ever to be sequenced was that of a bacteriophage, a virus that infects bacteria. It had an RNA genome of 3569 bases. Viruses are actually not living organisms but instead are a structure made of proteins in which some genetic material is stored in the head. When the virus manages to latch on to a cell, it creates a pore in the cell membrane and inserts its genome, which the host cell unwittingly takes and replicates to create more viruses. Viruses tend to cause cell death, and the cell can burst or bud off packages to release the virus copies. No-one is really sure where viruses came from or how they came to be.

The first genome of a living organism to be sequenced was that of the bacterium Haemophilus influenzae in 1995, done using the Shotgun method, as a proof of concept. It had 1.83 Million bases (Mb) on one chromosome!

The shotgun method involves cutting up the whole genome into lots of random, overlapping pieces (by physically shearing the DNA, or with a restriction enzyme - an enzyme that cuts DNA when it finds a specific base sequence), then reading the pieces and sticking them back together like a puzzle. The reading of the bases and the puzzle solving is done by computers. For example, say we cut the genome and we get lots of fragments that overlap each other - we line them up like so:

ATCCCGGATGCTCTG
-------ATGCTCTGAAAAAATTCCCCCC

The hyphens represent spaces, to fit the overlap, in order to align the sequences. The first fragment shares some of the code from the second fragment so we assume they come from the same place and so in reality the original DNA strand contains the code ATCCCGGATGCTCTGAAAAAATTCCCCCC. The computer does this lots of times with lots of fragments and eventually everything lines up and it publishes the final sequence.
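The puzzle-solving step for one pair of fragments can be sketched in Python as below. This is a toy version that only handles exact overlaps between the end of one fragment and the start of the next; real assemblers also cope with sequencing errors, reverse strands and millions of fragments. The function name merge_overlap and the min_overlap parameter are made up for this example.

```python
def merge_overlap(left, right, min_overlap=3):
    """Join two fragments where the end of `left` exactly matches the start
    of `right`. Tries the longest overlap first and returns the merged
    sequence, or None if no overlap of at least `min_overlap` bases exists."""
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return None


merged = merge_overlap("ATCCCGGATGCTCTG", "ATGCTCTGAAAAAATTCCCCCC")
print(merged)  # ATCCCGGATGCTCTGAAAAAATTCCCCCC
```

The minimum overlap is there because very short matches happen by chance all the time and would glue unrelated fragments together.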

c.f. Eric D. Green (2001); STRATEGIES FOR THE SYSTEMATIC SEQUENCING OF COMPLEX GENOMES; Nature Reviews: Genetics, Aug 2001 Vol2. p573 (here)

What's amazing is that the process is repeated over and over to fill any inconsistencies and gaps in the data and to validate the original sequencing. Some parts of the human genome were sequenced up to 12 times. When a draft sequence is published it is sometimes possible to find an N in place of a nucleotide base in the middle of a sequence - this signifies that the base was either unclear or the data was not collected successfully. It is a long and laborious endeavour but well worth it, as it has led to the genomics revolution which we are currently at the start of - you will soon see so many discoveries thanks to the availability of the genomes of humans and model organisms such as mice, zebrafish, nematode worms, Drosophila flies and the Arabidopsis plant.

The first draft of the human genome was announced in 2000 and published in early 2001, and the finished sequence followed in 2003. It was discovered that the human genome had between 25000 and 30000 genes, which was far fewer than anyone had predicted, since even simpler organisms like the rice plant have about 30000 genes (but a smaller genome). Finally, take a look at the table of species that have been sequenced and look at how huge the genome of the single-celled Amoeba is!!! (here)

A good resource, some slides, although slightly inaccurate, can be seen here.

Monday 15 December 2008

Primer 1 - Cell Biology and Macromolecules

So, from memory (mostly), these are the things you need to know:

Macromolecule - a whole load of small molecules of the same type (monomers) joined up together to form one huge molecule - lots of things in biology are this way.

Cells - The basic building-block of an organism (a living creature). Your body contains between 50 and 100 Trillion cells (src). Some organisms are a single cell - like bacteria and the amoeba. Cells start off from unspecialised cells called "stem" cells and then they differentiate into specialised cells that have a specific function like White Blood Cells or Liver cells or whatever.

Proteins - The basic building block of cells and most of what is inside your body. Proteins are long chains of small molecules called Amino Acids all connected up. Proteins can be globular - they curl up into a ball-like lump - or they can be structural (fibrous) - they have an organised, elongated shape which is essential to their function. Many proteins are enzymes and quite a number of them only function in pairs (dimers), triplets (trimers) or quadruplets (tetramers).

Enzymes - Proteins that have a special function: to reduce the energy required for a reaction to occur. They have an active site, which is like a gap on their surface, which docks onto a target that more or less fits its shape. When on the target they reduce the strength of a bond and then some other molecule, usually water, shoots in and breaks the bond. Now the target molecule is in two halves. The breaking molecule hit the target by chance alone - there was nothing, no special force, to draw it in to carry out the reaction. Enzymes are pH and heat sensitive.

Amino Acids - AAs are the special molecules that are the building blocks of proteins. There are 20 different types of Amino Acids - you MUST learn how to draw the basic structure of an amino acid to be a biologist or a bioinformatician and it is highly recommended you learn them all off by heart - I'll help you do this in a later post. Amino Acids are called 'acids' due to their carboxylic acid group - CO2H. But they are really amphoteric - they have both an acidic group and a basic (alkali) group - NH2. In the presence of a particular enzyme the basic group of one amino acid hooks up to the acidic group of the next to form a peptide bond (CO-NH) in a condensation reaction (H2O is released). Lots of these reactions form a chain - a protein.

DNA - DeoxyriboNucleic Acid. Although you don't need to know how to draw its chemical structure it's fun to learn this off by heart. DNA is found in the nucleus of a cell. It carries all your genetic information (your genes) - the blueprint of how to make proteins - how to build a whole organism. Almost all cells in your body have a full set of DNA. DNA has a very specific shape - a double helix. If you look at a picture of DNA (here) there is a smaller gap and a bigger gap between the turns of the backbone. These are called the minor and major grooves. DNA is wound into bundles called Chromosomes inside the nucleus. DNA has two strands that complement each other.

Chromosomes - Different animals have varying numbers of chromosomes. A chromosome is made of proteins called histones with the DNA thread wound around them like string on a reel. Most human cells have 23 pairs of chromosomes (46 in total) - one chromosome of each pair comes from each parent, and the two carry the same genes in the same order, though not always identical versions of them. 22 of the pairs are numbered 1 to 22 and are called "autosomes". The 23rd pair is made up of the "sex chromosomes", X and/or Y - if you have an XX pair in your body you are female and if you have an XY pair you are male. The XY chromosomes form an odd couple - an unusual pairing which will be talked about in a future post.

Nucleotides - these are the building blocks of DNA. They are made up of a phosphate group, a pentose sugar and a base. The backbone is formed by the phosphate and the sugar. The important part is the base. There are 4 types of nucleotide bases: Adenine, Thymine, Cytosine and Guanine (A, T, C, G). A and G are Purines and T and C are Pyrimidines (don't ask). The bases point inwards from the backbone and they pair up with the other side specifically: A-T and C-G. If one side has the code ATTCCGGA the other side will have TAAGGCCT.
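The pairing rule is easy to turn into code. Here is a minimal Python sketch; the names PAIRS, complement and reverse_complement are made up for this example. The reverse complement is included because the two strands run in opposite directions, so the partner strand read in its own 5' - 3' direction is the complement written backwards.

```python
# Watson-Crick base pairs in DNA.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the base that pairs with each position, in the same order."""
    return "".join(PAIRS[base] for base in strand)


def reverse_complement(strand):
    """Return the partner strand read in its own 5'->3' direction."""
    return complement(strand)[::-1]


print(complement("ATTCCGGA"))          # TAAGGCCT
print(reverse_complement("ATTCCGGA"))  # TCCGGAAT
```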

RNA - RiboNucleic Acid. A lot like DNA except that its sugar (ribose) has one extra oxygen (a hydroxyl group) that DNA's sugar (deoxyribose) lacks, and RNA is usually single-stranded rather than a double helix. RNA forms hydrogen bonds with the DNA bases in a complementary fashion except that instead of Thymine it has another base called Uracil - so A pairs up with U. DNA is used as a template in a copying process called transcription, which produces an RNA copy of the code called mRNA (messenger RNA). mRNA is used to make proteins in a process called translation which involves another type of RNA called tRNA (transfer RNA). More about this in another post.

For more information read a textbook or wikipedia or just google it :)

Sunday 14 December 2008

Bioinfornetics - Hello World

Hi! I'm Ahmad, nice to meet you. I live in the UK and have a Masters degree in Bioinformatics & Computational Biology and an undergraduate degree in Medical Sciences. It has been 3 months since I graduated (Sep 2008) yet I'm finding it really difficult to find a job. So while I'm waiting I will try to exercise my skills, keep my knowledge fresh and explore different parts of genetics using bioinformatics. Recently my interest in genetics has been ignited and I'm reading into it and learning lots of new things.

The cool thing is that you can see what I do, how I do it and the source code (usually Java, PHP and Perl), which I'm making public domain so you can copy, modify and do anything you want with it. If you find anything interesting please comment. I'm always seeking better ways to do things so if you have an easier or better way or you want to make a correction then please tell me about it.

Oh, and if you want to contact me my email is a_retha [AT] ymail.com