Friday 1 December 2017

Getting DNA out of a FASTA file by position or chromosome id

A common task in bioinformatics is reading a FASTA file to get a sequence out of it. You might want to grab a whole chromosome from a genome (multi-)FASTA file, or you might want to grab a stretch of DNA from a single chromosome by providing the start and end positions of a range. In both cases, I have you covered:

To extract a chromosome from a genome file, such as the 900MB GRCh37.gz file provided by the 1000 Genomes Project, which contains all the human chromosomes in one file, I created a program that writes individual chromosomes out to a FASTA file. You don't even need to decompress the original .gz file; the program reads it as is. Check it out here: https://github.com/webmasterar/extractChromosome
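To give a feel for the general idea, here is a minimal Python sketch. It is not the code from the repository above, just an illustration of the approach: stream the (optionally gzipped) multi-FASTA file line by line and copy only the record whose header ID matches. The function name extract_chromosome and its parameters are hypothetical, and I am assuming the chromosome ID is the first whitespace-delimited token after the ">" on the header line.

```python
import gzip


def extract_chromosome(genome_path, chrom_id, out_path):
    """Copy the record whose header ID equals chrom_id to out_path.

    Returns True if the record was found, False otherwise.
    Assumption: the ID is the first token after '>' on the header line.
    """
    # Read .gz files directly, without decompressing them to disk first.
    opener = gzip.open if genome_path.endswith(".gz") else open
    found = False
    writing = False
    with opener(genome_path, "rt") as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                if found:
                    break  # reached the next record, so we are done
                tokens = line[1:].split()
                writing = bool(tokens) and tokens[0] == chrom_id
                found = writing
            if writing:
                dst.write(line)
    return found
```

A call such as extract_chromosome("GRCh37.gz", "21", "chr21.fa") would then write one chromosome to its own file, assuming the headers in your genome file use IDs like "21" (check your file, as some use "chr21" instead).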

For extracting a string of bases from a chromosome stored in a FASTA file just by providing the start and end positions, I created a little Python helper function called getRef() and you can access it here: https://gist.github.com/webmasterar/3a60155d4ddc8595b17fa2c62893dbb0

It is easy to use and takes three arguments: getRef(Chr_File, Start_Pos, End_Pos).
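
For illustration, a function of this shape might look like the sketch below. It is not the code from the gist: fetch_range is a hypothetical name, and I am assuming 1-based, inclusive positions and a FASTA file that holds a single sequence. It walks the file line by line, keeps a running count of bases seen, and slices out only the parts of each line that overlap the requested range.

```python
def fetch_range(fasta_path, start, end):
    """Return bases start..end from a single-record FASTA file.

    Assumption: positions are 1-based and inclusive.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    pieces = []
    offset = 0  # number of bases seen on previous sequence lines
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                continue  # skip the header line
            seq = line.strip()
            line_start = offset + 1        # position of this line's first base
            line_end = offset + len(seq)   # position of this line's last base
            offset = line_end
            if line_end < start:
                continue  # line lies entirely before the range
            if line_start > end:
                break     # line lies entirely after the range
            lo = max(start, line_start) - line_start
            hi = min(end, line_end) - line_start + 1
            pieces.append(seq[lo:hi])
    return "".join(pieces)
```

If the range runs past the end of the sequence, this sketch simply returns whatever bases are available rather than raising an error.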