Sunday 27 November 2016

Using 1000 genomes data files

The 1000 genomes project was established in 2008, a few years after the human genome project (HGP) was completed in 2001. While the HGP was aimed at producing a single reference genome for Homo sapiens, the goal of the 1000 genomes project was to identify the variation present within the species by sequencing 1000s of individuals using fast and cheaper next generation sequencing (NGS) technology. Both projects put their data on line for researchers to download for free.

FASTA Files

The HGP packages the human genome into FASTA format files with one file for every human chromosome (1..22, X, Y) and an accessory FASTA file for the mitochondrial DNA (mtDNA). The FASTA format is a really simple text file format and its suffix tends to end in .fa, .fas or .fasta. The first line of the file consists of a descriptive header beginning with the greater-than symbol '>' followed by the name and/or description and/or unique identifier of the sequence. The next lines after that is the sequence. Typically, 60 characters are present on a line, followed by a new line character '\n', and this pattern goes on till the end of the sequence. Here is an example of part of the Human Cytochrome C gene:

>chromosome:GRCh38:7:25119491:25125961:-1 CTGTCTCTGCCGCAAATGCAGCACCTTCCTCAGTCTTGGGGCGTCCATCAAGCGGAGAGC TGGAATGTGACTGAAAGTAACAGAGTATAGGTGGAACTAGAACAAGGTCACGAGCTGGTG CCCAACTCGTGTAACTCCTGGTTCCAGGCTGTCCCTCTAGGATGCTGTTGGCGCCAGCCG TGGTAGTAGGACAGAGTCTGGGTAAAGGTCGCCTGCTGACAAGATGGTAGCTAGGAGACC TCTAGGTGCGCCAAAATTAGGGCGTCTTTTCC...(etc)

As you can see, the first line describes where it is obtained from. GRCh38 is the latest version of the HGP reference sequence. The gene can be found on chromosome 7 and starts from position 25119491 and ends at position 25125961. -1 indicates it is encoded on the reverse strand. The lines after are the actual genetic code.

The HGP package each assembled whole chromosome's worth of genetic data into individual FASTA files of between 50 and 250MB in size. You can download the gzip-compressed files by going to the ENSEMBL FTP site: ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/

An alternative to downloading each chromosome from the ENSEMBL site is to download an 860MB compressed file containing all the chromosomes in one big FASTA file from the 1000 genomes project. In this blog post I suggest downloading the slightly older GRCh37 version of the HGP, specifically file hs37d5.gz, because it is fully compatible with the VCF files mentioned below. You can obtain the reference genome from here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/

You don't have to decompress that gigantic FASTA file because I wrote a little utility application to extract specific chromosomes from the large gzip file and output the chromosome in its own FASTA file: https://github.com/webmasterar/extractChromosome


After you download a chromosome file from ENSEMBL you can just decompress it using your favourite decompression utility such as winrar or if you are on Linux you can write the command: gzip -d Homo_sapiens.GRCh38.dna_rm.chromosome.21.fa.gz.

FASTA is a simple and flexible format. There are no restrictions on the number of sequences in a file as long as they each start with a header line followed by the sequence. Such files are typically known as multi-FASTA files, but they don't have a specific file suffix to indicate that. The format allows for comment lines starting with a semi-colon ';'. There are no restrictions on line length either and a whole sequence could be present on a single line. Typically, when reading FASTA files in a program, they are read line by line, stripping line-ending characters and spaces, skipping past empty and comment lines starting with ';', and storing the title and sequence in memory as strings ready to be searched or analysed.

VCF Files

VCF files are also text files but are a more complex format than FASTA. VCF files are used to store variants, typically single nucleotide polymorphisms (SNPs) and consist of a set of mandatory and optional tab-separated columns. They can represent more complex variants consisting of insertions and deletions too. VCF files are used together with FASTA reference chromosome files and would be of little use by themselves. Each row of the VCF file is a record of a SNP/variant and points to a position in the reference chromosome where the variation occurs. So together, it is possible to map all variants from multiple donors onto the reference human genome.

VCF files are often distributed as specially compressed files that might resemble gzip files (since they end with .gz) but are actually compressed using BGZIP. The compression utility BGZIP is part of a package known as Tabix, both of which are now part of the htslib library and packaged with SAMTools. If you ever need to use BGZIP you need only download the Tabix package using your Linux distro's package manager, e.g. sudo apt-get install tabix. Tabix contains a library for reading tab-separated file formats and together with BGZIP it allows applications to read the tab-separated VCF format in its compressed state. This is especially important because VCF files are huge. Chromosome 1's compressed VCF file is 1.2GB in size and when it is decompressed it takes up around 70GB of disk space! Imagine having to decompress all the human chromosomes - a time-consuming and disk-consuming exercise that is ultimately unnecessary due to Tabix and its BGZIP compression algorithm.

The VCF format contains rows separated by lines and each line is separated by tabs for each of the columns. The first few lines in the file are metadata and comments starting with a hash '#' character and contain details about the file and it's corresponding reference sequence, and information about indels and other complex variants as well as details about optional column headers. The mandatory headers are:

#CHROM - the chromosome number
POS - the position that the SNP/variant occurs in the reference sequence
ID - just the ID of the SNP/variant
REF - the base that should be found in the reference sequence at this position
ALT - the alternate base or bases that might be present here*
QUAL - the quality of the read
FILTER - only accept this record if it says 'PASS'
INFO - various details about phasing and such

*The ALT field may contain references to complex variants written in angle brackets like so '<ADD:INS:1>'.

Here is an example record from a VCF file:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG00096
10   60523  rs148087467    T     G       100     PASS    AC=0;AF=0.01;AFR_AF=0.06;AMR_AF=0.0028;AN=2; GT:GL 0|0:-0.19,-0.46,-2.28


You can see it is a T to G single nucleotide polymorphism occurring at position 60523 on chromosome 10.

Most VCF files list the variants in ascending order because many software applications that read VCF files expect it to be in order. It is possible to have multiple entries for the same position of a chromosome in VCF files and thus have duplicates. Dealing with this issue will be resolved in a future blog post. Typically, you would not write your own parser to go through the VCF file because it has a complex compression and file format. For a C++ project I used VCFLib.

The 1000 genomes project was performed in three major phases. The first or pilot phase obtained DNA from 179 individuals, the second (Phase 1) from 1092 individuals (including the pilots) and the third phase (Phase 3) from 2504 individuals (including the people from the first two phases). They were sequenced with improving NGS technologies and the data was collected into VCF files for each chromosome. Since this project started in 2008, the reference human genome at the time was GRCh37 and as such the VCF files should use that as it's reference format. This is mentioned in the comments of the phase 3 VCF files. When you download the compressed (.gz) VCF file you will also need to download its accompanying Tabix index file (.tbi) which contains the index that helps Tabix read the file in its compressed state. You DO NOT need to decompress the VCF files. The files for phase 3 can be downloaded from here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/

In conclusion, to use the 1000 genomes files, you will need to download the reference human genome sequence chromosome by chromosome (or as one big 850MB compressed file), the VCF file for the chromosome and its accompanying TBI file. There are is no need to decompress the files.

No comments:

Post a Comment