Sunday, 27 November 2016

Working with huge files in Unix/Linux

The Unix or GNU/Linux command line has a variety of tools that are essential when working with gigantic files. Bioinformaticians often work with files that are many gigabytes in size. Recently, I was working with very large VCF and FASTA files and wanted to check their contents. Both files came gzipped so I needed to decompress them first.

~$ gzip -d file.vcf.gz
~$ gzip -d seq.fasta.gz

Now, what I had at the end of this was an 11GB VCF file and a 250MB FASTA file - way too big to open in any text editor. But I can use a variety of utilities available on the Linux command line to work with them.

To view the first few lines I could use head like so:

~$ head file.vcf

By default, this did not display enough lines to satisfy my requirements. I needed to view the first 100 lines so I used the -n parameter like so:

~$ head -n 100 file.vcf

I can also use head to grab the first few characters if I am reading from a FASTA file with a really long sequence on a single line. If I wanted to grab the first 1000 characters I can write:

~$ head -c 1000 seq.fasta

I can also use the same parameter to grab the last few characters or lines from the file using tail like so:

~$ tail -c 1000 seq.fasta

This is all well and good, but what if I want to inspect only certain lines from a file - say, lines 10-20? I can use awk for that:

~$ awk 'NR==10,NR==20' file.vcf

I can also use awk to skip the first 10 lines like so:

~$ awk 'NR>10' file.vcf 
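As an alternative, sed can print the same range and quit as soon as it reaches the last line, which saves it scanning the rest of a huge file. A small sketch (using a generated 30-line demo file as a stand-in for file.vcf):

```shell
# print lines 10-20 then quit immediately, without reading the rest of the file
seq 1 30 > demo.txt
sed -n '10,20p;20q' demo.txt
```

The `20q` tells sed to stop at line 20 - on an 11GB file that early exit makes a real difference.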

OK. What if I want to count how many characters are in the second line of a FASTA file where, in some cases, a whole sequence is put onto one line? I can use a combination of awk and wc to count the letters:

~$ awk 'NR==2' seq.fasta | wc -c
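One caveat: wc -c counts the trailing newline character too, so the result is one more than the actual sequence length. Stripping newlines with tr first gives the exact count. A sketch on generated data (a hypothetical 10-base sequence):

```shell
# wc -c includes the newline byte; delete newlines with tr for an exact count
printf 'ACGTACGTAC\n' > demo.fasta   # 10 bases plus a newline
wc -c < demo.fasta                   # prints 11 (10 bases + 1 newline)
tr -d '\n' < demo.fasta | wc -c      # prints 10
```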

If you want to count how many lines and how many characters there are in a file, you can use wc like this, passing it both the -l and -c flags:

~$ wc -lc seq.fasta

Now, if I want to navigate through the whole VCF file to see what it contains, I cannot use a text editor. Instead, I use 'less'. ('less' is like 'more', if you are familiar with that tool, but better.) It allows you to jump to specific line numbers, and you can use PgUp, PgDn and the up and down arrows to navigate through the file. Press Q to quit back to your console when you are done.

~$ less file.vcf

Less is OK for some things, but it takes quite some time to page through a really big file. It is sometimes easier to make a smaller sample file from the big one so you can comfortably inspect it. Use the redirection operator '>' to output, say, the first 1000 lines to a new file called smaller.vcf:

~$ head -n 1000 file.vcf > smaller.vcf
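One thing to watch with VCF specifically: a plain head may fill your sample mostly with the '#' metadata lines at the top, or cut off before any records appear. A sketch that keeps every header line plus the first 1000 data records (demo data is made up; the real commands would run on file.vcf):

```shell
# build a tiny stand-in VCF, then sample it: all '#' headers + first 1000 records
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\n1\t100\n1\t200\n1\t300\n' > demo.vcf
{ grep '^#' demo.vcf; grep -v '^#' demo.vcf | head -n 1000; } > smaller.vcf
cat smaller.vcf
```

The downside is that grep reads the file twice, but for making a one-off sample that is usually acceptable.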

OK. There's something I didn't mention before which may save you a lot of time if you are in a rush or really short of disk space: you don't need to spend all that time decompressing the VCF or FASTA files! You can decompress them on the fly on the command line and pipe a little of the decompressed output to one of the command line tools mentioned above. For example, say you just want to grab 1000 lines from file.vcf.gz, which is 11GB when decompressed, but you don't want to wait an hour for it to decompress and you only have 4GB of free disk space to spare. Try this:

~$ gzip -dc file.vcf.gz | head -n 1000 > just1000lines.vcf

The trick is that gzip can decompress to the console using the -c flag. While it decompresses, the output is piped to head, which collects data until 1000 lines have been read; then, instead of printing those lines to the console, we redirect them into a file: just1000lines.vcf. This all happens quickly and efficiently in memory, and only 1000 lines are ever written to disk, saving you both time and disk space.
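The same trick combines with the other tools above. For example, to slice lines 10-20 straight out of a compressed file, pipe the streaming decompression through awk. A sketch using a generated demo file as a stand-in for file.vcf.gz:

```shell
# compress 30 generated lines as a stand-in for the real file.vcf.gz
seq 1 30 | gzip -c > demo.vcf.gz
# decompress to stdout and pull out lines 10-20, never touching the disk
gzip -dc demo.vcf.gz | awk 'NR==10,NR==20'
```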

Hopefully these command line tools - gzip, head, tail, awk, wc, less - and the redirection operator '>' will help you when you are working with large files.

Using 1000 genomes data files

The 1000 genomes project was established in 2008, a few years after the human genome project (HGP) was completed in 2003. While the HGP was aimed at producing a single reference genome for Homo sapiens, the goal of the 1000 genomes project was to identify the variation present within the species by sequencing thousands of individuals using faster and cheaper next generation sequencing (NGS) technology. Both projects put their data online for researchers to download for free.

FASTA Files

The HGP packages the human genome into FASTA format files with one file for every human chromosome (1..22, X, Y) and an accessory FASTA file for the mitochondrial DNA (mtDNA). The FASTA format is a really simple text file format and its filenames tend to end in .fa, .fas or .fasta. The first line of the file consists of a descriptive header beginning with the greater-than symbol '>' followed by the name and/or description and/or unique identifier of the sequence. The lines after that are the sequence itself. Typically, 60 characters are present on a line, followed by a new line character '\n', and this pattern continues until the end of the sequence. Here is an example of part of the Human Cytochrome C gene:

>chromosome:GRCh38:7:25119491:25125961:-1
CTGTCTCTGCCGCAAATGCAGCACCTTCCTCAGTCTTGGGGCGTCCATCAAGCGGAGAGC
TGGAATGTGACTGAAAGTAACAGAGTATAGGTGGAACTAGAACAAGGTCACGAGCTGGTG
CCCAACTCGTGTAACTCCTGGTTCCAGGCTGTCCCTCTAGGATGCTGTTGGCGCCAGCCG
TGGTAGTAGGACAGAGTCTGGGTAAAGGTCGCCTGCTGACAAGATGGTAGCTAGGAGACC
TCTAGGTGCGCCAAAATTAGGGCGTCTTTTCC...(etc)

As you can see, the first line describes where the sequence was obtained from. GRCh38 is the latest version of the HGP reference sequence. The gene can be found on chromosome 7, starting at position 25119491 and ending at position 25125961. The -1 indicates it is encoded on the reverse strand. The lines that follow are the actual genetic code.
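Since every header line starts with '>', the command line tools from the first half of this post apply directly to FASTA files. For example, grep lists every sequence header in a file - handy for seeing what a file contains without opening it. A sketch on made-up demo data:

```shell
# list all sequence headers in a FASTA file without reading the sequences yourself
printf '>seq1 first sequence\nACGT\n>seq2 second sequence\nGGCC\n' > demo.fasta
grep '^>' demo.fasta
```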

The HGP packages each assembled chromosome's worth of genetic data into an individual FASTA file of between 50 and 250MB in size. You can download the gzip-compressed files by going to the ENSEMBL FTP site: ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/

An alternative to downloading each chromosome from the ENSEMBL site is to download an 860MB compressed file containing all the chromosomes in one big FASTA file from the 1000 genomes project. In this blog post I suggest downloading the slightly older GRCh37 version of the HGP, specifically file hs37d5.gz, because it is fully compatible with the VCF files mentioned below. You can obtain the reference genome from here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/

You don't have to decompress that gigantic FASTA file because I wrote a little utility application to extract specific chromosomes from the large gzip file and output the chromosome in its own FASTA file: https://github.com/webmasterar/extractChromosome


After you download a chromosome file from ENSEMBL you can just decompress it using your favourite decompression utility such as WinRAR, or if you are on Linux you can run the command: gzip -d Homo_sapiens.GRCh38.dna_rm.chromosome.21.fa.gz.

FASTA is a simple and flexible format. There are no restrictions on the number of sequences in a file as long as they each start with a header line followed by the sequence. Such files are typically known as multi-FASTA files, but they don't have a specific file suffix to indicate that. The format allows for comment lines starting with a semi-colon ';'. There are no restrictions on line length either and a whole sequence could be present on a single line. Typically, when reading FASTA files in a program, they are read line by line, stripping line-ending characters and spaces, skipping past empty and comment lines starting with ';', and storing the title and sequence in memory as strings ready to be searched or analysed.
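That reading loop is simple enough that awk can do a useful version of it in one line. A sketch that totals the sequence length in a file - skipping '>' header and ';' comment lines and stripping spaces and carriage returns, as described above (demo data is made up):

```shell
# total sequence length: skip '>' headers and ';' comments, strip spaces and CRs
printf '>seq1\n;a comment line\nACGT ACGT\nACGT\n' > demo.fasta
awk '!/^[>;]/ { gsub(/[ \r]/, ""); total += length($0) } END { print total }' demo.fasta
```

On the demo file this prints 12 (eight bases on the first sequence line plus four on the second).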

VCF Files

VCF files are also text files but are a more complex format than FASTA. VCF files are used to store variants, typically single nucleotide polymorphisms (SNPs) and consist of a set of mandatory and optional tab-separated columns. They can represent more complex variants consisting of insertions and deletions too. VCF files are used together with FASTA reference chromosome files and would be of little use by themselves. Each row of the VCF file is a record of a SNP/variant and points to a position in the reference chromosome where the variation occurs. So together, it is possible to map all variants from multiple donors onto the reference human genome.

VCF files are often distributed as specially compressed files that might resemble gzip files (since they end with .gz) but are actually compressed using BGZIP. The compression utility BGZIP is part of a package known as Tabix, both of which are now part of the htslib library and packaged with SAMTools. If you ever need to use BGZIP you need only download the Tabix package using your Linux distro's package manager, e.g. sudo apt-get install tabix. Tabix contains a library for reading tab-separated file formats and together with BGZIP it allows applications to read the tab-separated VCF format in its compressed state. This is especially important because VCF files are huge. Chromosome 1's compressed VCF file is 1.2GB in size and when it is decompressed it takes up around 70GB of disk space! Imagine having to decompress all the human chromosomes - a time-consuming and disk-consuming exercise that is ultimately unnecessary due to Tabix and its BGZIP compression algorithm.

The VCF format contains one record per line, with the columns within each line separated by tabs. The first few lines in the file are metadata and comments starting with a hash '#' character; they contain details about the file and its corresponding reference sequence, information about indels and other complex variants, as well as details about optional column headers. The mandatory headers are:

#CHROM - the chromosome number
POS - the position that the SNP/variant occurs in the reference sequence
ID - just the ID of the SNP/variant
REF - the base that should be found in the reference sequence at this position
ALT - the alternate base or bases that might be present here*
QUAL - the quality of the read
FILTER - only accept this record if it says 'PASS'
INFO - various details about the variant, such as allele counts and frequencies

*The ALT field may contain references to complex variants written in angle brackets like so '<ADD:INS:1>'.

Here is an example record from a VCF file:

#CHROM  POS    ID           REF  ALT  QUAL  FILTER  INFO                                         FORMAT  HG00096
10      60523  rs148087467  T    G    100   PASS    AC=0;AF=0.01;AFR_AF=0.06;AMR_AF=0.0028;AN=2  GT:GL   0|0:-0.19,-0.46,-2.28


You can see it is a T to G single nucleotide polymorphism occurring at position 60523 on chromosome 10.
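Because the columns are tab-separated, cut works well for a quick look at records. A sketch that shows the first five columns (CHROM, POS, ID, REF, ALT) of the first few records while skipping the '##' metadata - the demo data here is made up:

```shell
# build a tiny stand-in VCF, then show the first five columns of the records
printf '##meta\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\n10\t60523\trs1\tT\tG\t100\n' > demo.vcf
grep -v '^##' demo.vcf | cut -f1-5 | head -n 5
```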

Most VCF files list the variants in ascending order because many software applications that read VCF files expect them to be in order. It is possible to have multiple entries for the same position of a chromosome in VCF files and thus have duplicates. How to deal with this issue will be covered in a future blog post. Typically, you would not write your own parser for VCF files because the format and its compression are complex. For a C++ project I used VCFLib.
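You can sanity-check that ordering yourself with awk before handing a file to software that assumes it. A sketch that compares each record's POS column against the previous one (this assumes a per-chromosome file, so only POS needs checking; the demo data is made up and deliberately out of order):

```shell
# report any record whose POS is lower than the previous record's POS
printf '1\t100\n1\t200\n1\t150\n' > demo.vcf
awk -F'\t' '!/^#/ { if ($2+0 < prev+0) print "out of order at line " NR; prev = $2 }' demo.vcf
```

On the demo file this prints "out of order at line 3"; on a properly sorted file it prints nothing.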

The 1000 genomes project was performed in three major phases. The first or pilot phase obtained DNA from 179 individuals, the second (Phase 1) from 1092 individuals (including the pilots) and the third (Phase 3) from 2504 individuals (including the people from the first two phases). They were sequenced with improving NGS technologies and the data was collected into VCF files for each chromosome. Since the project started in 2008, the reference human genome at the time was GRCh37, and as such the VCF files use that as their reference. This is mentioned in the comments of the phase 3 VCF files. When you download the compressed (.gz) VCF file you will also need to download its accompanying Tabix index file (.tbi), which contains the index that helps Tabix read the file in its compressed state. You DO NOT need to decompress the VCF files. The files for phase 3 can be downloaded from here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/

In conclusion, to use the 1000 genomes files, you will need to download the reference human genome sequence chromosome by chromosome (or as one big 860MB compressed file), the VCF file for the chromosome and its accompanying TBI file. There is no need to decompress the files.