Sunday 27 November 2016

Working with huge files in Unix/Linux

The Unix or GNU/Linux command line has a variety of tools that are essential when working with gigantic files. Bioinformaticians often work with files that are many gigabytes in size. Recently, I was working with very large VCF and FASTA files and wanted to check their contents. Both files came gzipped, so I needed to decompress them first.

~$ gzip -d file.vcf.gz
~$ gzip -d seq.fasta.gz
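
One thing to be aware of: gzip -d deletes the compressed file once it has finished. If you want to keep the .gz copy around, recent versions of gzip (1.6 and later) have a -k (keep) flag:

~$ gzip -dk file.vcf.gz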

Now, what I had at the end of this was an 11GB VCF file and a 250MB FASTA file - way too big to open in any text editor. But I can use a variety of utilities available on the Linux command line to work with them.

To view the first few lines I could use head like so:

~$ head file.vcf

By default, this did not display enough lines to satisfy my requirements. I needed to view the first 100 lines, so I used the -n parameter like so:

~$ head -n 100 file.vcf

I can also use head to grab the first few characters if I am reading from a FASTA file with a really long sequence on a single line. If I want to grab the first 1000 characters I can write:

~$ head -c 1000 seq.fasta

I can also use the same parameter to grab the last few characters or lines from the file using tail like so:

~$ tail -c 1000 seq.fasta
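
And if I want the last 100 lines rather than the last few characters, I just swap -c for -n:

~$ tail -n 100 file.vcf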

This is all well and good, but what if I want to inspect only certain lines from a file - say, lines 10 to 20? I can use awk for that:

~$ awk 'NR==10,NR==20' file.vcf
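
If you prefer sed, the equivalent is below; the -n flag stops sed printing every line by default, and the 20q tells it to quit at line 20 rather than reading the remaining gigabytes:

~$ sed -n '10,20p;20q' file.vcf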

I can also use awk to skip the first 10 lines like so:

~$ awk 'NR>10' file.vcf 
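
tail offers an alternative here: with a leading + sign, the -n parameter means 'start output at this line number', so printing from line 11 onwards skips the first 10 lines:

~$ tail -n +11 file.vcf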

OK. What if I want to count how many characters are in the second line of a FASTA file where, in some cases, a whole sequence is put onto one line? I can use a combination of awk and wc to count the letters:

~$ awk 'NR==2' seq.fasta | wc -c
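
One caveat: wc -c also counts the newline character at the end of the line, so the answer is one more than the number of letters in the sequence. To count only the letters - and to make awk stop reading after line 2 instead of scanning the rest of a huge file - I can write:

~$ awk 'NR==2{print; exit}' seq.fasta | tr -d '\n' | wc -c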

If you want to count how many lines and how many characters there are in a file, you can use wc, passing it both the -l and -c flags:

~$ wc -lc seq.fasta
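
The counts come out in the order lines first, then bytes, then the filename. Strictly speaking, -c counts bytes rather than characters; for plain ASCII files like most FASTA files the two are the same, but if you ever deal with multi-byte text you can use -m to count characters instead:

~$ wc -lm seq.fasta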

Now, if I want to navigate through the whole VCF file to see what it contains, a text editor is out of the question. Instead, I use 'less'. less is an improved version of 'more', if you are familiar with that. It lets you jump to particular line numbers, and you can use PgUp, PgDn and the up and down arrow keys to move through the file. Press q to quit back to your console when you are done.

~$ less file.vcf
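
A few less keystrokes are worth knowing for big files: type a number followed by g to jump to that line (100g goes to line 100), G to jump to the end, and / followed by a pattern to search forwards. Bear in mind that jumping to the end of an 11GB file means less has to read all the way through it first. You can also start less with line numbers displayed:

~$ less -N file.vcf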

less is OK for some things, but paging through a really, really big file takes quite some time. It is often easier to make a smaller sample file from the big one so you can inspect it comfortably. Use the redirection operator '>' to send, say, the first 1000 lines to a new file called smaller.vcf:

~$ head -n 1000 file.vcf > smaller.vcf
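
head can only take lines from the top of the file, but the awk range trick from earlier lets you sample from anywhere - for example, to pull lines 5000 to 6000 into their own file:

~$ awk 'NR==5000,NR==6000' file.vcf > middle.vcf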

OK. There's something I didn't mention before which may save you a lot of time if you are in a rush, or a lot of space if you are short of disk: you don't need to spend all that time decompressing the VCF or FASTA files at all! You can decompress them on the fly on the command line and pipe a little of the decompressed output to one of the command line tools mentioned above. Say you just want to grab 1000 lines from file.vcf.gz, which is 11GB when decompressed, but you don't want to wait an hour for it to decompress and you only have 4GB of free disk space to spare. Try this:

~$ gzip -dc file.vcf.gz | head -n 1000 > just1000lines.vcf

The trick is that gzip can write its decompressed output to the console using the -c flag. While it decompresses, the output is piped to head, which collects data until it has read 1000 lines; then, instead of printing those lines to the console, we redirect them into a file: just1000lines.vcf. Better still, head exits as soon as it has its 1000 lines, which makes gzip stop too, so the full 11GB is never produced. Everything happens quickly in memory and only 1000 lines are written to disk, saving you time and disk space.
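
As an aside, on most Linux systems zcat is shorthand for gzip -dc, so the same pipeline can be written a little more compactly:

~$ zcat file.vcf.gz | head -n 1000 > just1000lines.vcf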

Hopefully these command line tools - gzip, head, tail, awk, sed, tr, wc, less - and the redirection operator '>' will help you when you are working with large files.
