Ahmad's Blog: command line

Sunday, 27 November 2016

Working with huge files in Unix/Linux

The Unix or GNU/Linux command line has a variety of tools that are essential when working with gigantic files. Bioinformaticians often work with files that are many gigabytes in size. Recently, I was working with very large VCF and FASTA files and wanted to check their contents. Both files came gzipped so I needed to decompress them first.

~$ gzip -d file.vcf.gz
~$ gzip -d seq.fasta.gz

Now, what I had at the end of this was an 11GB VCF file and a 250MB fasta file - way too big to open up in any text editor. But I can use a variety of utilities available on the Linux command line to work with it.

To view the first few lines I could use head like so:

~$ head file.vcf

By default, this did not display enough lines to satisfy my requirements. I needed to view the first 100 lines so I used the -n paramter like so:

~$ head -n 100 file.vcf

I can also use head to grab the first few characters if I am reading from a FASTA file with a really long sequence on a single line. If I wanted to grab the first 1000 characters I can write:

~$ head -c 1000 seq.fasta

I can also use the same parameter to grab the last few characters or lines from the file using tail like so:

~$ tail -c 1000 seq.fasta

This is all good and well but what if I want to inspect only certain lines from a file - say from lines 10 - 20? I can use awk for that:

~$ awk 'NR==10,NR==20' file.vcf

I can also use awk to skip the first 10 lines like so:

~$ awk 'NR>10' file.vcf

OK. What if I want to count how many characters are in the second line of a FASTA file where, in some cases, a whole sequence put onto one line? I can use a combination of awk and wc to count the letters:

~$ awk 'NR==2' seq.fasta | wc -c

If you want to count how many lines and how many characters there are in a file you can use wc like this passing it both the l and c flags:

~$ wc -lc seq.fasta

Now, if I want to navigate through the whole VCF file to see what it contained, I could not use a text editor. Instead, I would use 'less'. 'less' is better than 'more' if you are more familiar with that. It allows you to navigate to certain line numbers and you can use PgUp and PgDn and the up and down arrows to navigate through a file. You press Q to quit back to your console when you are done.

~$ less file.vcf

Less is OK for some things but it takes quite some time to read a really really big file. It may sometimes be easier to make a smaller sample file from the big file so you can comfortably inspect it. Use the redirection operator '>' to output say 1000 lines to a new file called smaller.vcf:

~$ head -n 1000 file.vcf > smaller.vcf

OK. There's something I didn't mention before which may save you a lot of time if you are in a rush or really short of disk space - you don't need spend all that time decompressing the VCF or FASTA files! You can decompress them on the fly on the command line and pipe a little of the decompressed output to one of those command line tools mentioned above. For example, you just want to grab 1000 lines from file.vcf.gz which is 11GB when decompressed, but you don't want to wait an hour for it to decompress and you only have 4GB of free disk space to spare. Try this:

~$ gzip -dc file.vcf.gz | head -n 1000 > just1000lines.vcf

The trick is that gzip can decompress things out to the console using the -c flag. So while it's decompressing, the output is piped to head, which continues to collect data until 1000 lines are read and then instead of printing all those lines out to the console, we redirect the output to insert into a file: just1000lines.vcf. This all happens very quickly and efficiently in memory and only 1000 lines are written to disk, saving you time and disk space.

Hopefully these command line tools - gzip, head, tail, awk, wc, less - and the redirection operator '>' will help you when you are working with large files.

Wednesday, 13 April 2016

A quick guide to compiling C/C++ code and writing Makefiles (Part 1: Compiling)

This is the first of a 3 series post. This post will be about compiling. The second post will be about Makefiles. The third post will be about compiling libraries.

C/C++ coding is difficult enough as it is to code and debug without having to spend more time figuring out how write a good Makefile to compile the damn code and fix the Makefile bugs! In this post I will talk about compiling using the command line tools. But usually you don't write these commands manually on the console to build a project and be able to run the application. You write the commands in the Makefile and then you just build the project using the Makefile.

A Makefile is used to store the commands needed to build or clean a project. The user may download your project and just has to write the make command on the console to build the project to create the executable binary file that will run on their machine. In Windows you get given binaries like virus.exe with the .exe executable extension that just works on pretty much all Windows computers as long as there are no architectural requirements (64bit versions of software will not run on 32bit systems). But this is not the case for Unix/Linux. With all the different distros available and their internal complexities, it is not likely that a binary executable built on FreeBSD will also run on Ubuntu. Also there is no quick way to state a file is executable because .exe is not used to denote an executable file in the Linux world. So, for Linux mainly, the source code is often packaged so you can dowload and extract it, and you are expected to compile the code using the make command to create the executable for your machine in order to run it.

Which tool you compile your code with varies by operating system and there is a choice of compilers for each system anyway. On Windows I use MingW and on Linux I use GCC. There are often different compilers for C or C++ code too. It should be noted that C++ compilers are capable of compiling C code but sometimes complain about things that are not acceptable anymore in C++. I recommend compiling C code with a C compiler and C++ code with a C++ compiler. You can mix C and C++ code, but if you do, be prepared to edit your C code.

Anyway, compiling. The C compiler tool in Linux is gcc and the C++ compiler is g++. If you use MinGW to build a project on Windows, it will know what you are trying to compile your code with, so it's cool to carry on using gcc and g++. There are two sequential steps to compiling.

(dependent on the project) Make sure you have linked in any required library files (.a/.so) and their header files (.h).
Compile the source code (.c/.cpp) files into object(.o) files.
Combine the object files into one executable file.

Let's start with step two first because I will write another article about compiling and linking libraries at a later time. The following command will compile the code into object files for a hypothetical project consisting of two source files and the header file associated with the important program: exampleProg.c, reusableCmd.c and reusableCmd.h.

gcc -I . -c exampleProg.c reusableCmd.c

This will create the files exampleProg.o and reusableCmd.o. The flag -I . means find the header files required to compile this program are in the current directory (.). If we needed to include other header files in another directory, we would add an additional I flag like so: -I ./../dir2/libX.

Then we have to compile those files into an executable file. On Linux we don't give it an extension usually, but on Windows we specify it ends with .exe. To compile our files on Linux we do:

gcc -I . -o exampleProg exampleProg.o reusableCmd.o

And there you have it. The code will produce the executable file exampleProg, which is ready to run. The C++ compiler works in much the same way too. It is best to add some additional flags to optimize the executable file or improve debugging. Here are some flags I use a lot in my projects:

-Wall = Show all code warnings
-lm = Load the Math library if you #include <math .h%gt;- you need to add this flag explicitly if you include this header!
-funroll-loops = optimize loops
-O3 = Use maximum processor and memory optimizations
-fomit-frame-pointer = Omit some of the debugging pointers included in the executable - use this only once your code works perfectly
-msse4.2 = Use SSE4.2 machine instructions where available to improve performance - this is especially useful on modern Intel processors.
-std=c99 = This sets the expected standard of the compilation. You can also use an even newer standard such as c11 if your compiler supports it.

In the next post I will talk about Makefiles.