Friday 29 April 2016

ISO-639-2 list of languages and their directions

I collated a list of the languages (ISO-639-2) and the direction they are most commonly written in and provided it as a TXT or CSV file for download here: https://github.com/webmasterar/languagesWithDirections

Examples:

ara,,ar,Arabic,arabe,rtl
chi,zho,zh,Chinese,chinois,ttb

eng,,en,English,anglais,ltr

This is mostly useful with the correctly displaying text on websites using the HTML bdo tag or CSS styles direction and/or writing-mode.

Sunday 17 April 2016

Latitude and Longitude to Timezone PHP script

I converted a Java code I found online that was really useful to PHP so anyone can use it now. It basically returns the timezone of a location given its latitude and longitude. Check out the example:


<?php
include 'TimezoneMapper.php';

$london = ['lat' => 51.504380, 'lng' => -0.127727];
echo TimezoneMapper::latLngToTimezoneString($london['lat'], $london['lng']);
//outputs: Europe/London

You can get the code here: https://github.com/webmasterar/TimezoneMapperPHP

Wednesday 13 April 2016

A quick guide to compiling C/C++ code and writing Makefiles (Part 1: Compiling)


This is the first of a 3 series post. This post will be about compiling. The second post will be about Makefiles. The third post will be about compiling libraries.

C/C++ coding is difficult enough as it is to code and debug without having to spend more time figuring out how write a good Makefile to compile the damn code and fix the Makefile bugs! In this post I will talk about compiling using the command line tools. But usually you don't write these commands manually on the console to build a project and be able to run the application. You write the commands in the Makefile and then you just build the project using the Makefile.

A Makefile is used to store the commands needed to build or clean a project. The user may download your project and just has to write the make command on the console to build the project to create the executable binary file that will run on their machine. In Windows you get given binaries like virus.exe with the .exe executable extension that just works on pretty much all Windows computers as long as there are no architectural requirements (64bit versions of software will not run on 32bit systems). But this is not the case for Unix/Linux. With all the different distros available and their internal complexities, it is not likely that a binary executable built on FreeBSD will also run on Ubuntu. Also there is no quick way to state a file is executable because .exe is not used to denote an executable file in the Linux world. So, for Linux mainly, the source code is often packaged so you can dowload and extract it, and you are expected to compile the code using the make command to create the executable for your machine in order to run it.

Which tool you compile your code with varies by operating system and there is a choice of compilers for each system anyway. On Windows I use MingW and on Linux I use GCC. There are often different compilers for C or C++ code too. It should be noted that C++ compilers are capable of compiling C code but sometimes complain about things that are not acceptable anymore in C++. I recommend compiling C code with a C compiler and C++ code with a C++ compiler. You can mix C and C++ code, but if you do, be prepared to edit your C code.

Anyway, compiling. The C compiler tool in Linux is gcc and the C++ compiler is g++. If you use MinGW to build a project on Windows, it will know what you are trying to compile your code with, so it's cool to carry on using gcc and g++. There are two sequential steps to compiling.
  1. (dependent on the project) Make sure you have linked in any required library files (.a/.so) and their header files (.h).
  2. Compile the source code (.c/.cpp) files into object(.o) files.
  3. Combine the object files into one executable file.
Let's start with step two first because I will write another article about compiling and linking libraries at a later time. The following command will compile the code into object files for a hypothetical project consisting of two source files and the header file associated with the important program: exampleProg.c, reusableCmd.c and reusableCmd.h.

gcc -I . -c exampleProg.c reusableCmd.c

This will create the files exampleProg.o and reusableCmd.o. The flag -I . means find the header files required to compile this program are in the current directory (.). If we needed to include other header files in another directory, we would add an additional I flag like so: -I ./../dir2/libX.

Then we have to compile those files into an executable file. On Linux we don't give it an extension usually, but on Windows we specify it ends with .exe. To compile our files on Linux we do:

gcc -I . -o exampleProg exampleProg.o reusableCmd.o

And there you have it. The code will produce the executable file exampleProg, which is ready to run. The C++ compiler works in much the same way too. It is best to add some additional flags to optimize the executable file or improve debugging. Here are some flags I use a lot in my projects:

-Wall = Show all code warnings
-lm = Load the Math library if you #include <math .h%gt;- you need to add this flag explicitly if you include this header!
-funroll-loops = optimize loops
-O3 = Use maximum processor and memory optimizations
-fomit-frame-pointer = Omit some of the debugging pointers included in the executable - use this only once your code works perfectly
-msse4.2 = Use SSE4.2 machine instructions where available to improve performance - this is especially useful on modern Intel processors.
-std=c99 = This sets the expected standard of the compilation. You can also use an even newer standard such as c11 if your compiler supports it.


In the next post I will talk about Makefiles.

Tuesday 12 April 2016

Counting Sort (index-based sort) reinvented

So the past weekend I independently came up with what I would later find out was the counting sort and I went as far as implementing it before I found out it already existed. Awkward. I initially called it the quick dictionary sort and afterwards renamed it to the quick tally sort (qtsort) because a "dictionary" is a list of words and not individual characters and numbers which is what this algorithm is used for.

The counting sort is a linear time O(n) sorting algorithm because it does not work by an element value comparison technique, which has a proven best worst case complexity of O(n log n). Many sorting algorithms are O(n2) in the worst case, but satisfyingly the counting sort is linear. That's a revelation when it comes to sorting because it is so fast! The best thing about the counting sort is that it is so elegantly simple and can take up little extra memory too. The space complexity is also very small if the range of the elements is small O(nmax - nmin).

Let's go through an example with a character array to clarify. Imagine a string t made of a set of characters: t = {A,D,B,E,A}. We know that the ascii value of character A is 65, B is 66 and the rest of the alphabet is ordered sequentionally in the same way. We can create an array in the range of characters in t to hold all the unique values. The character A is the smallest value (65) in t and the character E is the largest (69).

We create an array T, which stands for tally, and it should hold the frequency each character occurs in text t. As arrays start at index 0, we can make any character in t, A through D, be represented by an index of our array by subtracting the minimum value from each characters. Also, our array T, will be of size = tmax - tmin + 1, or 5 = 69 - 65 + 1. To let the letter A be represented as the first index (0) we do: index = A - tmin. T looks like this when we initialize it and we haven't gone through the text yet:

 idx|val
+---+---+
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
+---+---+

Notice that there is position 2 in the array even when we don't have the character C in t. We go through each character in t and each time we increment the value at the calculated index of that character, so eventually our T array looks like this. We have two A's, and one each of B, D and E.

 idx|val
+---+---+
| 0 | 2 |
| 1 | 1 |

| 2 | 0 |
| 3 | 1 |
| 4 | 1 |
+---+---+

The T array is filled out using a simple loop which goes through each character of t and increments the counter, so by the end of the loop it has made a tally of how often each character occurs. This action is represented by this bit of psuedo code:

foreach char in t:
    T[char - tmin] = T[char - tmin] + 1

Now that we have filled the tally array, we can go through it in sequential order and wherever the value is 1 or more, we print that many of the character. This can be implemented with a simple for loop through the range of the items in T. With each index in the array, we work out which character it really represents by adding tmin back to it, so with the first element, c = idx + tmin, A= 65 = 0 + 65; and with the next index, B = 66 = 1 + 65. If the value at an index in T is 0, meaning no characters exist for it in t, then we skip it and go to the next index. Here is the psuedo code:

foreach idx,val in T:
    if val > 0:
        c = idx + tmin
        while val > 0:
            print c
            val = val - 1


This simple bit of code would then print out t in the correct order: AABDE. The algorithm runs linear in time O(n), but is limited by the range of the values, so if you had an huge range of numbers it would occupy a lot of memory for T and would need to go through all of T in order to print it out. Therefore the Counting sort is best applied to long strings or arrays of numbers within a small range of values. It is unfortunately unsuitable for use with other types of data such as floats or words. It's role lies in sorting lists of numbers, such as table indexes, and characters very quickly and using little memory. It is better to use it than other sorts for certain datasets and will yield a result faster than O(n log n) time.

You can check out a working example on github written in C and works only with characters currently, but can be modified with ease to work with larger integers. You don't need to specify tmin or tmax because the program works that out in linear time too! Here it is: https://github.com/webmasterar/qtsort