GNU Parallel is a great utility for parallelizing computations from the command line. The book Data Science at the Command Line discusses, amongst several other things, how to use GNU Parallel to distribute your data over different machines. The toy example in the book makes three assumptions:
(1) all machines you are using are running Ubuntu or some variant of Linux
(2) you are using a bunch of Amazon EC2 instances to do your parallelization (and hence need to find out the IPs of all your instances in a non-straightforward way)
(3) you are using GNU paste that comes pre-installed on all Ubuntu systems
This tutorial instead assumes that
(1) you are primarily using OS X and might have some Ubuntu machines as some of your instances
(2) all your machines are local (as in connected through a LAN)
(3) you are using the OS X (BSD) variant of paste, which behaves slightly differently from the GNU version on Ubuntu
Here is a walkthrough that replicates the toy example in the book and highlights the changes you’ll need to make in an OS X environment.
First, you can install GNU Parallel on OS X through Homebrew:
$ brew install parallel
Next, create your instances file (named ‘instances’), and add the hostnames of your local machines as shown in the screenshot. In my case, Cadmius happens to be Ubuntu 14.04, and macusers-Macbook is running OS X Mavericks. The main machine through which I am parallelizing things is also running OS X Mavericks.
[Screenshot: the instances file]
You do not need to have Parallel installed on the ‘slave’ machines, but you might want to in case you want finer control.
If you don’t want to enter your SSH password every time Parallel talks to the slave machines (it uses SSH underneath), you might want to enable password-free, key-based login to your slaves.
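One way to set this up is sketched below, assuming OpenSSH on both ends. The helper name enable_passwordless_login and the login alice@Cadmius are made up for illustration, and the empty key passphrase is a convenience trade-off rather than a recommendation.

```shell
#!/bin/sh
# Hypothetical helper (not from the book): copy the local public key to one
# remote login so that later SSH connections skip the password prompt.
enable_passwordless_login() {
  login=$1                      # e.g. alice@Cadmius (made-up username)
  key="$HOME/.ssh/id_rsa"

  # Make sure the local .ssh directory exists with strict permissions.
  mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"

  # Create a key pair only if there is none yet. The empty passphrase (-N "")
  # is an assumption made for convenience.
  [ -f "$key" ] || ssh-keygen -t rsa -N "" -f "$key"

  # Append the public key on the remote side; this asks for the password once.
  ssh "$login" 'mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys' < "$key.pub"
}
```

Calling the helper once per entry in the instances file (for example, enable_passwordless_login alice@Cadmius) should be enough; afterwards, ssh alice@Cadmius hostname should return without prompting.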
Notice also that each entry in my instances file includes the username along with the hostname. This is another difference from the book, which uses EC2 instances and doesn’t need a different username for each IP.
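For readers who cannot see the screenshot, an instances file of this shape can be written as follows. The hostnames are the ones used in this post; the usernames alice and bob are placeholders, so substitute your own accounts.

```shell
#!/bin/sh
# One SSH login per line, in user@host form.
# alice and bob are hypothetical usernames; the hostnames come from this post.
cat > instances <<'EOF'
alice@Cadmius
bob@macusers-Macbook
EOF

# Show what was written.
cat instances
```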
Great, now we’re ready to test if everything went well. Run the following command on the master machine:
$ seq 1000 | parallel -N100 --pipe --slf instances "(hostname; wc -l) | paste -sd: -"
[Screenshot: testing GNU Parallel]
Notice the additional '-' after the arguments to paste. It tells paste to read from standard input, and it is required on OS X, whose BSD paste insists on an explicit file operand. The book doesn’t have it because GNU paste on Linux reads standard input by default. Without it, OS X will complain. With it, both OS X and Ubuntu are happy (although you can still see differences in the output from the two kinds of machines).
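A quick local check of the portable form, which needs neither SSH nor Parallel. The three-letter input is just an illustration.

```shell
#!/bin/sh
# The trailing "-" names standard input explicitly, so the same command
# works with both BSD paste (OS X) and GNU paste (Ubuntu).
joined=$(printf 'a\nb\nc\n' | paste -sd: -)
echo "$joined"   # prints a:b:c
```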
Apart from that difference, the command is copied from the book; it generates a sequence of 1000 numbers and distributes them to the slaves in blocks of 100 lines. The output shows each hostname along with how many numbers that slave received. Notice the error message. I mentioned above that Parallel doesn’t need to be installed on all machines for basic usage. In this case I did install Parallel on the master and the two slaves, but it seems Parallel doesn’t like the fact that macusers-Macbook is an older machine with a Core 2 Duo? Not sure about that. Cadmius happens to have 8 CPUs (and 32 cores).
Finally, run the following to sum the numbers in parallel and then sum the 10 sums on the master machine:
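To see what each job receives without involving any remote machine, the chunking can be imitated locally. This is only a rough stand-in that uses split in place of Parallel’s -N100 --pipe; the variable names are made up. Every line should read hostname:100 on Linux, while BSD wc on OS X pads the count with leading spaces, which may be the output difference mentioned above.

```shell
#!/bin/sh
# Local emulation of -N100 --pipe: no SSH or GNU Parallel involved.
workdir=$(mktemp -d "${TMPDIR:-/tmp}/chunks.XXXXXX")
seq 1000 | split -l 100 - "$workdir/chunk."   # 10 chunks of 100 lines each

# Run the same per-chunk pipeline that the remote machines would run.
report=$(for chunk in "$workdir"/chunk.*; do
  (hostname; wc -l < "$chunk") | paste -sd: -
done)
printf '%s\n' "$report"
rm -rf "$workdir"
```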
$ seq 1000 | parallel -N100 --pipe --slf instances "paste -sd+ - | bc" | paste -sd+ - | bc
This command is also from the book and will give you the sum (500500): each of the 10 jobs summed its own 100 numbers on a slave machine, and the 10 partial sums were then added up locally. (Again, notice the additional '-' in the paste commands.)
The purpose of this post has been to highlight the changes I had to make to run the toy example from the book on a mixture of OS X and Linux machines on a local network. Thanks for reading.
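The two-stage sum can be sanity-checked on a single machine. The sketch below is an approximation, not the book’s command: split stands in for Parallel’s chunking and shell arithmetic stands in for bc, so it runs even where neither Parallel nor bc is installed.

```shell
#!/bin/sh
# Local emulation of the two-stage sum: no SSH, GNU Parallel, or bc involved.
tmpdir=$(mktemp -d "${TMPDIR:-/tmp}/sums.XXXXXX")
seq 1000 | split -l 100 - "$tmpdir/part."   # 10 chunks of 100 lines each

# Stage 1: one partial sum per chunk (what each remote job would compute).
partial_sums=$(for f in "$tmpdir"/part.*; do
  echo $(( $(paste -sd+ "$f") ))
done)

# Stage 2: add the 10 partial sums together.
total=$(( $(printf '%s\n' "$partial_sums" | paste -sd+ -) ))
echo "$total"   # prints 500500
rm -rf "$tmpdir"
```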