I spent the last two days building a Hadoop cluster at work. Here are the steps you can go through to build your own.
This cluster uses:
- A fresh install of Ubuntu 10.04 LTS 64-bit Server (with only the OpenSSH package included on the install)
- Apache Hadoop 0.20.203.0 rc1 (the current stable version as of this writing)
Hadoop needs a specific version of Java to work correctly. These steps install the Java that Hadoop requires. The first line installs python-software-properties, which provides the add-apt-repository command.
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:sun-java-community-team/sun-java6
sudo apt-get update
sudo apt-get install sun-java6-jre sun-java6-bin
Next, download Hadoop; I found my copy on the Wayne.edu mirror.
Extract this file with the command
tar xvf hadoop-0.20.203.0rc1.tar.gz
and cd into the folder. The Apache Hadoop site has tests you can run in single node (1 computer) mode to make sure that Hadoop is working properly. Run those and make sure all is well. Getting Hadoop running as one node first is an important debugging step.
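As a concrete smoke test, the standalone example from the Apache quickstart can be run from inside the Hadoop directory. It runs a MapReduce grep job in a single JVM over the bundled config files (the examples jar name follows the 0.20.203.0 release; adjust the glob if your version differs):

```shell
# Standalone (single-JVM) smoke test from the Apache quickstart:
# grep the bundled XML config files for strings matching a regex.
mkdir input
cp conf/*.xml input
bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
cat output/*
```

If the job finishes and `output/` contains match counts, the basic install is healthy.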
Hadoop isn’t interesting as one node; you’ll want a multi-machine cluster to crunch large datasets. My cluster has 6 machines. With this many machines I can have 2 masters and 4 slaves. The first master is the NameNode and the second master is the JobTracker. The remaining 4 machines are the DataNodes and TaskTrackers.
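With that layout, the conf/masters and conf/slaves files on the NameNode might look like this (the hostnames are placeholders, not my actual machine names; note that in 0.20.x conf/masters lists the secondary NameNode host, while the JobTracker address is set in the XML config):

```
# conf/masters -- host(s) running the secondary NameNode
master2

# conf/slaves -- hosts running DataNodes and TaskTrackers
slave1
slave2
slave3
slave4
```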
You’ll need to set up passphrase-less SSH between all the boxes. There are a few good tutorials I referenced to get the cluster up and running.
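A minimal sketch of the key setup (the key path, user, and hostname below are examples, not from my cluster):

```shell
# Generate a passphrase-less RSA key (example path; adjust to taste)
ssh-keygen -q -t rsa -N "" -f ./hadoop_rsa

# Append the public key to authorized_keys on every node, e.g.:
#   ssh-copy-id -i ./hadoop_rsa.pub hadoop@slave1
# then confirm you can log in without a password prompt:
#   ssh -i ./hadoop_rsa hadoop@slave1 true
ls ./hadoop_rsa.pub
```

Repeat the copy step for every master/slave pair that needs to talk, including each machine to itself.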
I used this single-node reference and this multi-node tutorial, as well as the single-node and cluster-mode tutorials from Apache. Using all of these together I figured out how to get the cluster running.
One trick is that the configuration lives in one bash file and a few XML files. These configs need to be copied to all the nodes so that every machine has the same configuration.
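For reference, the XML files in question on 0.20.x are conf/core-site.xml, conf/mapred-site.xml, and conf/hdfs-site.xml (the bash file is conf/hadoop-env.sh, where JAVA_HOME is set). A minimal sketch with placeholder hostnames (master1 as NameNode, master2 as JobTracker, matching the layout above):

```
<!-- conf/core-site.xml : where HDFS lives -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master1:9000</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml : where the JobTracker lives -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master2:9001</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml : replicate each block across 3 of the 4 DataNodes -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```

To push identical configs everywhere, something like `for h in master2 slave1 slave2 slave3 slave4; do rsync -a conf/ $h:hadoop-0.20.203.0/conf/; done` works once passphrase-less SSH is in place (hostnames and install path here are assumptions).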
Now the fun begins with using the cluster.