Background:
This blog post was inspired by a coworker's need to get Mahout up and running on one of our development machines (OS: FC14). There is no content here that cannot be found elsewhere. In particular, you can find more detailed information on the Hadoop Common documentation page: http://hadoop.apache.org/common/docs/current/single_node_setup.html
and at the mahout project page: http://mahout.apache.org/
We are starting with a Fedora Core 14 32-bit guest machine running in VMware Workstation. The host is Windows XP SP3. All libraries required to get VMware Tools installed and running have been pre-installed, so there may be some dependencies already in place that are not listed here. The Fedora Core instance is fully updated as of today. SELinux has also already been disabled.
First Step: Install Hadoop Common in Single-Node (Pseudo-Distributed) Mode:
The end goal is to run Mahout, but we will first need to install an instance of Hadoop that Mahout can run on top of.
Looking at the Hadoop system requirements, I can see that Java 1.6.x or greater is required, as well as ssh and the sshd daemon. Let's get those out of the way and set them up first.
Java 1.6.x:
sudo yum install java-1.6.0-openjdk
ssh and sshd:
sudo yum install openssh openssh-server
Let's start the ssh daemon and make sure that sshd starts up automatically:
sudo /etc/init.d/sshd start
sudo chkconfig --add sshd
sudo chkconfig --levels 2345 sshd on
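If you want to double-check that sshd is actually running and registered to start at boot, something along these lines should do it (exact output will vary):
sudo /etc/init.d/sshd status
chkconfig --list sshd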
Next, let's set up the JAVA_HOME environment variable so that it is always available for all users. The first thing we need to do is figure out where JAVA_HOME should point:
which java
/usr/bin/java
OK, dimes to dollars this is a link, so let's find out where it is pointing.
ls -la /usr/bin/java
/usr/bin/java -> /etc/alternatives/java
OK, dimes to dollars this is also a link; one more time:
ls -la /etc/alternatives/java
/etc/alternatives/java -> /usr/lib/jvm/jre-1.6.0-openjdk/bin/java
OK, so now we know that our JAVA_HOME environment variable should be set to /usr/lib/jvm/jre-1.6.0-openjdk/. Let's get this done in a way that is friendly for all users.
sudo touch /etc/profile.d/java_home.sh
sudo su
echo "export JAVA_HOME=/usr/lib/jvm/jre-1.6.0-openjdk/" >> /etc/profile.d/java_home.sh
exit
source /etc/profile.d/java_home.sh
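To confirm the variable is being picked up, a quick sanity check (thanks to the source command above it should already be set in this shell; new logins will pick it up from /etc/profile.d):
echo $JAVA_HOME
$JAVA_HOME/bin/java -version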
Alright, now go ahead and download Hadoop from the following location:
http://hadoop.apache.org/common/releases.html . I am going to install Hadoop version 0.20.2. A fair warning: as of right now Mahout is at version 0.4-0.5 and Hadoop is at version 0.21.0. Although the Mahout documentation states that it works with anything past Hadoop version 0.20.2, this is not true; I tried version 0.21.0 and received a linking error when I ran the examples. Hadoop version 0.20.2 does work with this version of Mahout, though.
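If you prefer to grab the tarball from the command line, something along these lines should work; the archive.apache.org path below is an assumption on my part, so adjust it to whatever mirror the releases page points you at:
wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz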
You can choose to install Hadoop anywhere you would like; I am going to install it in /usr/local/ as a matter of personal preference. If you would like to install it in another location, just make sure to change your paths uniformly.
OK, once you have the package downloaded, change to the download directory and:
tar xzf hadoop-0.20.2.tar.gz
sudo mv hadoop-0.20.2 /usr/local/
Now we need to set the HADOOP_HOME variable as before:
sudo touch /etc/profile.d/hadoop.sh
sudo su
echo "export HADOOP_HOME=/usr/local/hadoop-0.20.2" >> /etc/profile.d/hadoop.sh
exit
source /etc/profile.d/hadoop.sh
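A quick check that the variable took and that the Hadoop binary actually runs:
echo $HADOOP_HOME
$HADOOP_HOME/bin/hadoop version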
One more thing we want to do before handing you off to the official documentation is to set up passphraseless ssh login (the following borrows heavily from the official docs):
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
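To verify this worked, ssh to localhost; you should get a shell without being asked for a password (you may have to accept the host key the first time):
ssh localhost
exit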
At this point you can safely go to the Apache Hadoop Common installation documentation and configure your system in 'Pseudo-Distributed Mode'.
Please follow the instructions available here: http://hadoop.apache.org/common/docs/current/single_node_setup.html
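For reference, once you have edited the conf files as those instructions describe (including pointing JAVA_HOME in conf/hadoop-env.sh at the path we found above), the pseudo-distributed startup boils down to roughly the following. If jps is missing, it lives in the full JDK (java-1.6.0-openjdk-devel); you can also check the NameNode web UI at http://localhost:50070 instead.
$HADOOP_HOME/bin/hadoop namenode -format
$HADOOP_HOME/bin/start-all.sh
jps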
Now on to Mahout!:
The Mahout documentation states that it requires Maven 2, and it looks like it will also require a copy of Subversion (no git?) to check out the project from source control. Let's get these requirements out of the way first.
sudo yum install maven2
sudo yum install subversion
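A quick sanity check that both tools landed on the path:
mvn --version
svn --version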
Now let's check out the Mahout project and move it to the /usr/local/ directory (same disclaimer as above).
svn co http://svn.apache.org/repos/asf/mahout/trunk mahout
sudo mv mahout /usr/local
Same as above, we now need to set the MAHOUT_HOME environment variable.
sudo su
echo "export MAHOUT_HOME=/usr/local/mahout" >> /etc/profile.d/hadoop.sh
exit
source /etc/profile.d/hadoop.sh
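As before, confirm the variable is set and points at the checkout (the top-level pom.xml should be there):
echo $MAHOUT_HOME
ls $MAHOUT_HOME/pom.xml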
echo "export MAHOUT_HOME=/usr/local/mahout" >> /etc/profile.d/hadoop.sh
exit
source /etc/profile.d/hadoop.sh
OK, now let's run the compiler and installers. A note here: each compile process runs optional unit tests. For me, some of these unit tests failed, and if that happens for you, you may want to investigate. Even when the unit tests pass, they can take a very long time (we are talking go-get-coffee-and/or-lunch long). If you would like to skip the unit tests for whatever reason, just pass -DskipTests=true to Maven. Enough talking, let's go.
cd $MAHOUT_HOME
sudo mvn -DskipTests=true install
cd core
sudo mvn compile
sudo mvn install
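If the build finished cleanly, the examples job jar that we will use to validate the install should be sitting under the examples module; the exact version number in the file name depends on whatever trunk is at when you check out:
ls $MAHOUT_HOME/examples/target/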
Validate Installation, Run an Example:
To validate that everything is running correctly we are going to run the example listed here:
https://cwiki.apache.org/confluence/display/MAHOUT/Clustering+of+synthetic+control+data
<Section to be finished later>
One More Gotcha!:
If you are like me, then you are used to retrieving Hadoop results directly as text files. It turns out that Mahout natively stores results in a binary format that is not human-readable. In order to get the data you are trying to view into a text format that you can open in the editor of your choice, you will first need to convert these files. More information on this process can be found here: https://cwiki.apache.org/MAHOUT/cluster-dumper.html
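As a rough example of what that conversion looks like (the option names here are from memory, so double-check them against the wiki page, and the output directory names depend on which clustering job you ran), the cluster dumper is driven through the mahout script:
$MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-10 --pointsDir output/clusteredPoints --output clusteranalysis.txt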
Fin:
We now have a running instance of Mahout that can be used for development-level machine learning. As always, if anyone has any questions or comments, please post them.
Thank you.