Thursday, March 31, 2011

Fedora Core 14 Mahout Installation



Background:
This blog post was inspired by a coworker's need to get Mahout up and running on one of our development machines (OS: FC14). There is no content here that cannot be found elsewhere; in particular, you can find more detailed information on the Hadoop Common documentation page: http://hadoop.apache.org/common/docs/current/single_node_setup.html
and at the Mahout project page: http://mahout.apache.org/

We are starting with a Fedora Core 14 32-bit guest machine running in VMware Workstation on a Windows XP SP3 host. All libraries required to get VMware Tools installed and running were pre-installed, so there may be some dependencies already in place that are not listed here. The Fedora Core instance is fully updated as of today, and SELinux has already been disabled.

First Step: Install Hadoop Common in Single-Node (Pseudo-Distributed) Mode:
The end goal is to run Mahout, but first we need to install an instance of Hadoop for Mahout to run on top of.

Looking at the Hadoop system requirements, I can see that Java 1.6.x or greater is required, as well as ssh and the sshd daemon. Let's get those installed and set up first.

Java 1.6x:
sudo yum install java-1.6.0-openjdk
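
To verify the install (the exact version string will vary with updates):
java -version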

ssh and sshd:
sudo yum install openssh openssh-server

Let's start the ssh daemon and make sure that sshd starts up automatically:
sudo /etc/init.d/sshd start
sudo chkconfig --add sshd
sudo chkconfig --levels 2345 sshd on
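
To confirm that the daemon is running and registered for the runlevels we asked for:
sudo /etc/init.d/sshd status
chkconfig --list sshd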

Next, let's set up the JAVA_HOME environment variable so that it is always available for all users. The first thing we need to do is figure out where Java actually lives:
which java
/usr/bin/java


OK, dimes to dollars says this is a symlink, so let's find out where it is pointing.
ls -la /usr/bin/java
/usr/bin/java -> /etc/alternatives/java


OK, dimes to dollars this is also a symlink, so one more time:
ls -la /etc/alternatives/java
/etc/alternatives/java -> /usr/lib/jvm/jre-1.6.0-openjdk/bin/java


OK, so now we know that our JAVA_HOME environment variable should be set to /usr/lib/jvm/jre-1.6.0-openjdk/. Let's get this done in a way that is friendly for all users.
sudo touch /etc/profile.d/java_home.sh
sudo su
echo "export JAVA_HOME=/usr/lib/jvm/jre-1.6.0-openjdk/" >> /etc/profile.d/java_home.sh
exit
source /etc/profile.d/java_home.sh
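
To double-check that the variable took (open a new shell if it does not show up):
echo $JAVA_HOME
/usr/lib/jvm/jre-1.6.0-openjdk/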


Alright, now go ahead and download Hadoop from the following location:
http://hadoop.apache.org/common/releases.html. I am going to install Hadoop version 0.20.2. A fair warning: as of right now Mahout is at version 0.4 - 0.5 and Hadoop is at version 0.21.0. Although the Mahout documentation states that it works with anything past Hadoop version 0.20.2, this is not true; I tried version 0.21.0 and received a linking error when I ran the examples. Hadoop version 0.20.2 does work with this version of Mahout, though.
You can choose to install Hadoop anywhere you would like; I am going to install it in /usr/local/ as a matter of personal preference. If you would like to install it in another location, just make sure to change the paths below accordingly.
OK, once you have the package downloaded, change to the download directory and:
tar xzf hadoop-0.20.2.tar.gz
sudo mv hadoop-0.20.2 /usr/local/


Now we need to set the HADOOP_HOME variable as before:
sudo touch /etc/profile.d/hadoop.sh
sudo su
echo "export HADOOP_HOME=/usr/local/hadoop-0.20.2" >> /etc/profile.d/hadoop.sh
exit
source /etc/profile.d/hadoop.sh
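
While we are at it, it can be convenient (though strictly optional) to put Hadoop's bin directory on the PATH as well, so you can run the hadoop commands without the full path. Note the single quotes, which keep the variables from expanding at write time:
sudo su
echo 'export PATH=$PATH:$HADOOP_HOME/bin' >> /etc/profile.d/hadoop.sh
exit
source /etc/profile.d/hadoop.sh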


One more thing we want to do before handing you off to the official documentation is to set up passphraseless ssh login (the following borrows heavily from the official documents):
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
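
Now test it; you should land in a shell on localhost without being prompted for a password (you may have to accept the host key the first time):
ssh localhost
exit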


At this point you are safe going to the Hadoop Common installation documentation and configuring your system in 'Pseudo-Distributed Mode'.
Please follow the instructions available here: http://hadoop.apache.org/common/docs/current/single_node_setup.html
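
For reference, on Hadoop 0.20.2 the pseudo-distributed setup from that page boils down to editing three files under $HADOOP_HOME/conf (the values below are the ones the documentation suggests; adjust to taste):

conf/core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

conf/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

conf/mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

Then format the namenode and bring the daemons up:
cd $HADOOP_HOME
bin/hadoop namenode -format
bin/start-all.sh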

Now on to Mahout!:

The Mahout documentation states that it requires Maven 2, and it looks like we will also need a copy of Subversion (no git?) to check out the project from source control. Let's get these requirements out of the way first.
sudo yum install maven2
sudo yum install subversion


Now let's check out the Mahout project and move it to the /usr/local/ directory (same disclaimer as above).
svn co http://svn.apache.org/repos/asf/mahout/trunk mahout
sudo mv mahout /usr/local


Same as above, we now need to set the MAHOUT_HOME environment variable (I am just appending it to the hadoop.sh profile script we already created).
sudo su
echo "export MAHOUT_HOME=/usr/local/mahout" >> /etc/profile.d/hadoop.sh
exit
source /etc/profile.d/hadoop.sh


OK, now let's run the compiler and installers. A note here: each compile process runs optional unit tests. For me, some of these unit tests were failing; if this happens for you, you may want to investigate. Even when the unit tests do not fail, they can take a very long time (we are talking go-get-coffee-and/or-lunch long). If you would like to make sure the unit tests do not run, for whatever reason, just pass -DskipTests=true to Maven. Enough talking, let's go.
cd $MAHOUT_HOME
sudo mvn -DskipTests=true install
cd core
sudo mvn compile
sudo mvn install
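
As a quick sanity check that the build produced what we will need for the examples (assuming the module layout has not changed), the examples module should have left a .job file behind:
ls $MAHOUT_HOME/examples/target/*.job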

Validate Installation, Run an Example:

To validate that everything is running correctly we are going to run the example listed here:
https://cwiki.apache.org/confluence/display/MAHOUT/Clustering+of+synthetic+control+data

<Section to be finished later>
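
In the meantime, the wiki page above boils down to roughly the following (the class name is from that page; adjust the .job filename to whatever your build actually produced):
wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
$HADOOP_HOME/bin/hadoop fs -mkdir testdata
$HADOOP_HOME/bin/hadoop fs -put synthetic_control.data testdata
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.5-SNAPSHOT.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job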

One More Gotcha!:
If you are like me then you are used to retrieving Hadoop results directly as text files. It turns out that Mahout natively stores its results in a binary format (Hadoop SequenceFiles) that is not human-readable. To get the data you are trying to view into a text format that you can open in the editor of your choice, you will first need to convert these files. More information on this process can be found here: https://cwiki.apache.org/MAHOUT/cluster-dumper.html
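
As a rough example (the option names have moved around between Mahout versions, so check $MAHOUT_HOME/bin/mahout clusterdump --help if this complains), dumping the clusters from a run like the one above might look something like:
$MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-10 --pointsDir output/clusteredPoints --output clusteranalyze.txt
Here output/clusters-10 stands in for whatever final clusters-N directory your run left behind.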

Fin:
We now have a running instance of Mahout that can be used for development-level machine learning. As always, if anyone has any questions or comments, please post them.
Thank you.
