UMBC High Performance Computing Facility
Temporary Hadoop cluster
Hadoop is a software framework implemented in Java for distributed data storage and processing. In this tutorial, we will show you how to initiate a Hadoop instance, how to set up your local environment to run Hadoop, how to create Hadoop functions, how to write files to the Hadoop Distributed File System (HDFS), and how to retrieve results from HDFS once your job completes.

How to initiate a Hadoop instance

start_hadoop is a SLURM wrapper that provides a working method of starting a Hadoop instance on the cluster. Note that only a total of 16 nodes can be used at any given time.
[schou@maya-usr1 ~]$ start_hadoop
...
===  hdfs229969  ===
...
To check the number of Hadoop instances available:
[schou@maya-usr1 ~]$ scontrol show lic hadoop
LicenseName=hadoop
    Total=16 Used=0 Free=16 Remote=no

A walk-through of a basic word count example

To ensure all necessary environment variables and modules are loaded, type the following, where the instance name is the identifier printed by start_hadoop (e.g. hdfs229969 above):
$ module load default-environment
$ module load hadoop/[instance name here]
Now that you have these modules loaded, let's create a new Java file called WordCount.java, which simply counts how many times each unique word appears and will act as our Hadoop processing function:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

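    // map() is called once per line of input and emits (word, 1) for every whitespace-delimited token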
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());

        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

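    // reduce() is called once per unique word and sums all of the counts emitted for that word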
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
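    // The reducer also serves as a combiner, pre-summing counts on each map task before the shuffle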
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Download: ../code/basicWC-Hadoop/WordCount.java
Once you have created this file, compile it and package it into a jar file using the following commands:
$ hadoop com.sun.tools.javac.Main WordCount.java
$ jar cf wc.jar WordCount*.class
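If the com.sun.tools.javac.Main helper is unavailable in your environment, compiling directly against the Hadoop classpath (assuming a full JDK is on your PATH) should produce the same jar:
$ javac -classpath $(hadoop classpath) WordCount.java
$ jar cf wc.jar WordCount*.class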
Now that we have our Hadoop function compiled and packaged into a jar, we need data to process. For now we will use this single text file:
Hello world
Goodbye world

Download: ../code/basicWC-Hadoop/test.txt
We also need to create a directory in HDFS and load the test file into the directory, which we can do with the following commands:
$ hadoop fs -mkdir /user/slet1/input
$ hadoop fs -put test.txt /user/slet1/input/
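These shell commands have programmatic equivalents in Hadoop's FileSystem Java API. If you would rather script the upload, a minimal sketch might look like the following (the class name HdfsUpload and the hard-coded paths are placeholders, not part of this tutorial's downloadable code):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUpload {
  public static void main(String[] args) throws Exception {
    // Picks up the cluster settings provided by the loaded hadoop module
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Equivalent of 'hadoop fs -mkdir /user/slet1/input'
    fs.mkdirs(new Path("/user/slet1/input"));
    // Equivalent of 'hadoop fs -put test.txt /user/slet1/input/'
    fs.copyFromLocalFile(new Path("test.txt"), new Path("/user/slet1/input/"));

    fs.close();
  }
}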
We can do a Hadoop 'ls' to check whether the file was loaded correctly; if it was, the output should look something like the following:
$ hadoop fs -ls /user/slet1/input
Found 1 items
-rw-------   3 slet1 hadoop         82 2015-04-08 14:24 /user/slet1/input/test.txt
The data is now stored in HDFS, so we can run the function we created earlier:
$ hadoop jar wc.jar WordCount /user/slet1/input/test.txt /user/slet1/output
In this command we are telling Hadoop to run the jar archive wc.jar, using the class named WordCount as its entry point (the general form is hadoop jar <jar file> <main class> [arguments]). The remaining arguments say that the file we want to process is /user/slet1/input/test.txt and that output should be written to the directory /user/slet1/output.

Once the job completes, do an 'ls' on the directory that you wrote your output to and you should see something close to the following:

$ hadoop fs -ls /user/slet1/output
Found 2 items
-rw-------   3 slet1 hadoop          0 2015-04-08 14:28 /user/slet1/output/_SUCCESS
-rw-------   3 slet1 hadoop         88 2015-04-08 14:28 /user/slet1/output/part-r-00000
We want to print out the contents of the file named 'part-r-00000', so we'll use a Hadoop 'cat' operation:
$ hadoop fs -cat /user/slet1/output/part-r-00000
Goodbye 1
Hello   1
world   2
To retrieve the output file from HDFS and print its contents locally, we can use the following commands:
$ hadoop fs -get /user/slet1/output/part-r-00000 output.txt
$ cat output.txt
Goodbye 1
Hello   1
world   2
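Both of these retrieval steps can also be done programmatically through Hadoop's FileSystem API. The sketch below is only illustrative; the class name HdfsFetch and the hard-coded paths are placeholders rather than part of this tutorial's code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsFetch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path result = new Path("/user/slet1/output/part-r-00000");

    // Equivalent of 'hadoop fs -cat': stream the file contents to stdout
    IOUtils.copyBytes(fs.open(result), System.out, 4096, false);

    // Equivalent of 'hadoop fs -get': copy the file into the local working directory
    fs.copyToLocalFile(result, new Path("output.txt"));

    fs.close();
  }
}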
Now, to clean up the directories after having run our job, type the following:
$ hadoop fs -rm -r /user/slet1/input /user/slet1/output

Example with command line arguments

In this example we will show how to pass command line arguments to a Hadoop function and have the function process multiple files. We will use a function very similar to the one above, with a few key changes. As the file name WordSearch.java implies, this function searches for a specific word passed on the command line and prints out the number of times it appears in the dataset. An important note: to pass arguments from the 'main' function to either the Map or Reduce classes, you must pack the data into a Configuration object, which you can then access from within either of those classes.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordSearch {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
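      // Retrieve the search term that main() stored in the job Configuration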
      String search_phrase = context.getConfiguration().get( "search" );
      while (itr.hasMoreTokens() ) {
        word.set(itr.nextToken().toLowerCase());
        if( word.toString().equals( search_phrase ) )
          context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
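    // Pack the search phrase (the third command line argument) into the Configuration so the Mapper can read it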
    conf.set( "search", args[ 2 ] );
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordSearch.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Download: ../code/commandWC-Hadoop/WordSearch.java
To compile the code and package it into a jar file, use the following commands:
$ hadoop com.sun.tools.javac.Main WordSearch.java
$ jar cf ws.jar WordSearch*.class
This time, for our datasets, we will use two larger files: const.txt and decl.txt. Like before, we need to create an input directory in HDFS and copy our files into it:
$ hadoop fs -mkdir /user/slet1/input
$ hadoop fs -put const.txt /user/slet1/input/
$ hadoop fs -put decl.txt /user/slet1/input/
Doing a Hadoop 'ls' should display the following:
$ hadoop fs -ls /user/slet1/input
Found 2 items
-rw-------   3 slet1 hadoop      45771 2015-04-08 16:37 /user/slet1/input/const.txt
-rw-------   3 slet1 hadoop       8031 2015-04-08 16:37 /user/slet1/input/decl.txt
Now to run our jar file, we use the following command:
$ hadoop jar ws.jar WordSearch /user/slet1/input/* /user/slet1/output the
Printing the output file gives us the following result:
$ hadoop fs -cat /user/slet1/output/*
the     802
The format is the same as when we submitted the job before, but we have added an additional argument to the end of the list: the phrase we are searching for. Note that the mapper lowercases every token before comparing, so the search phrase should be given in lowercase and the count includes capitalized occurrences such as 'The'.
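To search for a different word, remove the old output directory first (Hadoop will not write into an existing output path) and rerun the job with the new search phrase as the final argument, for example:
$ hadoop fs -rm -r /user/slet1/output
$ hadoop jar ws.jar WordSearch /user/slet1/input/* /user/slet1/output we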

As before, to clean up the directories after running the job, type the following:

$ hadoop fs -rm -r /user/slet1/input /user/slet1/output