Getting Started with Hadoop

MongoDB and Hadoop are a powerful combination that can deliver complex analytics and data processing for data stored in MongoDB. The following guide shows how to start working with the MongoDB Connector for Hadoop. Once you are familiar with the connector, you can use it to pull your MongoDB data into Hadoop MapReduce jobs, process the data, and write the results back to a MongoDB collection.

Prerequisites

Hadoop

To follow this guide, you should already have Hadoop up and running. This can be anything from a deployed multi-node cluster to a single-node, pseudo-distributed installation running locally. As long as you are able to run any of the examples on your Hadoop installation, you should be all set; a quick sanity check follows the version list below. The following versions of Hadoop are currently supported:

  • 0.20/0.20.x
  • 0.21/0.21.x
  • 1.0/1.0.x
  • 1.1/1.1.x
  • CDH3
  • CDH4
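
Before moving on, you can verify your installation with a couple of quick checks. This is a minimal sketch assuming a 1.x/CDH3-style pseudo-distributed setup; the process names reported by jps vary across Hadoop versions:

# Print the installed Hadoop version to confirm the binaries are on your PATH
hadoop version

# List the running Java processes; a healthy pseudo-distributed setup
# typically shows NameNode, DataNode, JobTracker, and TaskTracker
jps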

MongoDB

The latest version of MongoDB should be installed and running. In addition, the MongoDB binaries (mongo, mongod, mongoimport) should be on your $PATH.
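
A minimal check, assuming mongod is listening on the default port (27017):

# Confirm the MongoDB binaries are on your $PATH
mongod --version
mongo --version

# Ping the running server
mongo --eval 'db.runCommand({ ping: 1 })'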

Miscellaneous

In addition to Hadoop, you will need git and JDK 1.6 installed.
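
Both are easy to verify from the command line:

git --version

# The connector build targets JDK 1.6; check the reported version
java -version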

Building MongoDB Connector for Hadoop

The MongoDB Connector for Hadoop source is available on GitHub. First, clone the repository and check out the r1.1.0 tag:

git clone https://github.com/mongodb/mongo-hadoop.git
cd mongo-hadoop
git checkout r1.1.0

Now, edit build.sbt and update the build target in hadoopRelease in ThisBuild. In this example, we’re using the CDH3 Hadoop distribution from Cloudera, so we’ll set it as follows:

hadoopRelease in ThisBuild := "cdh3"

If you’re using a different Hadoop distribution, set this to the correct value for that distribution. For a list of all supported distributions and the setting to use to build for each one, see the listing in the detailed documentation on mongo-hadoop versions.

To build the connector, use the self-bootstrapping copy of sbt that ships with the repository:

./sbt package
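
The packaged connector jar is written under core/target/; the exact file name depends on the build target set above (cdh3 in this example), so list the directory to find it:

ls core/target/*.jar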

Once the connector is built, you will need to copy it and the latest stable version of the MongoDB Java driver to your $HADOOP_HOME/lib directory. For example, if you have Hadoop installed in /usr/lib/hadoop:

wget --no-check-certificate https://github.com/downloads/mongodb/mongo-java-driver/mongo-2.7.3.jar
cp mongo-2.7.3.jar /usr/lib/hadoop/lib/
cp core/target/mongo-hadoop-core_cdh3u3-1.1.0.jar /usr/lib/hadoop/lib/
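
To confirm that Hadoop can now see both jars:

ls /usr/lib/hadoop/lib/ | grep mongo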

Examples

Load Sample Data

The MongoDB Connector for Hadoop ships with a few examples of how to use the connector in your own setup. In this guide, we’ll focus on the Treasury Yield example. To get started, first load the sample data for these examples:

mongoimport -d mongo_hadoop -c yield_historical.in examples/treasury_yield/src/main/resources/yield_historical_in.json
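
You can also count the imported documents directly from the command line:

mongo mongo_hadoop --eval 'db.yield_historical.in.count()'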

To confirm that the sample data was loaded, start the mongo shell, look for the mongo_hadoop database, and check that it contains the yield_historical.in collection:

$ mongo
MongoDB shell version: 2.4.3
connecting to: test
> show dbs
mongo_hadoop    0.078GB
> use mongo_hadoop
switched to db mongo_hadoop
> show collections
system.indexes
yield_historical.in

Treasury Yield

First let’s build the Treasury Yield example:

./sbt treasury-example/package

The Treasury Yield example comes with a bash script that simplifies setting up and running the job and lets you modify the job settings without recompiling the package each time. To use it, tweak the settings in the script to reflect your environment. Edit the script to set the input and output URIs for the job:

vim examples/treasury_yield/run_job.sh

At the beginning of the script, you will need to set the following variables:

  • INPUT_URI, the MongoDB URI from which the input data is read
  • OUTPUT_URI, the MongoDB URI to which the job output is written
  • HADOOP_HOME, the location of your Hadoop installation, which must contain a bin/hadoop path
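
For a local single-node setup, the top of the script might look like the following; the host, port, database, and collection names are the ones used in this guide, so adjust them for your environment:

# Illustrative values for a local single-node setup
INPUT_URI="mongodb://localhost:27017/mongo_hadoop.yield_historical.in"
OUTPUT_URI="mongodb://localhost:27017/mongo_hadoop.yield_historical.out"
HADOOP_HOME="/usr/lib/hadoop"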

Once this is ready, you can run the script to submit the job:

bash examples/treasury_yield/run_job.sh

This job should take only a few moments, as it processes a relatively small amount of data. When it finishes, check the output collection in MongoDB to confirm that the MapReduce job was successful:

$ mongo
MongoDB shell version: 2.0.5
connecting to: test
> use mongo_hadoop
switched to db mongo_hadoop
> db.yield_historical.out.find()
{ "_id" : 1990, "value" : 8.552400000000002 }
{ "_id" : 1991, "value" : 7.8623600000000025 }
...
has more
>

Where to go from here

Read the full documentation on the MongoDB Connector for Hadoop. To modify configuration options, you can add lines to the run_job.sh script that set properties passed to Hadoop at runtime. Read the comments in the script for details on using these options to read and write BSON files as well as MongoDB collections.
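
For example, mongo.input.query is one of the connector’s configuration properties; it restricts which documents are fed to the mappers. Below is a sketch of how it might be added where the script invokes hadoop; the jar and class placeholders stand in for whatever run_job.sh already uses, and the dayOfWeek filter matches a field present in the sample data:

# Sketch: only process documents from Fridays
hadoop jar <mongo-hadoop-and-example-jars> <job-class> \
    -D mongo.input.query='{"dayOfWeek": "FRIDAY"}' \
    -D mongo.input.uri="$INPUT_URI" \
    -D mongo.output.uri="$OUTPUT_URI"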