OPTIONS

Getting Started with Hadoop

MongoDB and Hadoop are a powerful combination and can be used together to deliver complex analytics and data processing for data stored in MongoDB. The following guide shows how you can start working with the MongoDB Connector for Hadoop. Once you become familiar with the connector, you can use it to pull your MongoDB data into Hadoop Map-Reduce jobs, process the data and return results back to a MongoDB collection.

Prerequisites

Hadoop

In order to use the following guide, you should already have Hadoop up and running. This can range from a deployed cluster containing multiple nodes or a single node pseudo-distributed Hadoop installation running locally. As long as you are able to run any of the examples on your Hadoop installation, you should be all set. The following versions of Hadoop are currently supported:

  • 0.23
  • 1.0
  • 1.1
  • 2.2
  • 2.3
  • CDH4

MongoDB

Install and run the latest version of MongoDB. In addition, the MongoDB commands should be in your system search path (i.e. $PATH).

If your mongod requires authorization, [#auth-required]_ the user account used by the Hadoop connector must have the:authaction:splitVector privilege.

To run the examples in this document, use a user that has the clusterManager role. Add these privileges to an existing reportsUser user with the following operation:

use mongo_hadoop

db.updateUser( "reportsUser",
 {
  roles : [
           { role: "readWrite", db: "mongo_hadoop" },
           { role : "clusterManager", db : "admin" }
          ]
 }
)

In practice, you should create a custom role with appropriate privileges for the user account used by the Hadoop connector, as the clusterManager role provides additional privileges. See User and Role Management Tutorials.

Miscellaneous

In addition to Hadoop, you should also have git and JDK 1.6 installed.

Building MongoDB Connector for Hadoop

The MongoDB Connector for Hadoop source is available on github. First, clone the repository. If you want build a specific version you can check out the tag for that version. For example, to build the 1.2.0 version, get the r1.2.0 tag:

git clone https://github.com/mongodb/mongo-hadoop.git
git checkout r1.2.0

The hadoop connector use gradle to build. By default, the connector will build with the latest Apache Hadoop release (currently 2.3). If you need to build for a different version, simply pass -Phadoop_version=<your version> when building. For example, to build against the CDH 4 distribution libraries, you would run this command:

./gradlew jar -Phadoop_version=cdh4

For a list of all the support distributions and what setting to use to build for each one, check the listing in the detailed documentation on mongo-hadoop versions here

Once the connector is built, you will need to copy it and the latest stable version of the MongoDB Java driver to your $HADOOP_HOME/lib directory. For example, if you have Hadoop installed in /usr/lib/hadoop:

wget --no-check-certificate https://github.com/downloads/mongodb/mongo-java-driver/mongo-2.11.4.jar
cp mongo-2.11.4.jar /usr/lib/hadoop/lib/
cp core/build/libs/mongo-hadoop-core-1.2.0-hadoop_2.3.jar /usr/lib/hadoop/lib/

Examples

The MongoDB Connector for Hadoop ships with a few examples of how to use the connector in your own setup. In this guide, we’ll focus on the Treasury Yield example. This example can be run via gradle with the following command:

./gradlew historicalYield

You should see a lot of data scroll past. To understand what’s going on, let’s break down the steps. The first thing this task does is to import examples/treasury_yield/src/main/resources/yield_historical_in.json in to mongo. You can view that data as shown below:

$ mongo mongo_hadoop
MongoDB shell version: 2.4.9
connecting to: mongo_hadoop
> show collections
system.indexes
yield_historical.in
yield_historical.out
> db.yield_historical.in.find()
{ "_id" : ISODate("1990-01-02T00:00:00Z"), "dayOfWeek" : "TUESDAY", "bc3Year" : 7.9, "bc5Year" : 7.87, "bc10Year" : 7.94, "bc20Year" : null, "bc1Month" : null, "bc2Year" : 7.87, "bc3Month" : 7.83, "bc30Year" : 8, "bc1Year" : 7.81, "bc7Year" : 7.98, "bc6Month" : 7.89 }
{ "_id" : ISODate("1990-01-03T00:00:00Z"), "dayOfWeek" : "WEDNESDAY", "bc3Year" : 7.96, "bc5Year" : 7.92, "bc10Year" : 7.99, "bc20Year" : null, "bc1Month" : null, "bc2Year" : 7.94, "bc3Month" : 7.89, "bc30Year" : 8.04, "bc1Year" : 7.85, "bc7Year" : 8.04, "bc6Month" : 7.94 }
...
has more
>

When you run the examples, gradle will download the appropriate hadoop distribution bundle, extract it, and copy over the appropriate dependencies for you. Subsequent runs will not redownload those files so it’s safe to switch between versions of hadoop. Gradle will manage all that for you so you need not worry about any of those details. It is recommend to run ./gradlew clean when changing versions of hadoop to ensure everything is built against the correct versions of libraries.

Where to go from here

Read the full documentation on the MongoDB Connector for Hadoop here. To modify configuration options, you can put additional lines in the historicalYield task to set properties passed to hadoop at runtime. Read the full comments of the script to see details on using these options to read/write from BSON as well as mongoDB collections.