Getting Started with Hadoop

MongoDB and Hadoop are a powerful combination and can be used together to deliver complex analytics and data processing for data stored in MongoDB. The following guide shows how you can start working with the MongoDB Connector for Hadoop. Once you become familiar with the connector, you can use it to pull your MongoDB data into Hadoop MapReduce jobs, process the data, and write the results back to a MongoDB collection.

Prerequisites

Hadoop

In order to use the following guide, you should already have Hadoop up and running. This can range from a deployed cluster with multiple nodes to a single-node, pseudo-distributed Hadoop installation running locally. As long as you are able to run any of the examples on your Hadoop installation (see the quick check after the version list), you should be all set. The following versions of Hadoop are currently supported:

  • 0.23
  • 1.0
  • 1.1
  • 2.2
  • 2.3
  • 2.4
  • CDH4
  • CDH5
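
A quick way to confirm that your installation can run MapReduce jobs is to submit one of the example jobs that ships with Hadoop. The jar location below is only a sketch: it reflects a typical Hadoop 2.x layout and will differ for 1.x and CDH packages, so adjust the path to match your installation.

# Print the installed Hadoop version
hadoop version

# Run the bundled "pi" estimator job (examples jar path varies by version/distribution)
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 10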

MongoDB

Install and run the latest version of MongoDB. In addition, the MongoDB commands should be in your system search path (i.e. $PATH).
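
As a quick sanity check, assuming a mongod listening on the default localhost:27017, you can verify that the binaries resolve on your path and that the server is reachable:

# Confirm the MongoDB client and server binaries are on $PATH
which mongo mongod

# Connect to the local mongod and print its version
mongo --eval 'db.version()'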

If your mongod requires authorization [1], the user account used by the Hadoop connector must have the splitVector privilege.

To run the examples in this document, use a user that has the clusterManager role. The following operation grants the role to an existing user named reportsUser (the roles array passed to updateUser replaces the user's existing roles, which is why readWrite is listed as well):

use mongo_hadoop

db.updateUser( "reportsUser",
 {
  roles : [
           { role: "readWrite", db: "mongo_hadoop" },
           { role : "clusterManager", db : "admin" }
          ]
 }
)

In practice, you should create a custom role with appropriate privileges for the user account used by the Hadoop connector, as the clusterManager role provides additional privileges. See User and Role Management Tutorials.
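
A minimal sketch of that approach, assuming the example database and user from above (the role name hadoopSplitVector is purely illustrative): create a role that grants only the splitVector action, then assign it to the connector's user.

use admin

db.createRole(
 {
  role: "hadoopSplitVector",
  privileges: [
           // splitVector scoped to the database the connector reads from
           { resource: { db: "mongo_hadoop", collection: "" }, actions: [ "splitVector" ] }
          ],
  roles: []
 }
)

use mongo_hadoop

db.grantRolesToUser( "reportsUser", [ { role: "hadoopSplitVector", db: "admin" } ] )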

[1] mongod requires clients to authenticate as a user when you specify the security.authorization or security.keyFile settings, or the corresponding --auth or --keyFile command-line options.
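
For example, a standalone mongod with access control enabled can be started as follows (the dbpath shown is only a placeholder):

# Start a standalone mongod with authorization enforcement enabled
mongod --auth --dbpath /data/db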

Miscellaneous

In addition to Hadoop, you should also have Git, Python, and JDK 1.6 installed.
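
To confirm these are available, commands like the following should succeed on your machine:

git --version
python --version
java -version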

Getting the Hadoop Connector

The MongoDB Connector for Hadoop source code and pre-built JARs are all available on its GitHub page, along with specific installation instructions in the wiki.
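
If you want to build from source, a typical workflow looks like the following sketch (the repository URL is the project's GitHub location; consult the wiki for instructions specific to your Hadoop version):

# Clone the connector repository and build the connector JARs
git clone https://github.com/mongodb/mongo-hadoop.git
cd mongo-hadoop
./gradlew jar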

Examples

The MongoDB Connector for Hadoop ships with a few examples of how to use the connector in your own setup. In this guide, we’ll focus on the Treasury Yield example. This example can be run via Gradle with the following command:

./gradlew jar testJar historicalYield

You should see a lot of data scroll past. To understand what’s going on, let’s break down the steps. The first thing this task does is import examples/treasury_yield/src/main/resources/yield_historical_in.json into MongoDB.
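
The Gradle task performs this import for you. Purely as an illustration, assuming a mongod on the default localhost port and a mongoimport-compatible (one document per line) JSON file, the equivalent manual import would look something like:

mongoimport --db mongo_hadoop --collection yield_historical.in \
    --file examples/treasury_yield/src/main/resources/yield_historical_in.json

You can view the imported data as shown below: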

$ mongo mongo_hadoop
MongoDB shell version: 2.4.9
connecting to: mongo_hadoop
> show collections
system.indexes
yield_historical.in
yield_historical.out
> db.yield_historical.in.find()
{ "_id" : ISODate("1990-01-02T00:00:00Z"), "dayOfWeek" : "TUESDAY", "bc3Year" : 7.9, "bc5Year" : 7.87, "bc10Year" : 7.94, "bc20Year" : null, "bc1Month" : null, "bc2Year" : 7.87, "bc3Month" : 7.83, "bc30Year" : 8, "bc1Year" : 7.81, "bc7Year" : 7.98, "bc6Month" : 7.89 }
{ "_id" : ISODate("1990-01-03T00:00:00Z"), "dayOfWeek" : "WEDNESDAY", "bc3Year" : 7.96, "bc5Year" : 7.92, "bc10Year" : 7.99, "bc20Year" : null, "bc1Month" : null, "bc2Year" : 7.94, "bc3Month" : 7.89, "bc30Year" : 8.04, "bc1Year" : 7.85, "bc7Year" : 8.04, "bc6Month" : 7.94 }
...
has more
>

When you run the examples, Gradle will download the appropriate Hadoop distribution bundle, extract it, and copy over the appropriate dependencies for you. Subsequent runs will not re-download those files, so it’s safe to switch between versions of Hadoop. Gradle manages all of this for you, so you need not worry about the details. It is recommended to run ./gradlew clean when changing versions of Hadoop to ensure everything is built against the correct versions of the libraries.
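
For example, after switching Hadoop versions you can rebuild from scratch and re-run the Treasury Yield example in one step:

./gradlew clean jar testJar historicalYield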

Where to go from here

Read the full documentation on the MongoDB Connector for Hadoop here. To modify configuration options, you can add lines to the historicalYield task to set properties passed to Hadoop at runtime. Read the full comments in the script for details on using these options to read from and write to BSON files as well as MongoDB collections.