OPTIONS

Bulk Inserts in MongoDB

In some situations you may need to insert or ingest a large amount of data into a MongoDB database. These bulk inserts have some special considerations that are different from other write operations.

Use the insert() Method

The insert() method, when passed an array of documents, performs a bulk insert, and inserts each document atomically. Bulk inserts can significantly increase performance by amortizing write concern costs.

New in version 2.2: insert() in the mongo shell gained support for bulk inserts in version 2.2.

In the drivers, you can configure write concern for batches rather than on a per-document level.

Drivers have a ContinueOnError option in their insert operation, so that the bulk operation will continue to insert remaining documents in a batch even if an insert fails.

Note

If multiple errors occur during a bulk insert, clients only receive the last error generated.

See also

Driver documentation for details on performing bulk inserts in your application. Also see Import and Export MongoDB Data.

Bulk Inserts on Sharded Clusters

While ContinueOnError is optional on unsharded clusters, all bulk operations to a sharded collection run with ContinueOnError, which cannot be disabled.

Large bulk insert operations, including initial data inserts or routine data import, can affect sharded cluster performance. For bulk inserts, consider the following strategies:

Pre-Split the Collection

If the sharded collection is empty, then the collection has only one initial chunk, which resides on a single shard. MongoDB must then take time to receive data, create splits, and distribute the split chunks to the available shards. To avoid this performance cost, you can pre-split the collection, as described in Split Chunks in a Sharded Cluster.

Insert to Multiple mongos

To parallelize import processes, send insert operations to more than one mongos instance. Pre-split empty collections first as described in Split Chunks in a Sharded Cluster.

Avoid Monotonic Throttling

If your shard key increases monotonically during an insert, then all inserted data goes to the last chunk in the collection, which will always end up on a single shard. Therefore, the insert capacity of the cluster will never exceed the insert capacity of that single shard.

If your insert volume is larger than what a single shard can process, and if you cannot avoid a monotonically increasing shard key, then consider the following modifications to your application:

  • Reverse the binary bits of the shard key. This preserves the information and avoids correlating insertion order with increasing sequence of values.
  • Swap the first and last 16-bit words to “shuffle” the inserts.

Example

The following example, in C++, swaps the leading and trailing 16-bit word of BSON ObjectIds generated so that they are no longer monotonically increasing.

using namespace mongo;
OID make_an_id() {
  OID x = OID::gen();
  const unsigned char *p = x.getData();
  swap( (unsigned short&) p[0], (unsigned short&) p[10] );
  return x;
}

void foo() {
  // create an object
  BSONObj o = BSON( "_id" << make_an_id() << "x" << 3 << "name" << "jane" );
  // now we may insert o into a sharded collection
}

See also

Shard Keys for information on choosing a sharded key. Also see Shard Key Internals (in particular, Choosing a Shard Key).