OPTIONS

Shard Keys

The shard key determines the distribution of the collection’s documents among the cluster’s shards. The shard key is either an indexed field or an indexed compound field that exists in every document in the collection.

MongoDB partitions data in the collection using ranges of shard key values. Each range, or chunk, defines a non-overlapping range of shard key values. MongoDB distributes the chunks, and their documents, among the shards in the cluster.

Diagram of the shard key value space segmented into smaller ranges or chunks.

Diagram of the shard key value space segmented into smaller ranges or chunks.

When a chunk grows beyond the chunk size, MongoDB splits the chunk into smaller chunks, always based on ranges in the shard key.

Considerations

Shard keys are immutable and cannot be changed after insertion. See the system limits for sharded cluster for more information.

The index on the shard key cannot be a multikey index.

Hashed Shard Keys

New in version 2.4.

Hashed shard keys use a hashed index of a single field as the shard key to partition data across your sharded cluster.

The field you choose as your hashed shard key should have a good cardinality, or large number of different values. Hashed keys work well with fields that increase monotonically like ObjectId values or timestamps.

If you shard an empty collection using a hashed shard key, MongoDB will automatically create and migrate chunks so that each shard has two chunks. You can control how many chunks MongoDB will create with the numInitialChunks parameter to shardCollection or by manually creating chunks on the empty collection using the split command.

To shard a collection using a hashed shard key, see Shard a Collection Using a Hashed Shard Key.

Tip

MongoDB automatically computes the hashes when resolving queries using hashed indexes. Applications do not need to compute hashes.

Impacts of Shard Keys on Cluster Operations

The shard key affects write and query performance by determining how the MongoDB partitions data in the cluster and how effectively the mongos instances can direct operations to the cluster. Consider the following operational impacts of shard key selection:

Write Scaling

Some possible shard keys will allow your application to take advantage of the increased write capacity that the cluster can provide, while others do not. Consider the following example where you shard by the values of the default _id field, which is ObjectId.

MongoDB generates ObjectId values upon document creation to produce a unique identifier for the object. However, the most significant bits of data in this value represent a time stamp, which means that they increment in a regular and predictable pattern. Even though this value has high cardinality, when using this, any date, or other monotonically increasing number as the shard key, all insert operations will be storing data into a single chunk, and therefore, a single shard. As a result, the write capacity of this shard will define the effective write capacity of the cluster.

A shard key that increases monotonically will not hinder performance if you have a very low insert rate, or if most of your write operations are update() operations distributed through your entire data set. Generally, choose shard keys that have both high cardinality and will distribute write operations across the entire cluster.

Typically, a computed shard key that has some amount of “randomness,” such as ones that include a cryptographic hash (i.e. MD5 or SHA1) of other content in the document, will allow the cluster to scale write operations. However, random shard keys do not typically provide query isolation, which is another important characteristic of shard keys.

New in version 2.4: MongoDB makes it possible to shard a collection on a hashed index. This can greatly improve write scaling. See Shard a Collection Using a Hashed Shard Key.

Querying

The mongos provides an interface for applications to interact with sharded clusters that hides the complexity of data partitioning. A mongos receives queries from applications, and uses metadata from the config server, to route queries to the mongod instances with the appropriate data. While the mongos succeeds in making all querying operational in sharded environments, the shard key you select can have a profound affect on query performance.

See also

The Sharded Cluster Query Routing and config server sections for a more general overview of querying in sharded environments.

Query Isolation

The fastest queries in a sharded environment are those that mongos will route to a single shard, using the shard key and the cluster meta data from the config server. For queries that don’t include the shard key, mongos must query all shards, wait for their response and then return the result to the application. These “scatter/gather” queries can be long running operations.

If your query includes the first component of a compound shard key [1], the mongos can route the query directly to a single shard, or a small number of shards, which provides better performance. Even if you query values of the shard key that reside in different chunks, the mongos will route queries directly to specific shards.

To select a shard key for a collection:

  • determine the most commonly included fields in queries for a given application
  • find which of these operations are most performance dependent.

If this field has low cardinality (i.e not sufficiently selective) you should add a second field to the shard key making a compound shard key. The data may become more splittable with a compound shard key.

See

Sharded Cluster Query Routing for more information on query operations in the context of sharded clusters.

[1]In many ways, you can think of the shard key a cluster-wide unique index. However, be aware that sharded systems cannot enforce cluster-wide unique indexes unless the unique field is in the shard key. Consider the Index Concepts page for more information on indexes and compound indexes.

Sorting

In sharded systems, the mongos performs a merge-sort of all sorted query results from the shards. See Sharded Cluster Query Routing and Use Indexes to Sort Query Results for more information.