The shard key determines the distribution of the collection’s documents among the cluster’s shards. The shard key is either an indexed field or an indexed compound field that exists in every document in the collection.
MongoDB partitions data in the collection using ranges of shard key values. Each range, or chunk, defines a non-overlapping range of shard key values. MongoDB distributes the chunks, and their documents, among the shards in the cluster.
Shard keys are immutable and cannot be changed after insertion. See the system limits for sharded cluster for more information.
The index on the shard key cannot be a multikey index.
Hashed Shard Keys¶
New in version 2.4.
The field you choose as your hashed shard key should have a good cardinality, or large number of different values. Hashed keys work well with fields that increase monotonically like ObjectId values or timestamps.
If you shard an empty collection using a hashed shard key, MongoDB will automatically create and migrate chunks so that each shard has two chunks. You can control how many chunks MongoDB will create with the numInitialChunks parameter to shardCollection or by manually creating chunks on the empty collection using the split command.
To shard a collection using a hashed shard key, see Shard a Collection Using a Hashed Shard Key.
MongoDB automatically computes the hashes when resolving queries using hashed indexes. Applications do not need to compute hashes.
Impacts of Shard Keys on Cluster Operations¶
The shard key affects write and query performance by determining how the MongoDB partitions data in the cluster and how effectively the mongos instances can direct operations to the cluster. Consider the following operational impacts of shard key selection:
Some possible shard keys will allow your application to take advantage of the increased write capacity that the cluster can provide, while others do not. Consider the following example where you shard by the values of the default _id field, which is ObjectId.
MongoDB generates ObjectId values upon document creation to produce a unique identifier for the object. However, the most significant bits of data in this value represent a time stamp, which means that they increment in a regular and predictable pattern. Even though this value has high cardinality, when using this, any date, or other monotonically increasing number as the shard key, all insert operations will be storing data into a single chunk, and therefore, a single shard. As a result, the write capacity of this shard will define the effective write capacity of the cluster.
A shard key that increases monotonically will not hinder performance if you have a very low insert rate, or if most of your write operations are update() operations distributed through your entire data set. Generally, choose shard keys that have both high cardinality and will distribute write operations across the entire cluster.
Typically, a computed shard key that has some amount of “randomness,” such as ones that include a cryptographic hash (i.e. MD5 or SHA1) of other content in the document, will allow the cluster to scale write operations. However, random shard keys do not typically provide query isolation, which is another important characteristic of shard keys.
New in version 2.4: MongoDB makes it possible to shard a collection on a hashed index. This can greatly improve write scaling. See Shard a Collection Using a Hashed Shard Key.
The mongos provides an interface for applications to interact with sharded clusters that hides the complexity of data partitioning. A mongos receives queries from applications, and uses metadata from the config server, to route queries to the mongod instances with the appropriate data. While the mongos succeeds in making all querying operational in sharded environments, the shard key you select can have a profound affect on query performance.
The fastest queries in a sharded environment are those that mongos will route to a single shard, using the shard key and the cluster meta data from the config server. For queries that don’t include the shard key, mongos must query all shards, wait for their response and then return the result to the application. These “scatter/gather” queries can be long running operations.
If your query includes the first component of a compound shard key , the mongos can route the query directly to a single shard, or a small number of shards, which provides better performance. Even if you query values of the shard key that reside in different chunks, the mongos will route queries directly to specific shards.
To select a shard key for a collection:
- determine the most commonly included fields in queries for a given application
- find which of these operations are most performance dependent.
If this field has low cardinality (i.e not sufficiently selective) you should add a second field to the shard key making a compound shard key. The data may become more splittable with a compound shard key.
Sharded Cluster Query Routing for more information on query operations in the context of sharded clusters.
|||In many ways, you can think of the shard key a cluster-wide index. However, be aware that sharded systems cannot enforce cluster-wide unique indexes unless the unique field is in the shard key. Consider the Index Concepts page for more information on indexes and compound indexes.|