Manage Chunks in a Sharded Cluster¶
This page describes various operations on chunks in sharded clusters. MongoDB automates most chunk management operations. However, these chunk management operations are accessible to administrators for use in some situations, typically surrounding initial setup, deployment, and data ingestion.
Split Chunks¶
Normally, MongoDB splits a chunk following inserts when a
chunk exceeds the chunk size. The
balancer may migrate recently split chunks to a new shard
immediately if mongos
predicts future insertions will
benefit from the move.
MongoDB treats all chunks the same, whether split manually or automatically by the system.
Warning
You cannot merge or combine chunks once you have split them.
You may want to split chunks manually if:
- you have a large amount of data in your cluster and very few chunks, as is the case after deploying a cluster using existing data.
- you expect to add a large amount of data that would initially reside in a single chunk or shard.
Example
You plan to insert a large amount of data with shard key values between 300 and 400, but all values of your shard key between 250 and 500 are in a single chunk.
Warning
Be careful when splitting data in a sharded collection to create new chunks. When you shard a collection that has existing data, MongoDB automatically creates chunks to evenly distribute the collection. To split data effectively in a sharded cluster you must consider the number of documents in a chunk and the average document size to create a uniform chunk size. When chunks have irregular sizes, shards may have an equal number of chunks but have very different data sizes. Avoid creating splits that lead to a collection with differently sized chunks.
Use sh.status() to determine the current chunk ranges across the cluster.
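For example, a minimal sketch run from a mongo shell connected to a mongos instance:

```javascript
// Print the sharding status, including the chunk ranges for each
// sharded collection in the cluster.
sh.status()
```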
To split chunks manually, use the split
command with
operators: middle
and find
. The equivalent shell helpers are
sh.splitAt()
or sh.splitFind()
.
Example
The following command will split the chunk that contains
the value of 63109
for the zipcode
field in the people
collection of the records
database:
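A sketch of what this command might look like, assuming the zipcode values are stored as numbers:

```javascript
// Split the chunk containing the first document matching this query
// into two equally sized chunks.
sh.splitFind( "records.people", { "zipcode": 63109 } )
```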
sh.splitFind()
will split the chunk that contains the
first document returned that matches this query into two equally
sized chunks. You must specify the full namespace
(i.e. “<database>.<collection>
”) of the sharded collection to
sh.splitFind()
. The query in sh.splitFind()
need
not contain the shard key, though it almost always makes sense to
query for the shard key in this case, and including the shard key will
expedite the operation.
Use sh.splitAt()
to split a chunk in two using the queried
document as the partition point:
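Continuing the previous example, a sketch of the equivalent sh.splitAt() operation:

```javascript
// Split the chunk at the shard key value 63109: documents with lower
// shard key values remain in the original chunk; the rest move to a
// new chunk whose lower bound is 63109.
sh.splitAt( "records.people", { "zipcode": 63109 } )
```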
However, the location of the document that this query finds with respect to the other documents in the chunk does not affect how the chunk splits.
Create Chunks (Pre-Splitting)¶
Pre-splitting the chunk ranges in an empty sharded collection allows clients to insert data into an already-partitioned collection. In most situations a sharded cluster will create and distribute chunks automatically without user intervention. However, in a limited number of use profiles, MongoDB cannot create enough chunks or distribute data fast enough to support required throughput. For example, if:
- you must partition an existing data collection that resides on a single shard.
- you must ingest a large volume of data into a cluster that isn’t balanced, or where the ingestion of data will lead to an imbalance of data.
This can arise during an initial data load, or when you must insert a large volume of data into a single chunk, as when inserting at the beginning or end of the chunk range with monotonically increasing or decreasing shard keys.
Preemptively splitting chunks increases cluster throughput for these operations, by reducing the overhead of migrating chunks that hold data during the write operation. MongoDB only creates splits after an insert operation and can migrate only a single chunk at a time. Chunk migrations are resource intensive and further complicated by large write volume to the migrating chunk.
Warning
You can only pre-split an empty collection. When you enable sharding for a collection that contains data, MongoDB automatically creates splits. Subsequent attempts to create splits manually can lead to unpredictable chunk ranges and sizes as well as inefficient or ineffective balancing behavior.
To create and migrate chunks manually, use the following procedure:
- Split empty chunks in your collection by manually performing the split command on chunks.
- Migrate chunks manually using the moveChunk command.
Example
To migrate all of the manually created user profiles evenly, putting each prefix chunk on the next shard from the other, run the following commands in the mongo shell:
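A minimal sketch only, assuming a myapp.users collection pre-split on an email shard key into one chunk per two-letter prefix, and four shards named shard0000 through shard0003; all of these names are illustrative:

```javascript
// Walk the two-letter prefixes and distribute their chunks across the
// shards in round-robin order. moveChunk blocks until each migration
// completes, so chunks move one at a time.
var shards = [ "shard0000", "shard0001", "shard0002", "shard0003" ];
var prefixes = "abcdefghijklmnopqrstuvwxyz";
var count = 0;
for ( var i = 0; i < prefixes.length; i++ ) {
  for ( var j = 0; j < prefixes.length; j++ ) {
    var prefix = prefixes[i] + prefixes[j];
    db.adminCommand( { moveChunk: "myapp.users",
                       find: { email: prefix },
                       to: shards[ count % shards.length ] } );
    count++;
  }
}
```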
You can also let the balancer automatically distribute the new chunks. For an introduction to balancing, see Shard Balancing. For lower level information on balancing, see Cluster Balancer.
Modify Chunk Size¶
When you initialize a sharded cluster, [1] the default chunk size is 64 megabytes. This default chunk size works well for most deployments; however, if you notice that automatic migrations are incurring a level of I/O that your hardware cannot handle, you may want to reduce the chunk size. For the automatic splits and migrations, a small chunk size leads to more rapid and frequent migrations.
To modify the chunk size, use the following procedure:
Issue the following command to switch to the config database:
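For example, from the mongo shell:

```javascript
// Switch the shell to the config database, which stores cluster metadata.
use config
```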
Issue the following save() operation, where the value of <size> reflects the new chunk size in megabytes. Here, you’re essentially writing a document whose values store the global chunk size configuration value.
Note
The chunkSize
and
--chunkSize
options, passed at runtime to the mongos,
do not affect the chunk size after you have initialized the
cluster. [1]
To eliminate confusion you should always set chunk size using the above procedure and never use the runtime options.
Modifying the chunk size has several limitations:
- Automatic splitting only occurs when inserting documents or updating existing documents.
- If you lower the chunk size it may take time for all chunks to split to the new size.
- Splits cannot be “undone.”
- If you increase the chunk size, existing chunks must grow through insertion or updates until they reach the new size.
[1] The first mongos that connects to a set of config servers initializes the sharded cluster.
Migrate Chunks¶
In most circumstances, you should let the automatic balancer migrate chunks between shards. However, you may want to migrate chunks manually in a few cases:
- If you create chunks by pre-splitting the data in your collection, you will have to migrate chunks manually to distribute chunks evenly across the shards. Use pre-splitting in limited situations, to support bulk data ingestion.
- If the balancer in an active cluster cannot distribute chunks within the balancing window, then you will have to migrate chunks manually.
For more information on how chunks move between shards, see Cluster Balancer, in particular the section Chunk Migration.
To migrate chunks, use the moveChunk
command.
Note
To return a list of shards, use the listShards command; a sketch follows this note. Specify shard names using the name argument of the addShard command. If you do not specify a name in the addShard command, MongoDB will assign a name automatically.
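A minimal sketch of listing the shards and their names, run against a mongos:

```javascript
// Return the list of shards in the cluster, including each shard's name.
db.adminCommand( { listShards: 1 } )
```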
The following example assumes that the field username
is the
shard key for a collection named users
in the myapp
database, and that the value smith
exists within the chunk
you want to migrate.
To move this chunk, you would issue the following command from a mongo
shell connected to any mongos
instance.
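A sketch of the command, using the names assumed in this example:

```javascript
// Move the chunk containing { username: "smith" } to the shard named
// mongodb-shard3.example.net; the call blocks until the migration completes.
db.adminCommand( { moveChunk: "myapp.users",
                   find: { username: "smith" },
                   to: "mongodb-shard3.example.net" } )
```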
This command moves the chunk that includes the shard key value “smith” to the
shard named mongodb-shard3.example.net
. The command will
block until the migration is complete.
See Create Chunks (Pre-Splitting) for an introduction to pre-splitting.
New in version 2.2: The moveChunk command has the _secondaryThrottle parameter. When set to true, MongoDB ensures that changes to shards as part of chunk migrations replicate to secondaries throughout the migration operation. For more information, see Require Replication before Chunk Migration (Secondary Throttle).
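For illustration, a sketch of the previous moveChunk command with the throttle enabled:

```javascript
// Require chunk migration writes to replicate to secondaries as they occur.
db.adminCommand( { moveChunk: "myapp.users",
                   find: { username: "smith" },
                   to: "mongodb-shard3.example.net",
                   _secondaryThrottle: true } )
```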
Strategies for Bulk Inserts in Sharded Clusters¶
Large bulk insert operations, including initial data ingestion or routine data import, can have a significant impact on a sharded cluster. For bulk insert operations, consider the following strategies:
If the collection does not have data, then there is only one chunk, which must reside on a single shard. MongoDB must receive data, create splits, and distribute chunks to the available shards. To avoid this performance cost, you can pre-split the collection, as described in Create Chunks (Pre-Splitting).
You can parallelize import processes by sending insert operations to more than one mongos instance. If the collection is empty, pre-split first, as described in Create Chunks (Pre-Splitting).
If your shard key increases monotonically during an insert, then all the inserts will go to the last chunk in the collection, which will always end up on a single shard. Therefore, the insert capacity of the cluster will never exceed the insert capacity of a single shard.
If your insert volume is never larger than what a single shard can process, then there is no problem; however, if the insert volume exceeds that range, and you cannot avoid a monotonically increasing shard key, then consider the following modifications to your application:
- Reverse all the bits of the shard key to preserve the information while avoiding the correlation of insertion order and increasing sequence of values.
- Swap the first and last 16-bit words to “shuffle” the inserts.
Example
The following example, in C++, swaps the leading and trailing 16-bit word of BSON ObjectIds generated so that they are no longer monotonically increasing.
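A minimal sketch of the idea, operating on a raw 12-byte ObjectId buffer; the function name and buffer-based interface are illustrative and not tied to a particular driver API:

```cpp
#include <cstdint>
#include <cstring>

// Swap the leading and trailing 16-bit words of a 12-byte ObjectId so that
// ids generated in sequence no longer sort in insertion order. The layout
// assumed here is the standard ObjectId format: a 4-byte timestamp first
// and a 3-byte counter last.
void shuffle_object_id(uint8_t id[12]) {
    uint16_t lead, trail;
    std::memcpy(&lead,  id,      sizeof(lead));
    std::memcpy(&trail, id + 10, sizeof(trail));
    std::memcpy(id,      &trail, sizeof(trail));
    std::memcpy(id + 10, &lead,  sizeof(lead));
}
```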
For information on choosing a shard key, see Shard Key Selection and see Shard Key Internals (in particular, Operations and Reliability and Choosing a Shard Key).
Note
For bulk inserts on sharded clusters, the getLastError command alone is insufficient to verify success. Applications should verify the success of bulk inserts in application logic.