Troubleshoot Sharded Clusters
This page describes common strategies for troubleshooting
sharded cluster deployments.
Application Servers or mongos Instances Become Unavailable
If each application server has its own mongos instance, other application servers can continue to access the database. Furthermore, mongos instances do not maintain persistent state, and they can restart and become unavailable without losing any state or data. When a mongos instance starts, it retrieves a copy of the config database and can begin routing queries.
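An application can also list more than one mongos in its connection string so that the driver continues routing operations if one of the routers becomes unavailable. A minimal PyMongo sketch, with illustrative host names and namespace:
```python
from pymongo import MongoClient

# Listing several mongos instances lets the driver use whichever routers
# are reachable (host names and ports are illustrative).
client = MongoClient(
    "mongodb://mongos1.example.net:27017,mongos2.example.net:27017/"
)

# Operations go through an available mongos; if one router goes down,
# the driver continues using the remaining ones.
print(client.app.orders.count_documents({}))
```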
A Single Member Becomes Unavailable in a Shard Replica Set
Replica sets provide high availability for shards. If the unavailable mongod is a primary, then the replica set will elect a new primary. If the unavailable mongod is a secondary and it disconnects, the primary and secondary will continue to hold all data. In a three-member replica set, even if a single member of the set experiences catastrophic failure, the two other members have full copies of the data.
Always investigate availability interruptions and failures. If a system is unrecoverable, replace it and add a new member to the replica set as soon as possible to restore the lost redundancy.
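For example, after provisioning a replacement mongod, you can add it to the shard's replica set with a reconfiguration. The sketch below uses PyMongo; the host names, port, replica set name, and member bookkeeping are illustrative, and the rs.add() shell helper performs the same steps:
```python
from pymongo import MongoClient

# Connect to the shard's replica set; the driver discovers the current primary.
primary = MongoClient(
    "mongodb://shard0-a.example.net:27018,shard0-b.example.net:27018/",
    replicaSet="shard0",
)

# Fetch the current configuration, append the replacement member,
# bump the version, and reconfigure.
config = primary.admin.command("replSetGetConfig")["config"]
config["members"].append({
    "_id": max(m["_id"] for m in config["members"]) + 1,
    "host": "shard0-d.example.net:27018",
})
config["version"] += 1
primary.admin.command("replSetReconfig", config)
```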
All Members of a Shard Become Unavailable
In a sharded cluster, mongod and mongos instances monitor the replica sets in the sharded cluster (e.g. shard replica sets, config server replica set).
If all members of a shard's replica set are unavailable, all data held in that shard is unavailable. However, the data on all other shards remains available, and it is possible to read and write data to those shards. Your application must be able to handle partial results, and you should investigate the cause of the interruption and attempt to recover the shard as soon as possible.
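As one way to handle partial results, PyMongo's find() accepts an allow_partial_results option that lets a scatter-gather read return data from the reachable shards instead of raising an error. A sketch with illustrative connection string and namespace:
```python
from pymongo import MongoClient

# Connect through a mongos (connection string and namespace are illustrative).
client = MongoClient("mongodb://mongos.example.net:27017/")

# With allow_partial_results=True, a query that must touch every shard returns
# data from the shards that are up instead of raising an error; the application
# must treat the result set as potentially incomplete.
for doc in client.app.orders.find({"status": "open"}, allow_partial_results=True):
    print(doc["_id"])
```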
A Config Server Replica Set Member Becomes Unavailable
Changed in version 3.2: Starting in MongoDB 3.2, config servers for sharded clusters can be deployed as a replica set. The replica set config servers must run the WiredTiger storage engine. MongoDB 3.2 deprecates the use of three mirrored mongod instances for config servers.
Replica sets provide high availability for the config servers. If an unavailable config server is a primary, then the replica set will elect a new primary.
If the config server replica set loses its primary and cannot elect a new primary, the cluster's metadata becomes read-only. You can still read and write data from the shards, but no chunk migrations or chunk splits will occur until a primary is available.
Note
Distributing replica set members across two data centers provides
benefit over a single data center. In a two data center
distribution,
- If one of the data centers goes down, the data is still available
for reads unlike a single data center distribution.
- If the data center with a minority of the members goes down, the
replica set can still serve write operations as well as read
operations.
- However, if the data center with the majority of the members goes
down, the replica set becomes read-only.
If possible, distribute members across at least three data centers.
For config server replica sets (CSRS), the best practice is to
distribute across three (or more depending on the number of members)
centers. If the cost of the third data center is prohibitive, one
distribution possibility is to evenly distribute the data bearing
members across the two data centers and store the remaining member
in the cloud if your company policy allows.
Note
All config servers must be running and available when you first initiate
a sharded cluster.
MongoDB 3.2
In a sharded cluster, mongod and mongos instances monitor the replica sets in the sharded cluster (e.g. shard replica sets, config server replica set). If no member of the config server replica set is reachable by a mongos instance or a shard member:
- For MongoDB 3.2.0-3.2.9, if the number of consecutive unsuccessful attempts exceeds the replMonitorMaxFailedChecks parameter value, the monitoring mongos or mongod instance marks the replica set as unavailable and becomes unusable until you restart the instance. However, you can avoid having to restart the mongos or the shard replica set member mongod in this situation by setting replMonitorMaxFailedChecks to the value 2147483647 when you start up these instances.
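For example, the parameter might be set on the command line at startup as shown below; the exact invocation and any other options are illustrative, and setParameter values can also be placed in the instance's configuration file:
```sh
mongos --setParameter replMonitorMaxFailedChecks=2147483647
mongod --setParameter replMonitorMaxFailedChecks=2147483647
```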
Important
The parameter setting does not persist across restarts.
- For MongoDB 3.2.10 or later 3.2-series, by default you do not need to restart the instance unless timeOutMonitoringReplicaSets is set to true.
Cursor Fails Because of Stale Config Data
A query returns a stale config warning when one or more of the mongos instances has not yet updated its cache of the cluster's metadata from the config database.
This warning should not propagate back to your application. The warning will repeat until all the mongos instances refresh their caches. To force an instance to refresh its cache, run the flushRouterConfig command.
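For example, with PyMongo you can issue the command against the stale mongos directly; the host name and port are illustrative:
```python
from pymongo import MongoClient

# Connect to the specific mongos whose metadata cache is stale.
mongos = MongoClient("mongodb://mongos1.example.net:27017/")

# flushRouterConfig forces the mongos to reload the cluster metadata
# from the config servers.
mongos.admin.command("flushRouterConfig")
```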
Shard Keys and Cluster Availability
The most important considerations when choosing a shard key are:
- to ensure that MongoDB will be able to distribute data evenly among shards,
- to scale writes across the cluster, and
- to ensure that mongos can isolate most queries to a specific mongod.
Furthermore:
- Each shard should be a replica set; if a specific mongod instance fails, the replica set members will elect another to be primary and continue operation. However, if an entire shard is unreachable or fails for some reason, that data will be unavailable.
- If the shard key allows the mongos to isolate most operations to a single shard, then the failure of a single shard will only render some data unavailable.
- If your shard key distributes data required for every operation throughout the cluster, then the failure of the entire shard will render the entire cluster unavailable.
In essence, this concern for reliability simply underscores the
importance of choosing a shard key that isolates query operations to a
single shard.
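As an illustration of the targeting point above, the sketch below shards a collection on a field that most queries include, so mongos can route those queries to a single shard. The database, collection, and field names are illustrative:
```python
from pymongo import MongoClient

# Connect through a mongos (host and namespace are illustrative).
client = MongoClient("mongodb://mongos.example.net:27017/")

# Enable sharding on the database, then shard the collection on a key
# that most queries include.
client.admin.command("enableSharding", "app")
client.admin.command("shardCollection", "app.orders", key={"customerId": 1})

# Because the query includes the shard key, mongos can isolate it to one
# shard; if a different shard is down, this query is unaffected.
client.app.orders.find_one({"customerId": 12345, "status": "open"})
```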
Config Database String Error
Starting in MongoDB 3.2, config servers can be deployed as replica sets. The mongos instances for the sharded cluster must specify the same config server replica set name but can specify the hostname and port of different members of the replica set.
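For example, two mongos instances might be started with the same config server replica set name but different seed members; the replica set name, host names, and port below are illustrative:
```sh
mongos --configdb csReplSet/cfg1.example.net:27019,cfg2.example.net:27019,cfg3.example.net:27019
mongos --configdb csReplSet/cfg2.example.net:27019,cfg3.example.net:27019
```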
Starting in 3.4, the use of the deprecated mirrored mongod instances as config servers (SCCC) is no longer supported. Before you can upgrade your sharded clusters to 3.4, you must convert your config servers from SCCC to CSRS.
To convert your config servers from SCCC to CSRS, see the MongoDB 3.4
manual Upgrade Config Servers to Replica Set.
For sharded clusters that use the earlier topology of three mirrored mongod instances for config servers, the mongos instances in the sharded cluster must specify an identical configDB string.
Avoid Downtime when Moving Config Servers
Use CNAMEs to identify your config servers to the cluster so
that you can rename and renumber your config servers without downtime.
moveChunk commit failed Error
At the end of a chunk migration, the shard must connect to the config database to update the chunk's record in the cluster metadata. If the shard fails to connect to the config database, MongoDB reports a moveChunk commit failed error.
When this happens, the primary member of the shard’s replica
set then terminates to protect data consistency. If a secondary
member can access the config database, data on the shard becomes
accessible again after an election.
You will need to resolve the chunk migration failure independently. If you encounter this issue, ask the MongoDB Community or MongoDB Support for help.