This document provides a more in-depth explanation of the internals and operation of replica set features. This material is not necessary for normal operation or application development but may be useful for troubleshooting and for further understanding MongoDB’s behavior and approach.
For additional information about the internals of replica set replication, see the following resources in the MongoDB Manual:
For an explanation of the oplog, see Oplog.
All members of a replica set send heartbeats (pings) to all other members in the set and can import operations to the local oplog from any other member in the set.
Replica set oplog operations are idempotent. The following operations require idempotency:
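To illustrate what idempotency means here, the following minimal sketch applies two kinds of operation twice each. The operation format (`kind`, `field`, `amount`, `value`) is hypothetical and is not MongoDB's actual oplog format. It only shows that a relative change gives a different result when replayed, while an operation that records the resulting value can be applied any number of times with the same outcome.

```python
import copy


def apply_op(doc, op):
    """Apply one simplified, hypothetical log operation to a document (a dict)."""
    result = copy.deepcopy(doc)
    if op["kind"] == "set":
        # Absolute assignment: replaying it leaves the document unchanged.
        result[op["field"]] = op["value"]
    elif op["kind"] == "inc":
        # Relative change: replaying it changes the document again.
        result[op["field"]] = result.get(op["field"], 0) + op["amount"]
    else:
        raise ValueError("unknown operation kind: %r" % op["kind"])
    return result


doc = {"_id": 1, "views": 10}
inc_op = {"kind": "inc", "field": "views", "amount": 1}
set_op = {"kind": "set", "field": "views", "value": 11}

once_inc = apply_op(doc, inc_op)
twice_inc = apply_op(once_inc, inc_op)   # differs from once_inc: not idempotent

once_set = apply_op(doc, set_op)
twice_set = apply_op(once_set, set_op)   # equals once_set: idempotent
```

A log made only of operations of the second kind can be replayed safely, for example after an interrupted sync.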
MongoDB uses single-master replication to ensure that the database remains consistent. However, clients may modify the read preferences on a per-connection basis in order to distribute read operations to the secondary members of a replica set. Read-heavy deployments may achieve greater query throughput by distributing reads to secondary members. But keep in mind that replication is asynchronous; therefore, reads from secondaries may not always reflect the latest writes to the primary.
Use the output from db.getReplicationInfo() run on a secondary member to assess the current state of replication and determine if there is any unintended replication delay.
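As a rough illustration of what "replication delay" means, the sketch below computes the gap between the time of the last operation on the primary and the time of the last operation applied on a secondary. It does not call MongoDB. The function name and the two timestamps are assumptions made for the example, and you would take the real values from the replication status output of your deployment.

```python
from datetime import datetime, timedelta


def replication_lag(primary_last_op, secondary_last_op):
    """Return how far a secondary trails the primary, never less than zero."""
    lag = primary_last_op - secondary_last_op
    # Small clock differences can make the raw value negative; clamp it.
    return max(lag, timedelta(0))


# Hypothetical timestamps of the most recent oplog entry on each member.
primary_last_op = datetime(2013, 1, 1, 12, 0, 30)
secondary_last_op = datetime(2013, 1, 1, 12, 0, 12)

lag = replication_lag(primary_last_op, secondary_last_op)
lag_seconds = lag.total_seconds()
```

A lag that stays small and stable is usually expected with asynchronous replication. A lag that keeps growing suggests the secondary cannot keep up.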
Replica sets can include members with the following four special configurations that affect membership behavior:
In almost every case, replica sets simplify the process of administering database replication. However, replica sets still have a unique set of administrative requirements and concerns. Choosing the right system architecture for your data set is crucial.
Administrators of replica sets also have unique monitoring and security concerns. The replica set functions in the mongo shell provide the tools necessary for replica set administration. In particular, use rs.conf() to return a document that holds the replica set configuration, and use rs.reconfig() to modify the configuration of an existing replica set.
Elections are the process replica set members use to select which member should become primary. A primary is the only member in the replica set that can accept write operations, including insert(), update(), and remove().
The following events can trigger an election:
In the default configuration, all members have an equal chance of becoming primary; however, it’s possible to set priority values that weight the election. In some architectures, there may be operational reasons for making a specific replica set member more or less likely to become primary. For instance, you may not want a member located in a remote data center to become primary. See Member Priority for more information.
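The following sketch shows how priority values might appear in a configuration document shaped like the one rs.conf() returns, and which members remain eligible to become primary. The set name and host names are hypothetical, and the helper function is an illustration rather than part of any MongoDB API. It assumes that a member without an explicit priority has priority 1 and that a member with priority 0 never becomes primary.

```python
# Hypothetical configuration document; only the fields needed here are shown.
config = {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "dc1-a.example.net:27017"},                 # default priority
        {"_id": 1, "host": "dc1-b.example.net:27017", "priority": 2},  # preferred
        {"_id": 2, "host": "remote.example.net:27017", "priority": 0}, # never primary
    ],
}


def eligible_for_primary(cfg):
    """Return hosts that may become primary, highest priority first."""
    candidates = [m for m in cfg["members"] if m.get("priority", 1) > 0]
    candidates.sort(key=lambda m: m.get("priority", 1), reverse=True)
    return [m["host"] for m in candidates]


eligible = eligible_for_primary(config)
```

Here the member in the remote data center stays in the set and replicates data, but is excluded from the list of candidates.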
Any member of a replica set can veto an election, even if the member is a non-voting member.
A member of the set will veto an election under the following conditions:
The first member to receive votes from a majority of members in a set becomes the next primary until the next election. Be aware of the following conditions and possible situations:
Remember that hidden and delayed imply secondary-only configuration.
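The majority requirement for elections can be made concrete with a small calculation. The sketch below is a simplified model, not the election protocol itself. It assumes each voting member casts one vote and that "majority" means strictly more than half of the voting members. The function names are chosen for this example.

```python
def votes_needed(voting_members):
    """Smallest number of votes that is strictly more than half of the set."""
    return voting_members // 2 + 1


def has_majority(votes_received, voting_members):
    """True if a candidate has enough votes to become primary."""
    return votes_received >= votes_needed(voting_members)


three_member_set = votes_needed(3)
four_member_set = votes_needed(4)
five_member_set = votes_needed(5)
```

Under this model, a four-member set needs as many votes as a five-member set, so adding a fourth member to a three-member set does not let the set survive any more member failures.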
In MongoDB 2.0, secondaries only change sync targets if the connection to the sync target drops or produces an error.
If you have two secondary members in one data center and a primary in a second facility, and if you start all three instances at roughly the same time (i.e. with no existing data sets or oplog), both secondaries will likely sync from the primary, as neither secondary has more recent oplog entries.
If you restart one of the secondaries, then when it rejoins the set it will likely begin syncing from the other secondary, because of proximity.
If you have a primary in one facility and a secondary in an alternate facility, and if you add another secondary to the alternate facility, the new secondary will likely sync from the existing secondary because it is closer than the primary.
In MongoDB 2.2, secondaries also use the following additional sync behaviors:
Secondaries will stop syncing from a member if the connection used to poll oplog entries is unresponsive for 30 seconds. If a connection times out, the member may select a new member to sync from.
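The sync behavior described above can be approximated with a short sketch. This is an assumed, simplified model and not MongoDB's actual selection logic: among reachable members whose oplog is ahead of the local one, it picks the closest by ping time, and it treats a connection as timed out after 30 seconds without a response. The `Member` class, its fields, and both functions are hypothetical.

```python
from dataclasses import dataclass

UNRESPONSIVE_TIMEOUT_SECS = 30  # threshold taken from the text above


@dataclass
class Member:
    name: str
    ping_ms: float       # measured round-trip time to this member
    last_optime: int     # position of the newest oplog entry it holds
    reachable: bool = True


def choose_sync_target(my_optime, candidates):
    """Pick the closest reachable member that has newer oplog entries."""
    ahead = [m for m in candidates if m.reachable and m.last_optime > my_optime]
    if not ahead:
        return None
    return min(ahead, key=lambda m: m.ping_ms)


def should_pick_new_target(seconds_without_response):
    """True once the polling connection has been silent for the timeout."""
    return seconds_without_response >= UNRESPONSIVE_TIMEOUT_SECS


# A new secondary in the alternate facility: the existing secondary is nearby,
# the primary is far away, and both hold newer entries than the new member.
members = [
    Member("primary", ping_ms=80.0, last_optime=120),
    Member("secondary-b", ping_ms=2.0, last_optime=118),
]
target = choose_sync_target(my_optime=0, candidates=members)
```

In this scenario the sketch chooses the nearby secondary over the distant primary, which matches the proximity behavior described earlier.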
MongoDB applies write operations in batches using a multithreaded approach. The replication process divides each batch among a group of threads which apply many operations with greater concurrency.
Even though threads may apply operations out of order, a client reading data from a secondary will never see documents that reflect an in-between state that never existed on the primary. To ensure this consistency, MongoDB blocks all read operations while applying the batch of operations.
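The combination of multithreaded batch application and blocked reads can be sketched as follows. This is an illustrative model under stated assumptions, not MongoDB's implementation. The `SecondaryStore` class is hypothetical, operations are simplified to (document id, new value) pairs, and the batch is partitioned by document id so that operations on the same document stay in order. A single lock stands in for the mechanism that blocks reads.

```python
import threading
from collections import defaultdict


class SecondaryStore:
    """Toy in-memory store that applies batches while readers wait."""

    def __init__(self):
        self._docs = {}
        # Simplification: one lock, so readers also exclude each other.
        self._lock = threading.Lock()

    def read(self, doc_id):
        # Blocks while a batch is being applied, so no partial batch is visible.
        with self._lock:
            return self._docs.get(doc_id)

    def apply_batch(self, ops, workers=4):
        # Assumed partitioning: all operations on one document go to one
        # worker, which applies them in their original order.
        buckets = defaultdict(list)
        for doc_id, value in ops:
            buckets[hash(doc_id) % workers].append((doc_id, value))

        def run(bucket):
            for doc_id, value in bucket:
                self._docs[doc_id] = value

        with self._lock:
            threads = [threading.Thread(target=run, args=(b,))
                       for b in buckets.values()]
            for t in threads:
                t.start()
            for t in threads:
                t.join()


store = SecondaryStore()
store.apply_batch([("a", 1), ("b", 2), ("a", 3), ("c", 4)])
value_a = store.read("a")
value_b = store.read("b")
```

Different documents may be written in any relative order, but because reads wait until every worker has finished, a reader only observes the state after the whole batch.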
To help improve the performance of operation application, MongoDB fetches all the memory pages that hold data and indexes that the operations in the batch will affect. The prefetch stage minimizes the amount of time MongoDB must hold the write lock to apply operations. See the replIndexPrefetch setting to modify the index fetching behavior.
By default, secondaries will in most cases prefetch the indexes associated with the affected document to improve replication throughput.
You can limit this feature to prefetch only the index on the _id field, or you can disable this feature entirely. For more information, see replIndexPrefetch.