New in version 2.1.
The MongoDB aggregation framework provides a means to calculate aggregated values without having to use map-reduce. While map-reduce is powerful, it is often more difficult than necessary for many simple aggregation tasks, such as totaling or averaging field values.
If you’re familiar with SQL, the aggregation framework provides similar functionality to GROUP BY and related SQL operators as well as simple forms of “self joins.” Additionally, the aggregation framework provides projection capabilities to reshape the returned data. Using the projections in the aggregation framework, you can add computed fields, create new virtual sub-objects, and extract sub-fields into the top-level of results.
See also
Consider Aggregation Framework Examples and Aggregation Framework Reference for more documentation.
This section provides an introduction to the two concepts that underpin the aggregation framework: pipelines and expressions.
Conceptually, documents from a collection pass through an aggregation pipeline, which transforms these objects as they pass through. For those familiar with UNIX-like shells (e.g. bash), the concept is analogous to the pipe (i.e. |) used to string text filters together.
In a shell environment the pipe redirects a stream of characters from the output of one process to the input of the next. The MongoDB aggregation pipeline streams MongoDB documents from one pipeline operator to the next to process the documents. Pipeline operators can be repeated in the pipe.
All pipeline operators process a stream of documents and the pipeline behaves as if the operation scans a collection and passes all matching documents into the “top” of the pipeline. Each operator in the pipeline transforms each document as it passes through the pipeline.
Note
Pipeline operators need not produce one output document for every input document: operators may also generate new documents or filter out documents.
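The pipe analogy can be sketched in plain JavaScript. This is only an illustrative model, not the server's implementation: the helper names (runPipeline, keepPopular, onePerTag) are hypothetical, and each stage is treated as a function from an array of documents to an array of documents.

```javascript
// Illustrative model only: these helpers are hypothetical and are not part
// of the MongoDB API. Each stage maps an array of documents to a new array.
function runPipeline(docs, stages) {
  // Feed the output of each stage into the next, like a shell pipe.
  return stages.reduce((stream, stage) => stage(stream), docs);
}

// A filtering stage: may emit fewer documents than it receives.
const keepPopular = (docs) => docs.filter((d) => d.pageViews >= 5);

// An expanding stage: emits one document per tag, so it may emit more
// documents than it receives.
const onePerTag = (docs) =>
  docs.flatMap((d) => d.tags.map((tag) => ({ author: d.author, tag })));

const input = [
  { author: "bob", pageViews: 5, tags: ["fun", "good"] },
  { author: "sam", pageViews: 1, tags: ["dull"] },
];

const output = runPipeline(input, [keepPopular, onePerTag]);
console.log(output);
```

Here two input documents become two output documents, but not one-for-one: the filter drops one document and the expansion splits the other into one document per tag.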
Warning
The pipeline cannot operate on values of the following types: Binary, Symbol, MinKey, MaxKey, DBRef, Code, and CodeWScope.
Expressions produce output documents based on calculations performed on input documents. The aggregation framework defines expressions in a document format that uses prefixes.
Expressions are stateless and are only evaluated when seen by the aggregation process. With one exception, aggregation expressions can only operate on the current document in the pipeline, and cannot integrate data from other documents.
The exception is the accumulator expressions used in the $group operator, which maintain state (e.g. totals, maximums, minimums, and related data) as documents progress through the pipeline.
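The difference between a stateless expression and an accumulator can be illustrated with a small plain-JavaScript sketch. The function names and the sample data are assumptions made for illustration and do not correspond to any MongoDB API.

```javascript
// Stateless expression: the result depends only on the current document.
function withDoubledViews(doc) {
  return { author: doc.author, doubled: doc.pageViews * 2 };
}

// Accumulator: keeps a running total per group key as documents arrive,
// so its result depends on every document seen for that key.
function totalViewsByAuthor(docs) {
  const totals = new Map();
  for (const doc of docs) {
    totals.set(doc.author, (totals.get(doc.author) || 0) + doc.pageViews);
  }
  return Array.from(totals, ([author, totalViews]) => ({ _id: author, totalViews }));
}

// Hypothetical sample data.
const docs = [
  { author: "bob", pageViews: 5 },
  { author: "bob", pageViews: 7 },
  { author: "sam", pageViews: 3 },
];

const projected = docs.map(withDoubledViews);
const grouped = totalViewsByAuthor(docs);
console.log(projected, grouped);
```

The first function could run on any single document in isolation, while the second only gives a meaningful answer once it has seen the whole input.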
See also
Aggregation expressions for additional examples of the expressions provided by the aggregation framework.
Invoke an aggregation operation with the aggregate() wrapper in the mongo shell or the aggregate database command. Always call aggregate() on a collection object that determines the input documents of the aggregation pipeline. The arguments to the aggregate() method specify a sequence of pipeline operators, where each operator may have a number of operands.
First, consider a collection of documents named articles using the following format:
{
  title : "this is my title" ,
  author : "bob" ,
  posted : new Date () ,
  pageViews : 5 ,
  tags : [ "fun" , "good" , "fun" ] ,
  comments : [
    { author : "joe" , text : "this is cool" } ,
    { author : "sam" , text : "this is bad" }
  ],
  other : { foo : 5 }
}
The following example aggregation operation pivots data to create a set of author names grouped by tags applied to an article. Call the aggregation framework by issuing the following command:
db.articles.aggregate(
  { $project : {
    author : 1,
    tags : 1
  } },
  { $unwind : "$tags" },
  { $group : {
    _id : { tags : "$tags" },
    authors : { $addToSet : "$author" }
  } }
);
The aggregation pipeline begins with the collection articles and selects the author and tags fields using the $project aggregation operator. The $unwind operator produces one output document per tag. Finally, the $group operator pivots these fields.
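To make the shape of the pivoted output concrete, the three stages can be emulated in plain JavaScript. This is a sketch of the logical effect only, not of how the server executes the pipeline. The second article is a hypothetical addition, and the emulation omits the _id field that a real $project keeps by default.

```javascript
const articles = [
  { title: "this is my title", author: "bob", pageViews: 5, tags: ["fun", "good", "fun"] },
  // Hypothetical second article, added so that one tag has two authors.
  { title: "another title", author: "joe", pageViews: 2, tags: ["good"] },
];

// $project: keep only the author and tags fields.
const projected = articles.map(({ author, tags }) => ({ author, tags }));

// $unwind: one output document per element of the tags array.
const unwound = projected.flatMap(({ author, tags }) =>
  tags.map((tag) => ({ author, tags: tag }))
);

// $group with $addToSet: collect the distinct authors for each tag.
const groups = new Map();
for (const { author, tags } of unwound) {
  if (!groups.has(tags)) groups.set(tags, new Set());
  groups.get(tags).add(author);
}
const result = Array.from(groups, ([tag, authors]) => ({
  _id: { tags: tag },
  authors: Array.from(authors),
}));

console.log(JSON.stringify(result));
```

The duplicate "fun" tag on the first article produces two unwound documents, but the set semantics of $addToSet mean that "bob" appears only once in that group. The order of elements in a real $addToSet array is not guaranteed.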
The aggregation operation in the previous section returns a document with two fields:
Changed in version 2.4: If an error occurs, the aggregate() helper throws an exception. In previous versions, the helper returned a document with the error message and code, and an ok status field not equal to 1, the same as the aggregate command.
As a document, the result is subject to the BSON Document size limit, which is currently 16 megabytes.
Because you will always call aggregate on a collection object, which logically inserts the entire collection into the aggregation pipeline, you may want to optimize the operation by avoiding scanning the entire collection whenever possible.
Depending on the order in which they appear in the pipeline, aggregation operators can take advantage of indexes.
The following pipeline operators take advantage of an index when they occur at the beginning of the pipeline:
The above operators can also use an index when placed before the following aggregation operators:
New in version 2.4.
The $geoNear pipeline operator takes advantage of a geospatial index. When using $geoNear, the $geoNear pipeline operation must appear as the first stage in an aggregation pipeline.
If your aggregation operation requires only a subset of the data in a collection, use the $match operator to restrict which items go into the top of the pipeline, as in a query. When placed early in a pipeline, these $match operations use suitable indexes to scan only the matching documents in a collection.
Placing a $match pipeline stage followed by a $sort stage at the start of the pipeline is logically equivalent to a single query with a sort, and can use an index.
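As a sketch of that ordering, the pipeline below is written as plain data using field names from the articles example. The index mentioned in the comments is an assumption for illustration, and whether the shell helper accepts the stages as a single array or only as separate arguments may depend on the shell version.

```javascript
// Pipeline as plain data; the field names come from the articles example.
const pipeline = [
  { $match: { author: "bob" } },          // first: restrict the input, as a query would
  { $sort: { posted: -1 } },              // second: sort only the matching documents
  { $project: { title: 1, posted: 1 } },  // later stages see only the reduced stream
];

// In the mongo shell, assuming a suitable index such as
// { author: 1, posted: -1 } exists on the collection:
//   db.articles.aggregate(pipeline);
// The first two stages are logically comparable to:
//   db.articles.find({ author: "bob" }).sort({ posted: -1 });

console.log(pipeline.map((stage) => Object.keys(stage)[0]).join(" -> "));
```

Moving the $match after the $project or $group in this sketch would leave the result of a pure filter on author unchanged only if author survived those stages, and it would give up the chance to use the index.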
In future versions there may be an optimization phase in the pipeline that reorders the operations to increase performance without affecting the result. However, at this time place $match operators at the beginning of the pipeline when possible.
Changed in version 2.4.
Aggregation operations have an optimization phase which attempts to re-arrange the pipeline for improved performance.
When you have a sequence of $sort followed by a $skip followed by a $limit, an optimization occurs whereby the $limit moves in front of the $skip. For example, if the pipeline consists of the following stages:
{ $sort: { age : -1 } },
{ $skip: 10 },
{ $limit: 5 }
During the optimization phase, the optimizer transforms the sequence to the following:
{ $sort: { age : -1 } },
{ $limit: 15 },
{ $skip: 10 }
When you have a continuous sequence of $limit pipeline stages followed by $skip pipeline stages, the aggregation will attempt to re-arrange the pipeline stages to combine the limits together and the skips together. For example, if the pipeline consists of the following stages:
{ $limit: 100 },
{ $skip: 5 },
{ $limit: 10 },
{ $skip: 2 }
In an intermediate step, the optimizer swaps each $skip that is followed by a $limit, so that the $limit comes before the $skip:
{ $limit: 100 },
{ $limit: 15 },
{ $skip: 5 },
{ $skip: 2 }
The second $limit value has increased from 10 to 15, the sum of its initial value and the value of the $skip that preceded it. Then, for the final $limit value, the optimizer selects the minimum between the adjacent $limit values. For the final $skip value, the optimizer adds the adjacent $skip values, to transform the sequence to the following:
{ $limit: 15 },
{ $skip: 7 }
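The arithmetic of this rewrite can be checked with a short plain-JavaScript sketch. The coalesce() and run() helpers are hypothetical and model only $limit and $skip stages; they are not the server's optimizer.

```javascript
// Hypothetical helper: applies the three rewrite rules described above to a
// list of { $limit: n } and { $skip: n } stages until none of them applies.
function coalesce(stages) {
  const out = stages.map((s) => ({ ...s }));
  let i = 0;
  while (i + 1 < out.length) {
    const a = out[i];
    const b = out[i + 1];
    if ("$skip" in a && "$limit" in b) {
      // $skip s, $limit l  becomes  $limit (l + s), $skip s
      out.splice(i, 2, { $limit: b.$limit + a.$skip }, { $skip: a.$skip });
      i = 0;
    } else if ("$limit" in a && "$limit" in b) {
      // Adjacent limits: keep the smaller one.
      out.splice(i, 2, { $limit: Math.min(a.$limit, b.$limit) });
      i = 0;
    } else if ("$skip" in a && "$skip" in b) {
      // Adjacent skips: add them.
      out.splice(i, 2, { $skip: a.$skip + b.$skip });
      i = 0;
    } else {
      i++;
    }
  }
  return out;
}

// Hypothetical helper: applies $limit/$skip stages to an in-memory array.
function run(stages, items) {
  return stages.reduce(
    (xs, s) => ("$limit" in s ? xs.slice(0, s.$limit) : xs.slice(s.$skip)),
    items
  );
}

const original = [{ $limit: 100 }, { $skip: 5 }, { $limit: 10 }, { $skip: 2 }];
const optimized = coalesce(original);
console.log(JSON.stringify(optimized)); // [{"$limit":15},{"$skip":7}]

// Both forms select the same items from a sample input of 200 numbers.
const items = Array.from({ length: 200 }, (_, n) => n);
console.log(JSON.stringify(run(original, items)) === JSON.stringify(run(optimized, items))); // true
```

Running both forms against the same input is a quick way to confirm that the rewrite changes only the work done, not the result.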
Certain pipeline operators require access to the entire input set before they can produce any output. For example, $sort must receive all of the input from the preceding pipeline operator before it can produce its first output document. The current implementation of $sort does not go to disk in these cases: in order to sort the contents of the pipeline, the entire input must fit in memory.
Changed in version 2.4: When a $sort immediately precedes a $limit in the pipeline, the $sort operation only maintains the top n results as it progresses, where n is the specified limit. Before 2.4, $sort would sort all the results in memory, and then limit the results to n results.
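The idea of maintaining only the top n results can be sketched as follows. The source does not describe the data structure the server uses, so the sorted-buffer approach and the topN() helper here are assumptions chosen for brevity.

```javascript
// Hypothetical helper: keeps at most n documents in sorted order while
// streaming over the input, instead of sorting everything and truncating.
// compare(a, b) < 0 means that a sorts before b.
function topN(stream, n, compare) {
  const best = [];
  for (const doc of stream) {
    // Walk back from the end to find where doc belongs in the buffer.
    let pos = best.length;
    while (pos > 0 && compare(doc, best[pos - 1]) < 0) pos--;
    if (pos < n) {
      best.splice(pos, 0, doc);
      if (best.length > n) best.pop(); // discard the document that fell out of the top n
    }
  }
  return best;
}

// Equivalent in effect to { $sort: { age: -1 } } followed by { $limit: 3 }.
const people = [{ age: 30 }, { age: 50 }, { age: 20 }, { age: 40 }, { age: 60 }];
const oldest = topN(people, 3, (a, b) => b.age - a.age);
console.log(oldest.map((p) => p.age)); // [ 60, 50, 40 ]
```

In this sketch the buffer never holds more than n + 1 documents at a time, which is the memory saving that the change describes.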
$group has similar characteristics: Before any $group passes its output along the pipeline, it must receive the entirety of its input. For the $group operator, this frequently does not require as much memory as $sort, because it only needs to retain one record for each unique key in the grouping specification.
The current implementation of the aggregation framework logs a warning if a cumulative operator consumes 5% or more of the physical memory on the host. Cumulative operators produce an error if they consume 10% or more of the physical memory on the host.
Note
Changed in version 2.1.
Some aggregation operations using aggregate will cause mongos instances to require more CPU resources than in previous versions. This modified performance profile may dictate alternate architectural decisions if you use the aggregation framework extensively in a sharded environment.
The aggregation framework is compatible with sharded collections.
When operating on a sharded collection, the aggregation pipeline is split into two parts. The aggregation framework pushes all of the operators up to the first $group or $sort operation to each shard. [1] Then, a second pipeline on the mongos runs. This pipeline consists of the first $group or $sort and any remaining pipeline operators, and runs on the results received from the shards.
The $group operator brings in any “sub-totals” from the shards and combines them: in some cases these may be structures. For example, the $avg expression maintains a total and count for each shard; mongos combines these values and then divides.
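A short sketch shows why the average is carried as a total and a count rather than as a finished value. The partialAvg() and combineAvg() helpers and the sample values are hypothetical; they illustrate the arithmetic only.

```javascript
// Hypothetical helper: what each shard would report for one group.
function partialAvg(values) {
  return { total: values.reduce((sum, v) => sum + v, 0), count: values.length };
}

// Hypothetical helper: what the merging side would do with the partials.
function combineAvg(partials) {
  const total = partials.reduce((sum, p) => sum + p.total, 0);
  const count = partials.reduce((sum, p) => sum + p.count, 0);
  return count === 0 ? null : total / count;
}

const shardA = partialAvg([5, 7]); // { total: 12, count: 2 }
const shardB = partialAvg([3]);    // { total: 3, count: 1 }

const combined = combineAvg([shardA, shardB]);
console.log(combined); // 5

// Averaging the per-shard averages instead would give (6 + 3) / 2 = 4.5,
// which is not the average of the three underlying values.
const naive = (shardA.total / shardA.count + shardB.total / shardB.count) / 2;
console.log(naive); // 4.5
```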
[1] If an early $match can exclude shards through the use of the shard key in the predicate, then these operators are only pushed to the relevant shards.
Aggregation operations with the aggregate command have the following limitations: