Aggregation Framework
New in version 2.1.
Overview
The MongoDB aggregation framework provides a means to calculate aggregated values without having to use map-reduce. While map-reduce is powerful, it is often more difficult than necessary for many simple aggregation tasks, such as totaling or averaging field values.
If you’re familiar with SQL, the aggregation framework provides similar functionality to GROUP BY and related SQL operators, as well as simple forms of “self joins.” Additionally, the aggregation framework provides projection capabilities to reshape the returned data. Using the projections in the aggregation framework, you can add computed fields, create new virtual sub-objects, and extract sub-fields into the top level of results.
See also
A presentation from MongoSV 2011: MongoDB’s New Aggregation Framework.
Additionally, consider Aggregation Framework Examples and Aggregation Framework Reference for more documentation.
Framework Components
This section provides an introduction to the two concepts that underpin the aggregation framework: pipelines and expressions.
Pipelines
Conceptually, documents from a collection pass through an aggregation pipeline, which transforms these objects as they pass through. For those familiar with UNIX-like shells (e.g. bash), the concept is analogous to the pipe (i.e. |) used to string text filters together.
In a shell environment, the pipe redirects a stream of characters from the output of one process to the input of the next. The MongoDB aggregation pipeline streams MongoDB documents from one pipeline operator to the next to process the documents. Pipeline operators can be repeated in the pipeline.
All pipeline operators process a stream of documents and the pipeline behaves as if the operation scans a collection and passes all matching documents into the “top” of the pipeline. Each operator in the pipeline transforms each document as it passes through the pipeline.
Note
Pipeline operators need not produce one output document for every input document: operators may also generate new documents or filter out documents.
Warning
The pipeline cannot operate on values of the following types: Binary, Symbol, MinKey, MaxKey, DBRef, Code, and CodeWScope.
Expressions
Expressions produce output documents based on calculations performed on input documents. The aggregation framework defines expressions in a document format using prefixed operator names.
Expressions are stateless and are only evaluated when seen by the aggregation process. All aggregation expressions can only operate on the current document in the pipeline, and cannot integrate data from other documents.
The accumulator expressions used in the $group operator are an exception: they maintain state (e.g. totals, maximums, minimums, and related data) as documents progress through the pipeline.
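The prefixed-operator document format and the distinction between stateless expressions and stateful accumulators can be sketched as follows. This is an illustrative sketch, not taken from the original page; the field names (pageViews, author) are assumptions:

```javascript
// A hypothetical stateless expression: the operator name ($add) is the
// document's key, and its operands are the array value. It reads only
// the current document's pageViews field.
var computed = { $add: ["$pageViews", 10] };

// A hypothetical $group stage. Unlike other expressions, the accumulator
// ($sum) maintains state -- a running total -- as documents pass through.
var groupStage = {
  $group: {
    _id: "$author",                    // group key
    totalViews: { $sum: "$pageViews" } // stateful accumulator
  }
};
```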
See also
Aggregation expressions for additional examples of the expressions provided by the aggregation framework.
Use
Invocation
Invoke an aggregation operation with the aggregate() wrapper in the mongo shell or with the aggregate database command. Always call aggregate() on a collection object, which determines the input documents of the aggregation pipeline. The arguments to the aggregate() method specify a sequence of pipeline operators, where each operator may have a number of operands.
First, consider a collection of documents named articles using the following format:
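The example document is not reproduced in this excerpt; based on the fields the pipeline described below operates on (author and tags), a representative articles document might look like this. The field values are illustrative assumptions:

```javascript
// A hypothetical document from the articles collection; author and tags
// are the fields the example aggregation pipeline uses.
var article = {
  title: "this is my title",       // illustrative value
  author: "bob",                   // illustrative value
  pageViews: 5,                    // illustrative value
  tags: ["fun", "good", "fun"]     // array field, one entry per tag
};
```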
The following example aggregation operation pivots data to create a set of author names grouped by tags applied to an article. Call the aggregation framework by issuing the following command:
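The command itself is missing from this excerpt; a sketch of the invocation, reconstructed from the $project, $unwind, and $group sequence the next paragraph describes (field names are assumptions):

```javascript
// The pipeline as a plain array of operator documents; in the mongo shell
// you would pass these stages to db.articles.aggregate(...).
var pipeline = [
  { $project: { author: 1, tags: 1 } },  // keep only author and tags
  { $unwind: "$tags" },                  // one output document per tag
  { $group: {
      _id: { tags: "$tags" },            // pivot: group by tag
      authors: { $addToSet: "$author" }  // collect distinct author names
  } }
];
// db.articles.aggregate(pipeline);      // shell invocation (needs a server)
```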
The aggregation pipeline begins with the collection articles and selects the author and tags fields using the $project aggregation operator. The $unwind operator produces one output document per tag. Finally, the $group operator pivots these fields.
Result
The aggregation operation in the previous section returns a document with two fields:
- result, which holds an array of documents returned by the pipeline
- ok, which holds the value 1, indicating success, or another value if there was an error
As a document, the result is subject to the BSON document size limit, which is currently 16 megabytes.
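A sketch of the shape of the returned document, under the assumption that the example pipeline grouped by the hypothetical tags "good" and "fun":

```javascript
// Hypothetical shape of the document returned by the aggregate command.
var response = {
  result: [                                        // one document per group
    { _id: { tags: "good" }, authors: ["bob"] },
    { _id: { tags: "fun" },  authors: ["bob"] }
  ],
  ok: 1                                            // 1 indicates success
};
```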
Optimizing Performance
Because you will always call aggregate on a collection object, which logically inserts the entire collection into the aggregation pipeline, you may want to optimize the operation by avoiding scanning the entire collection whenever possible.
Pipeline Operators and Indexes
Depending on the order in which they appear in the pipeline, aggregation operators can take advantage of indexes.
The following pipeline operators take advantage of an index when they occur at the beginning of the pipeline: $match, $sort, $limit, and $skip.
The above operators can also use an index when placed before the following aggregation operators: $project, $unwind, and $group.
Early Filtering
If your aggregation operation requires only a subset of the data in a collection, use the $match operator to restrict which items go into the top of the pipeline, as in a query. When placed early in a pipeline, these $match operations use suitable indexes to scan only the matching documents in a collection.
Placing a $match pipeline stage followed by a $sort stage at the start of the pipeline is logically equivalent to a single query with a sort, and can use an index.
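The early-filtering pattern above can be sketched as the following pipeline. The field names are illustrative assumptions, not from the original example:

```javascript
// A sketch of early filtering: $match first restricts the documents that
// enter the pipeline, and a leading $match + $sort pair is logically
// equivalent to a query with a sort, so it can use an index.
var pipeline = [
  { $match: { author: "bob" } },   // filter first: fewer documents enter
  { $sort: { pageViews: -1 } },    // index-assisted, like find().sort()
  { $group: {
      _id: "$author",
      totalViews: { $sum: "$pageViews" }
  } }
];
// db.articles.aggregate(pipeline); // shell invocation (needs a server)
```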
In future versions there may be an optimization phase in the pipeline that reorders the operations to increase performance without affecting the result. However, at this time, place $match operators at the beginning of the pipeline when possible.
Memory for Cumulative Operators
Certain pipeline operators require access to the entire input set before they can produce any output. For example, $sort must receive all of the input from the preceding pipeline operator before it can produce its first output document. The current implementation of $sort does not go to disk in these cases: in order to sort the contents of the pipeline, the entire input must fit in memory.
$group has similar characteristics: before any $group passes its output along the pipeline, it must receive the entirety of its input. For the $group operator, this frequently does not require as much memory as $sort, because it only needs to retain one record for each unique key in the grouping specification.
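The one-record-per-key behavior can be sketched in plain JavaScript; this is an illustrative simulation of the idea, not the server's implementation:

```javascript
// Sketch of why $group memory grows with the number of distinct keys
// rather than the number of input documents: the operator keeps one
// accumulator record per unique group key.
function groupBySum(docs, keyField, sumField) {
  var accumulators = {};             // one entry per unique key
  docs.forEach(function (doc) {
    var key = doc[keyField];
    if (!(key in accumulators)) accumulators[key] = 0;
    accumulators[key] += doc[sumField];
  });
  return accumulators;
}

// Three input documents, but only two distinct keys -> two records held.
var totals = groupBySum(
  [{ author: "bob", pageViews: 5 },
   { author: "joe", pageViews: 3 },
   { author: "bob", pageViews: 2 }],
  "author", "pageViews");
// totals is { bob: 7, joe: 3 }
```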
The current implementation of the aggregation framework logs a warning if a cumulative operator consumes 5% or more of the physical memory on the host. Cumulative operators produce an error if they consume 10% or more of the physical memory on the host.
Sharded Operation
Note
Changed in version 2.1.
Some aggregation operations using aggregate will cause mongos instances to require more CPU resources than in previous versions. This modified performance profile may dictate alternate architectural decisions if you use the aggregation framework extensively in a sharded environment.
The aggregation framework is compatible with sharded collections.
When operating on a sharded collection, the aggregation pipeline splits into two parts. The aggregation framework pushes all of the operators up to the first $group or $sort operation to each shard. [1] Then a second pipeline runs on the mongos. This pipeline consists of the first $group or $sort and any remaining pipeline operators, and runs on the results received from the shards.
The $group operator brings in any “sub-totals” from the shards and combines them: in some cases these may be structures. For example, the $avg expression maintains a total and count for each shard; mongos combines these values and then divides.
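The combine-then-divide step for $avg can be sketched in plain JavaScript; this is an illustrative simulation of the idea described above, not the server's code:

```javascript
// Sketch of how mongos can combine $avg sub-totals from shards: each
// shard reports a running { total, count } pair, and mongos sums the
// pairs before dividing, so the final average is exact.
function combineAvg(shardPartials) {
  var total = 0, count = 0;
  shardPartials.forEach(function (p) {
    total += p.total;   // sum of values seen on that shard
    count += p.count;   // number of values seen on that shard
  });
  return total / count; // divide once, at the end
}

var avg = combineAvg([
  { total: 10, count: 2 },  // hypothetical shard A partial
  { total: 30, count: 3 }   // hypothetical shard B partial
]);
// avg === 8, i.e. (10 + 30) / (2 + 3)
```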
[1] If an early $match can exclude shards through the use of the shard key in the predicate, then these operators are only pushed to the relevant shards.
Limitations
Aggregation operations with the aggregate command have the following limitations:
- The pipeline cannot operate on values of the following types: Binary, Symbol, MinKey, MaxKey, DBRef, Code, and CodeWScope.
- Output from the pipeline cannot exceed the BSON document size limit, which is currently 16 megabytes. If your result set exceeds this limit, the aggregate command produces an error.
- If any single aggregation operation consumes more than 10 percent of system RAM, the operation will produce an error.