OPTIONS

Aggregation with the Zip Code Data Set

The examples in this document use the zipcode collection. This collection is available at: media.mongodb.org/zips.json. Use mongoimport to load this data set into your mongod instance.

Data Model

Each document in the zipcode collection has the following form:

{
  "_id": "10280",
  "city": "NEW YORK",
  "state": "NY",
  "pop": 5574,
  "loc": [
    -74.016323,
    40.710537
  ]
}

The _id field holds the zip code as a string.

The city field holds the city.

The state field holds the two letter state abbreviation.

The pop field holds the population.

The loc field holds the location as a latitude longitude pair.

All of the following examples use the aggregate() helper in the mongo shell. aggregate() provides a wrapper around the aggregate database command. See the documentation for your driver for a more idiomatic interface for data aggregation operations.

Return States with Populations above 10 Million

To return all states with a population greater than 10 million, use the following aggregation operation:

db.zipcodes.aggregate( { $group :
                         { _id : "$state",
                           totalPop : { $sum : "$pop" } } },
                       { $match : {totalPop : { $gte : 10*1000*1000 } } } )

Aggregations operations using the aggregate() helper process all documents in the zipcodes collection. aggregate() connects a number of pipeline operators, which define the aggregation process.

In this example, the pipeline passes all documents in the zipcodes collection through the following steps:

  • the $group operator collects all documents and creates documents for each state.

    These new per-state documents have one field in addition to the _id field: totalPop which is a generated field using the $sum operation to calculate the total value of all pop fields in the source documents.

    After the $group operation the documents in the pipeline resemble the following:

    {
      "_id" : "AK",
      "totalPop" : 550043
    }
    
  • the $match operation filters these documents so that the only documents that remain are those where the value of totalPop is greater than or equal to 10 million.

    The $match operation does not alter the documents, which have the same format as the documents output by $group.

The equivalent SQL for this operation is:

SELECT state, SUM(pop) AS totalPop
       FROM zipcodes
       GROUP BY state
       HAVING totalPop >= (10*1000*1000)

Return Average City Population by State

To return the average populations for cities in each state, use the following aggregation operation:

db.zipcodes.aggregate( { $group :
                         { _id : { state : "$state", city : "$city" },
                           pop : { $sum : "$pop" } } },
                       { $group :
                       { _id : "$_id.state",
                         avgCityPop : { $avg : "$pop" } } } )

Aggregations operations using the aggregate() helper process all documents in the zipcodes collection. aggregate() connects a number of pipeline operators that define the aggregation process.

In this example, the pipeline passes all documents in the zipcodes collection through the following steps:

  • the $group operator collects all documents and creates new documents for every combination of the city and state fields in the source document.

    After this stage in the pipeline, the documents resemble the following:

    {
      "_id" : {
        "state" : "CO",
        "city" : "EDGEWATER"
      },
      "pop" : 13154
    }
    
  • the second $group operator collects documents by the state field and use the $avg expression to compute a value for the avgCityPop field.

The final output of this aggregation operation is:

{
  "_id" : "MN",
  "avgCityPop" : 5335
},

Return Largest and Smallest Cities by State

To return the smallest and largest cities by population for each state, use the following aggregation operation:

db.zipcodes.aggregate( { $group:
                         { _id: { state: "$state", city: "$city" },
                           pop: { $sum: "$pop" } } },
                       { $sort: { pop: 1 } },
                       { $group:
                         { _id : "$_id.state",
                           biggestCity:  { $last: "$_id.city" },
                           biggestPop:   { $last: "$pop" },
                           smallestCity: { $first: "$_id.city" },
                           smallestPop:  { $first: "$pop" } } },

                       // the following $project is optional, and
                       // modifies the output format.

                       { $project:
                         { _id: 0,
                           state: "$_id",
                           biggestCity:  { name: "$biggestCity",  pop: "$biggestPop" },
                           smallestCity: { name: "$smallestCity", pop: "$smallestPop" } } } )

Aggregation operations using the aggregate() helper process all documents in the zipcodes collection. aggregate() combines a number of pipeline operators that define the aggregation process.

All documents from the zipcodes collection pass into the pipeline, which consists of the following steps:

  • the $group operator collects all documents and creates new documents for every combination of the city and state fields in the source documents.

    By specifying the value of _id as a sub-document that contains both fields, the operation preserves the state field for use later in the pipeline. The documents produced by this stage of the pipeline have a second field, pop, which uses the $sum operator to provide the total of the pop fields in the source document.

    At this stage in the pipeline, the documents resemble the following:

    {
      "_id" : {
        "state" : "CO",
        "city" : "EDGEWATER"
      },
      "pop" : 13154
    }
    
  • $sort operator orders the documents in the pipeline based on the value of the pop field from largest to smallest. This operation does not alter the documents.

  • the second $group operator collects the documents in the pipeline by the state field, which is a field inside the nested _id document.

    Within each per-state document this $group operator specifies four fields: Using the $last expression, the $group operator creates the biggestcity and biggestpop fields that store the city with the largest population and that population. Using the $first expression, the $group operator creates the smallestcity and smallestpop fields that store the city with the smallest population and that population.

    The documents, at this stage in the pipeline resemble the following:

    {
      "_id" : "WA",
      "biggestCity" : "SEATTLE",
      "biggestPop" : 520096,
      "smallestCity" : "BENGE",
      "smallestPop" : 2
    }
    
  • The final operation is $project, which renames the _id field to state and moves the biggestCity, biggestPop, smallestCity, and smallestPop into biggestCity and smallestCity sub-documents.

The output of this aggregation operation is:

{
  "state" : "RI",
  "biggestCity" : {
    "name" : "CRANSTON",
    "pop" : 176404
  },
  "smallestCity" : {
    "name" : "CLAYVILLE",
    "pop" : 45
  }
}