OPTIONS

Aggregation with the Zip Code Data Set

The examples in this document use the zipcode collection. This collection is available at: media.mongodb.org/zips.json. Use mongoimport to load this data set into your mongod instance.

Data Model

Each document in the zipcode collection has the following form:

{
  "_id": "10280",
  "city": "NEW YORK",
  "state": "NY",
  "pop": 5574,
  "loc": [
    -74.016323,
    40.710537
  ]
}

The _id field holds the zip code as a string.

The city field holds the city name. A city can have more than one zip code associated with it as different sections of the city can each have a different zip code.

The state field holds the two letter state abbreviation.

The pop field holds the population.

The loc field holds the location as a latitude longitude pair.

All of the following examples use the aggregate() helper in the mongo shell. aggregate() provides a wrapper around the aggregate database command. See the documentation for your driver for a more idiomatic interface for data aggregation operations.

Return States with Populations above 10 Million

To return all states with a population greater than 10 million, use the following aggregation operation:

db.zipcodes.aggregate( { $group :
                         { _id : "$state",
                           totalPop : { $sum : "$pop" } } },
                       { $match : {totalPop : { $gte : 10*1000*1000 } } } )

Aggregations operations using the aggregate() helper process all documents in the zipcodes collection. aggregate() connects a number of pipeline operators, which define the aggregation process.

In this example, the pipeline passes all documents in the zipcodes collection through the following steps:

  • the $group operator collects all documents and creates documents for each state.

    These new per-state documents have one field in addition to the _id field: totalPop which is a generated field using the $sum operation to calculate the total value of all pop fields in the source documents.

    After the $group operation the documents in the pipeline resemble the following:

    {
      "_id" : "AK",
      "totalPop" : 550043
    }
    
  • the $match operation filters these documents so that the only documents that remain are those where the value of totalPop is greater than or equal to 10 million.

    The $match operation does not alter the documents, which have the same format as the documents output by $group.

The equivalent SQL for this operation is:

SELECT state, SUM(pop) AS totalPop
       FROM zipcodes
       GROUP BY state
       HAVING totalPop >= (10*1000*1000)

Return Average City Population by State

To return the average populations for cities in each state, use the following aggregation operation:

db.zipcodes.aggregate( [
   { $group : { _id : { state : "$state", city : "$city" }, pop : { $sum : "$pop" } } },
   { $group : { _id : "$_id.state", avgCityPop : { $avg : "$pop" } } }
] )

Aggregations operations using the aggregate() helper process all documents in the zipcodes collection. aggregate() connects a number of pipeline operators that define the aggregation process.

In this example, the pipeline passes all documents in the zipcodes collection through the following steps:

  • the $group operator collects all documents and creates new documents for every combination of the city and state fields in the source document. A city can have more than one zip code associated with it as different sections of the city can each have a different zip code.

    After this stage in the pipeline, the documents resemble the following:

    {
      "_id" : {
        "state" : "CO",
        "city" : "EDGEWATER"
      },
      "pop" : 13154
    }
    
  • the second $group operator collects documents by the state field and use the $avg expression to compute a value for the avgCityPop field.

The final output of this aggregation operation is:

{
  "_id" : "MN",
  "avgCityPop" : 5335
},

Return Largest and Smallest Cities by State

To return the smallest and largest cities by population for each state, use the following aggregation operation:

db.zipcodes.aggregate( { $group:
                         { _id: { state: "$state", city: "$city" },
                           pop: { $sum: "$pop" } } },
                       { $sort: { pop: 1 } },
                       { $group:
                         { _id : "$_id.state",
                           biggestCity:  { $last: "$_id.city" },
                           biggestPop:   { $last: "$pop" },
                           smallestCity: { $first: "$_id.city" },
                           smallestPop:  { $first: "$pop" } } },

                       // the following $project is optional, and
                       // modifies the output format.

                       { $project:
                         { _id: 0,
                           state: "$_id",
                           biggestCity:  { name: "$biggestCity",  pop: "$biggestPop" },
                           smallestCity: { name: "$smallestCity", pop: "$smallestPop" } } } )

Aggregation operations using the aggregate() helper process all documents in the zipcodes collection. aggregate() combines a number of pipeline operators that define the aggregation process.

All documents from the zipcodes collection pass into the pipeline, which consists of the following steps:

  • the $group operator collects all documents and creates new documents for every combination of the city and state fields in the source documents.

    By specifying the value of _id as a sub-document that contains both fields, the operation preserves the state field for use later in the pipeline. The documents produced by this stage of the pipeline have a second field, pop, which uses the $sum operator to provide the total of the pop fields in the source document.

    At this stage in the pipeline, the documents resemble the following:

    {
      "_id" : {
        "state" : "CO",
        "city" : "EDGEWATER"
      },
      "pop" : 13154
    }
    
  • $sort operator orders the documents in the pipeline based on the value of the pop field from largest to smallest. This operation does not alter the documents.

  • the second $group operator collects the documents in the pipeline by the state field, which is a field inside the nested _id document.

    Within each per-state document this $group operator specifies four fields: Using the $last expression, the $group operator creates the biggestcity and biggestpop fields that store the city with the largest population and that population. Using the $first expression, the $group operator creates the smallestcity and smallestpop fields that store the city with the smallest population and that population.

    The documents, at this stage in the pipeline resemble the following:

    {
      "_id" : "WA",
      "biggestCity" : "SEATTLE",
      "biggestPop" : 520096,
      "smallestCity" : "BENGE",
      "smallestPop" : 2
    }
    
  • The final operation is $project, which renames the _id field to state and moves the biggestCity, biggestPop, smallestCity, and smallestPop into biggestCity and smallestCity sub-documents.

The output of this aggregation operation is:

{
  "state" : "RI",
  "biggestCity" : {
    "name" : "CRANSTON",
    "pop" : 176404
  },
  "smallestCity" : {
    "name" : "CLAYVILLE",
    "pop" : 45
  }
}