>  Blog

The Aggregation Framework in MongoDB


Pradyumn Sharma

October 10, 2017


MongoDB's find() method is useful for simple queries from a collection, but it does not have an equivalent to the "group by" clause of SQL, for aggregating data.

However, MongoDB provides an aggregation framework that is far more powerful and flexible a way to aggregate and analyze data. The Aggregation Framework is essentially a multi-stage pipeline of data processing operations.

Pipeline is a common concept in many platforms, such as Linux. For example, consider the following statement in Linux:

cat 'filename' | more

The output of one command (in this case, cat) becomes the input to the next command (more) in a pipeline. No command is aware of its predecessor or successor. A pipeline can have as many stages as you need.


Let's build an example to illustrate some of the features of MongoDB's aggregation framework, in small, incremental steps.



The Dataset

We'll use a public domain dataset of the ZIP codes in USA, provided by MongoDB on their website, available for download here. The dataset looks like this:

{ "_id" : "01001", "city" : "AGAWAM", "loc" : [ -72.622739, 42.070206 ], "pop" : 15338, "state" : "MA" }
{ "_id" : "01002", "city" : "CUSHMAN", "loc" : [ -72.51564999999999, 42.377017 ], "pop" : 36963, "state" : "MA" }
{ "_id" : "01005", "city" : "BARRE", "loc" : [ -72.10835400000001, 42.409698 ], "pop" : 4546, "state" : "MA" }
{ "_id" : "01007", "city" : "BELCHERTOWN", "loc" : [ -72.41095300000001, 42.275103 ], "pop" : 10579, "state" : "MA" }
{ "_id" : "01008", "city" : "BLANDFORD", "loc" : [ -72.936114, 42.182949 ], "pop" : 1240, "state" : "MA" }
{ "_id" : "01010", "city" : "BRIMFIELD", "loc" : [ -72.188455, 42.116543 ], "pop" : 3706, "state" : "MA" }
{ "_id" : "01011", "city" : "CHESTER", "loc" : [ -72.988761, 42.279421 ], "pop" : 1688, "state" : "MA" }
{ "_id" : "01012", "city" : "CHESTERFIELD", "loc" : [ -72.833309, 42.38167 ], "pop" : 177, "state" : "MA" }
{ "_id" : "01013", "city" : "CHICOPEE", "loc" : [ -72.607962, 42.162046 ], "pop" : 23396, "state" : "MA" }
{ "_id" : "01020", "city" : "CHICOPEE", "loc" : [ -72.576142, 42.176443 ], "pop" : 31495, "state" : "MA" }
{ "_id" : "01022", "city" : "WESTOVER AFB", "loc" : [ -72.558657, 42.196672 ], "pop" : 1764, "state" : "MA" }
{ "_id" : "01026", "city" : "CUMMINGTON", "loc" : [ -72.905767, 42.435296 ], "pop" : 1484, "state" : "MA" }
{ "_id" : "01027", "city" : "MOUNT TOM", "loc" : [ -72.67992099999999, 42.264319 ], "pop" : 16864, "state" : "MA" }


Download the dataset to your computer, and then use the mongoimport command to load the data in your database (replacing the path of the file with an appropriate value):

mongoimport --db tutorial --collection zips --file "c:\downloads\zips.json"

If the command is entered correctly by you, and is successfully executed, you should get the messages as shown below:

2017-07-30T17:10:45.419+0530 connected to: localhost
2017-07-30T17:10:48.036+0530 imported 29353 documents



Accessing the Dataset in MongoDB

In the "mongo" client window, verify that the dataset is properly imported, by issuing the following command:

db.zips.count()

The output should be 29353.

Have a look at the data, by issuing the following command:

db.zips.count()

The output should be like this:

> db.zips.find()
{ "_id" : "01001", "city" : "AGAWAM", "loc" : [ -72.622739, 42.070206 ], "pop" : 15338, "state" : "MA" }
{ "_id" : "01002", "city" : "CUSHMAN", "loc" : [ -72.51565, 42.377017 ], "pop" : 36963, "state" : "MA" }
{ "_id" : "01005", "city" : "BARRE", "loc" : [ -72.108354, 42.409698 ], "pop" : 4546, "state" : "MA" }
{ "_id" : "01007", "city" : "BELCHERTOWN", "loc" : [ -72.410953, 42.275103 ], "pop" : 10579, "state" : "MA" }
{ "_id" : "01008", "city" : "BLANDFORD", "loc" : [ -72.936114, 42.182949 ], "pop" : 1240, "state" : "MA" }
{ "_id" : "01010", "city" : "BRIMFIELD", "loc" : [ -72.188455, 42.116543 ], "pop" : 3706, "state" : "MA" }
{ "_id" : "01011", "city" : "CHESTER", "loc" : [ -72.988761, 42.279421 ], "pop" : 1688, "state" : "MA" }
{ "_id" : "01012", "city" : "CHESTERFIELD", "loc" : [ -72.833309, 42.38167 ], "pop" : 177, "state" : "MA" }
{ "_id" : "01013", "city" : "CHICOPEE", "loc" : [ -72.607962, 42.162046 ], "pop" : 23396, "state" : "MA" }
{ "_id" : "01020", "city" : "CHICOPEE", "loc" : [ -72.576142, 42.176443 ], "pop" : 31495, "state" : "MA" }
{ "_id" : "01022", "city" : "WESTOVER AFB", "loc" : [ -72.558657, 42.196672 ], "pop" : 1764, "state" : "MA" }
{ "_id" : "01026", "city" : "CUMMINGTON", "loc" : [ -72.905767, 42.435296 ], "pop" : 1484, "state" : "MA" }
{ "_id" : "01027", "city" : "MOUNT TOM", "loc" : [ -72.679921, 42.264319 ], "pop" : 16864, "state" : "MA" }
{ "_id" : "01028", "city" : "EAST LONGMEADOW", "loc" : [ -72.505565, 42.067203 ], "pop" : 13367, "state" : "MA" }
{ "_id" : "01030", "city" : "FEEDING HILLS", "loc" : [ -72.675077, 42.07182 ], "pop" : 11985, "state" : "MA" }
{ "_id" : "01031", "city" : "GILBERTVILLE", "loc" : [ -72.198585, 42.332194 ], "pop" : 2385, "state" : "MA" }
{ "_id" : "01032", "city" : "GOSHEN", "loc" : [ -72.844092, 42.466234 ], "pop" : 122, "state" : "MA" }
{ "_id" : "01033", "city" : "GRANBY", "loc" : [ -72.520001, 42.255704 ], "pop" : 5526, "state" : "MA" }
{ "_id" : "01034", "city" : "TOLLAND", "loc" : [ -72.908793, 42.070234 ], "pop" : 1652, "state" : "MA" }
{ "_id" : "01035", "city" : "HADLEY", "loc" : [ -72.571499, 42.36062 ], "pop" : 4231, "state" : "MA" }


Example 1: State-wise total population

Let's start with a simple example to find out state-wise total population. This is the command we write:

db.zips.aggregate (
	{$group: 
		{
			_id: '$state',
			population: {$sum: '$pop'}
		}
	}
)

First (a part of) the output, before the explanation:

{ "_id" : "AK", "population" : 544698 }
{ "_id" : "AZ", "population" : 3665228 }
{ "_id" : "UT", "population" : 1722850 }
{ "_id" : "WY", "population" : 453528 }
{ "_id" : "CO", "population" : 3293755 }
{ "_id" : "ID", "population" : 1006749 }
{ "_id" : "OK", "population" : 3145585 }
{ "_id" : "AR", "population" : 2350725 }
{ "_id" : "LA", "population" : 4217595 }
{ "_id" : "NV", "population" : 1201833 }
{ "_id" : "NE", "population" : 1578139 }
{ "_id" : "KS", "population" : 2475285 }
{ "_id" : "MO", "population" : 5110648 }
{ "_id" : "IL", "population" : 11427576 }
{ "_id" : "OH", "population" : 10846517 }
{ "_id" : "KY", "population" : 3675484 }
{ "_id" : "IN", "population" : 5544136 }
{ "_id" : "WI", "population" : 4891769 }
{ "_id" : "HI", "population" : 1108229 }
{ "_id" : "MS", "population" : 2573216 }

We invoke the aggregation framework by calling the db.collectionname.aggregate() function. This function takes one or more JSON documents as pipeline parameters. Each JSON document represents one stage in the aggregation pipeline. In the above example, there is only one JSON document ($group). In other words, there is only one stage in this pipeline.

The $group stage is equivalent to the GROUP BY clause in a SQL SELECT query. It groups documents from the input collection based on what you specify as the value of the _id field. In the above example, we are grouping the documents based on the state field of the input document, which needs to be preceded by the $ sign and enclosed within quotes. The _id field is mandatory, and becomes part of the conceptual collection that is the output of the $group stage.

The additional fields that you specify in the $group stage become part of the output of this stage. In the above example, we store the sum of pop field from the input collection (specified as '$pop') in a new field called population.



Example 2: City-wise total population

Let us now query the collection for total population by cities. Since some city names exist in more than one state, we cannot use the city field alone to query city-wise total population. We now use a compound expression, consisting of city and state fields from the zips collection, in the form of a nested JSON document as the value of the _id field.

Here is the query:

db.zips.aggregate (
	{$group: 
		{
			_id: {city: '$city', state: '$state'},
			population: {$sum: '$pop'}
		}
	}
)

And here is a part of the output:

{ "_id" : { "city" : "POINT BAKER", "state" : "AK" }, "population" : 426 }
{ "_id" : { "city" : "CRAIG", "state" : "AK" }, "population" : 1398 }
{ "_id" : { "city" : "HYDABURG", "state" : "AK" }, "population" : 891 }
{ "_id" : { "city" : "KETCHIKAN", "state" : "AK" }, "population" : 14308 }
{ "_id" : { "city" : "PETERSBURG", "state" : "AK" }, "population" : 4253 }
{ "_id" : { "city" : "SITKA", "state" : "AK" }, "population" : 8638 }
{ "_id" : { "city" : "GUSTAVUS", "state" : "AK" }, "population" : 258 }
{ "_id" : { "city" : "ANGOON", "state" : "AK" }, "population" : 1002 }
{ "_id" : { "city" : "JUNEAU", "state" : "AK" }, "population" : 24947 }
{ "_id" : { "city" : "NUIQSUT", "state" : "AK" }, "population" : 354 }
{ "_id" : { "city" : "CHALKYITSIK", "state" : "AK" }, "population" : 99 }
{ "_id" : { "city" : "WHITE MOUNTAIN", "state" : "AK" }, "population" : 194 }
{ "_id" : { "city" : "AMBLER", "state" : "AK" }, "population" : 8 }
{ "_id" : { "city" : "TELLER", "state" : "AK" }, "population" : 260 }
{ "_id" : { "city" : "SHAKTOOLIK", "state" : "AK" }, "population" : 183 }
{ "_id" : { "city" : "SHUNGNAK", "state" : "AK" }, "population" : 0 }
{ "_id" : { "city" : "RUBY", "state" : "AK" }, "population" : 172 }
{ "_id" : { "city" : "NULATO", "state" : "AK" }, "population" : 492 }
{ "_id" : { "city" : "NOATAK", "state" : "AK" }, "population" : 395 }
{ "_id" : { "city" : "POINT LAY", "state" : "AK" }, "population" : 139 }


Example 3: Sort by Population

Let us sort the above output in the descending order of population. For this purpose, we use the $sort stage in the aggregation pipeline, as follows:

db.zips.aggregate (
	{$group: 
		{
			_id: {city: '$city', state: '$state'},
			population: {$sum: '$pop'}
		}
	},
	{$sort:
		{population: -1}
	}
)

And here is the output:

{ "_id" : { "city" : "CHICAGO", "state" : "IL" }, "population" : 2452177 }
{ "_id" : { "city" : "BROOKLYN", "state" : "NY" }, "population" : 2300504 }
{ "_id" : { "city" : "LOS ANGELES", "state" : "CA" }, "population" : 2102295 }
{ "_id" : { "city" : "HOUSTON", "state" : "TX" }, "population" : 2095918 }
{ "_id" : { "city" : "PHILADELPHIA", "state" : "PA" }, "population" : 1610956 }
{ "_id" : { "city" : "NEW YORK", "state" : "NY" }, "population" : 1476790 }
{ "_id" : { "city" : "BRONX", "state" : "NY" }, "population" : 1209548 }
{ "_id" : { "city" : "SAN DIEGO", "state" : "CA" }, "population" : 1049298 }
{ "_id" : { "city" : "DETROIT", "state" : "MI" }, "population" : 963243 }
{ "_id" : { "city" : "DALLAS", "state" : "TX" }, "population" : 940191 }
{ "_id" : { "city" : "PHOENIX", "state" : "AZ" }, "population" : 890853 }
{ "_id" : { "city" : "MIAMI", "state" : "FL" }, "population" : 825232 }
{ "_id" : { "city" : "SAN JOSE", "state" : "CA" }, "population" : 816653 }
{ "_id" : { "city" : "SAN ANTONIO", "state" : "TX" }, "population" : 811792 }
{ "_id" : { "city" : "BALTIMORE", "state" : "MD" }, "population" : 733081 }
{ "_id" : { "city" : "SAN FRANCISCO", "state" : "CA" }, "population" : 723993 }
{ "_id" : { "city" : "MEMPHIS", "state" : "TN" }, "population" : 632837 }
{ "_id" : { "city" : "SACRAMENTO", "state" : "CA" }, "population" : 628279 }
{ "_id" : { "city" : "JACKSONVILLE", "state" : "FL" }, "population" : 610160 }
{ "_id" : { "city" : "ATLANTA", "state" : "GA" }, "population" : 609591 }

The $sort stage takes a JSON document that specifies one or more fields to sort the input collection by, and the sort order (1 being ascending order, and -1 descending order).



Example 4: Count of Zip Codes for Cities

How many zip codes do various cities have? Let's see the output in the descending order of such count.

Well, surprisingly, MongoDB does not have a $count stage, but we can use {$sum: 1} to get the same result. Here is the query:

db.zips.aggregate (
	{$group: 
		{
			_id: {city: '$city', state: '$state'},
			zips: {$sum: 1}
		}
	},
	{$sort: 
		{zips: -1}
	}
)

And here is the output:

{ "_id" : { "city" : "HOUSTON", "state" : "TX" }, "zips" : 93 }
{ "_id" : { "city" : "LOS ANGELES", "state" : "CA" }, "zips" : 56 }
{ "_id" : { "city" : "PHILADELPHIA", "state" : "PA" }, "zips" : 48 }
{ "_id" : { "city" : "CHICAGO", "state" : "IL" }, "zips" : 47 }
{ "_id" : { "city" : "SAN ANTONIO", "state" : "TX" }, "zips" : 45 }
{ "_id" : { "city" : "DALLAS", "state" : "TX" }, "zips" : 44 }
{ "_id" : { "city" : "KANSAS CITY", "state" : "MO" }, "zips" : 41 }
{ "_id" : { "city" : "AUSTIN", "state" : "TX" }, "zips" : 40 }
{ "_id" : { "city" : "NEW YORK", "state" : "NY" }, "zips" : 40 }
{ "_id" : { "city" : "BROOKLYN", "state" : "NY" }, "zips" : 37 }
{ "_id" : { "city" : "SAN DIEGO", "state" : "CA" }, "zips" : 34 }
{ "_id" : { "city" : "MIAMI", "state" : "FL" }, "zips" : 34 }
{ "_id" : { "city" : "OKLAHOMA CITY", "state" : "OK" }, "zips" : 33 }
{ "_id" : { "city" : "PHOENIX", "state" : "AZ" }, "zips" : 33 }
{ "_id" : { "city" : "ATLANTA", "state" : "GA" }, "zips" : 31 }
{ "_id" : { "city" : "SAN JOSE", "state" : "CA" }, "zips" : 29 }
{ "_id" : { "city" : "TULSA", "state" : "OK" }, "zips" : 28 }
{ "_id" : { "city" : "SACRAMENTO", "state" : "CA" }, "zips" : 28 }
{ "_id" : { "city" : "OMAHA", "state" : "NE" }, "zips" : 27 }
{ "_id" : { "city" : "PORTLAND", "state" : "OR" }, "zips" : 26 }

Example 5: Limit the Query to a Specific State

Suppose we want to run the above query, only for the state of Illinois. We can use the $match stage to filter the documents that match a given condition, as shown below:

db.zips.aggregate (
	{$match:
		{state: 'IL'}
	},
	{$group: 
		{
			_id: '$city',
			zips: {$sum: 1}
		}
	},
	{$sort: 
		{zips: -1}
	}
)

Here is the output:

{ "_id" : "CHICAGO", "zips" : 47 }
{ "_id" : "ROCKFORD", "zips" : 8 }
{ "_id" : "PEORIA", "zips" : 5 }
{ "_id" : "NAPERVILLE", "zips" : 4 }
{ "_id" : "DECATUR", "zips" : 3 }
{ "_id" : "BELLEVILLE", "zips" : 3 }
{ "_id" : "JOLIET", "zips" : 3 }
{ "_id" : "OAK PARK", "zips" : 3 }
{ "_id" : "AURORA", "zips" : 3 }
{ "_id" : "LINCOLNWOOD", "zips" : 3 }
{ "_id" : "EVANSTON", "zips" : 3 }
{ "_id" : "EAST SAINT LOUIS", "zips" : 2 }
{ "_id" : "CHAMPAIGN", "zips" : 2 }
{ "_id" : "PEORIA HEIGHTS", "zips" : 2 }
{ "_id" : "RANTOUL", "zips" : 2 }
{ "_id" : "DOWNERS GROVE", "zips" : 2 }
{ "_id" : "SAINT CHARLES", "zips" : 2 }
{ "_id" : "SCHAUMBURG", "zips" : 2 }
{ "_id" : "HILLSIDE", "zips" : 2 }
{ "_id" : "HOFFMAN ESTATES", "zips" : 2 }

The $match stage takes a JSON document specifying the filter condition, with the same syntax as in the find() function of MongoDB. Since the output collection of this stage (and therefore the input collection of the next, $group stage) is limited to the {state: 'IL'}, we don't need to worry about including the state field in _id field of the $group stage.



Closing Remarks

The pipeline structure of the aggregation framework allows various operations to be applied, wherein the output of one stage is a conceptual collection that becomes the input to the next stage.

This article is only an introduction to a few features of the aggregation framework of MongoDB. There is much more to it, which I may cover in one or more follow-up articles some other time.