OPTIONS

Storing Log Data

Overview

This document outlines the basic patterns and principles for using MongoDB as a persistent storage engine for log data from servers and other machine data.

Problem

Servers generate a large number of events (i.e. logging,) that contain useful information about their operation including errors, warnings, and users behavior. By default, most servers, store these data in plain text log files on their local file systems.

While plain-text logs are accessible and human-readable, they are difficult to use, reference, and analyze without holistic systems for aggregating and storing these data.

Solution

The solution described below assumes that each server generates events also consumes event data and that each server can access the MongoDB instance. Furthermore, this design assumes that the query rate for this logging data is substantially lower than common for logging applications with a high-bandwidth event stream.

Note

This case assumes that you’re using an standard uncapped collection for this event data, unless otherwise noted. See the section on capped collections

Schema Design

The schema for storing log data in MongoDB depends on the format of the event data that you’re storing. For a simple example, consider standard request logs in the combined format from the Apache HTTP Server. A line from these logs may resemble the following:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "[http://www.example.com/start.html](http://www.example.com/start.html)" "Mozilla/4.08 [en] (Win98; I ;Nav)"

The simplest approach to storing the log data would be putting the exact text of the log record into a document:

{
  _id: ObjectId('4f442120eb03305789000000'),
 line: '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "[http://www.example.com/start.html](http://www.example.com/start.html)" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
}

While this solution does capture all data in a format that MongoDB can use, the data is not particularly useful, or it’s not terribly efficient: if you need to find events that the same page, you would need to use a regular expression query, which would require a full scan of the collection. The preferred approach is to extract the relevant information from the log data into individual fields in a MongoDB document.

When you extract data from the log into fields, pay attention to the data types you use to render the log data into MongoDB.

As you design this schema, be mindful that the data types you use to encode the data can have a significant impact on the performance and capability of the logging system. Consider the date field: In the above example, [10/Oct/2000:13:55:36 -0700] is 28 bytes long. If you store this with the UTC timestamp type, you can convey the same information in only 8 bytes.

Additionally, using proper types for your data also increases query flexibility: if you store date as a timestamp you can make date range queries, whereas it’s very difficult to compare two strings that represent dates. The same issue holds for numeric fields; storing numbers as strings requires more space and is difficult to query.

Consider the following document that captures all data from the above log entry:

{
     _id: ObjectId('4f442120eb03305789000000'),
     host: "127.0.0.1",
     logname: null,
     user: 'frank',
     time: ISODate("2000-10-10T20:55:36Z"),
     path: "/apache_pb.gif",
     request: "GET /apache_pb.gif HTTP/1.0",
     status: 200,
     response_size: 2326,
     referrer: "[http://www.example.com/start.html](http://www.example.com/start.html)",
     user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)"
}

When extracting data from logs and designing a schema, also consider what information you can omit from your log tracking system. In most cases there’s no need to track all data from an event log, and you can omit other fields. To continue the above example, here the most crucial information may be the host, time, path, user agent, and referrer, as in the following example document:

{
     _id: ObjectId('4f442120eb03305789000000'),
     host: "127.0.0.1",
     time:  ISODate("2000-10-10T20:55:36Z"),
     path: "/apache_pb.gif",
     referer: "[http://www.example.com/start.html](http://www.example.com/start.html)",
     user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)"
}

You may also consider omitting explicit time fields, because the ObjectId embeds creation time:

{
     _id: ObjectId('4f442120eb03305789000000'),
     host: "127.0.0.1",
     path: "/apache_pb.gif",
     referer: "[http://www.example.com/start.html](http://www.example.com/start.html)",
     user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)"
}

System Architecture

The primary performance concern for event logging systems are:

  1. how many inserts per second can it support, which limits the event throughput, and

  2. how will the system manage the growth of event data, particularly concerning a growth in insert activity.

    In most cases the best way to increase the capacity of the system is to use an architecture with some sort of partitioning or sharding that distributes writes among a cluster of systems.

Operations

Insertion speed is the primary performance concern for an event logging system. At the same time, the system must be able to support flexible queries so that you can return data from the system efficiently. This section describes procedures for both document insertion and basic analytics queries.

The examples that follow use the Python programming language and the PyMongo driver for MongoDB, but you can implement this system using any language you choose.

Inserting a Log Record

Write Concern

MongoDB has a configurable write concern. This capability allows you to balance the importance of guaranteeing that all writes are fully recorded in the database with the speed of the insert.

For example, if you issue writes to MongoDB and do not require that the database issue any response, the write operations will return very fast (i.e. asynchronously,) but you cannot be certain that all writes succeeded. Conversely, if you require that MongoDB acknowledge every write operation, the database will not return as quickly but you can be certain that every item will be present in the database.

The proper write concern is often an application specific decision, and depends on the reporting requirements and uses of your analytics application.

Insert Performance

The following example contains the setup for a Python console session using PyMongo, with an event from the Apache Log:

>>> import bson
>>> import pymongo
>>> from datetime import datetime
>>> conn = pymongo.MongoClient()
>>> db = conn.event_db
>>> event = {
...     _id: bson.ObjectId(),
...     host: "127.0.0.1",
...     time:  datetime(2000,10,10,20,55,36),
...     path: "/apache_pb.gif",
...     referer: "[http://www.example.com/start.html](http://www.example.com/start.html)",
...     user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)"
...}

The following command will insert the event object into the events collection.

>>> db.events.insert(event, w=0)

By setting w=0, you do not require that MongoDB acknowledges receipt of the insert. Although very fast, this is risky because the application cannot detect network and server failures. See Write Concern for more information.

If you want to ensure that MongoDB acknowledges inserts, you can pass w=1 argument as follows:

>>> db.events.insert(event, w=1)

MongoDB also supports a more stringent level of write concern, if you have a lower tolerance for data loss:

You can ensure that MongoDB not only acknowledge receipt of the message but also commit the write operation to the on-disk journal before returning successfully to the application, use can use the following insert() operation:

>>> db.events.insert(event, j=True)

Note

j=True implies w=1.

Finally, if you have extremely low tolerance for event data loss, you can require that MongoDB replicate the data to multiple secondary replica set members before returning:

>>> db.events.insert(event, w='majority')

This will force your application to acknowledge that the data has replicated to a majority of configured members of the replica set. You can combine options as well:

>>> db.events.insert(event, j=True, w='majority')

In this case, your application will wait for a successful journal commit on the primary and a replication acknowledgment from a majority of configured secondaries. This is the safest option presented in this section, but it is the slowest. There is always a trade-off between safety and speed.

Note

If possible, consider using bulk inserts to insert event data.

All write concern options apply to bulk inserts, but you can pass multiple events to the insert() method at once. Batch inserts allow MongoDB to distribute the performance penalty incurred by more stringent write concern across a group of inserts.

Finding All Events for a Particular Page

The value in maintaining a collection of event data derives from being able to query that data to answer specific questions. You may have a number of simple queries that you may use to analyze these data.

As an example, you may want to return all of the events associated with specific value of a field. Extending the Apache access log example from above, a common case would be to query for all events with a specific value in the path field: This section contains a pattern for returning data and optimizing this operation.

Query

Use a query that resembles the following to return all documents with the /apache_pb.gif value in the path field:

>>> q_events = db.events.find({'path': '/apache_pb.gif'})

Note

If you choose to shard the collection that stores this data, the shard key you choose can impact the performance of this query. See the sharding section of the sharding document.

Index Support

Adding an index on the path field would significantly enhance the performance of this operation.

>>> db.events.ensure_index('path')

Because the values of the path likely have a random distribution, in order to operate efficiently, the entire index should be resident in RAM. In this case, the number of distinct paths is typically small in relation to the number of documents, which will limit the space that the index requires.

If your system has a limited amount of RAM, or your data set has a wider distribution in values, you may need to re investigate your indexing support. In most cases, however, this index is entirely sufficient.

See also

The ensureIndex() JavaScript method and the db.events.ensure_index() method in PyMongo.

Finding All the Events for a Particular Date

The next example describes the process for returning all the events for a particular date.

Query

To retrieve this data, use the following query:

>>> q_events = db.events.find('time':
...     { '$gte':datetime(2000,10,10),'$lt':datetime(2000,10,11)})

Index Support

In this case, an index on the time field would optimize performance:

>>> db.events.ensure_index('time')

Because your application is inserting events in order, the parts of the index that capture recent events will always be in active RAM. As a result, if you query primarily on recent data, MongoDB will be able to maintain a large index, quickly fulfill queries, and avoid using much system memory.

See also

The ~db.events.ensureIndex() JavaScript method and the db.events.ensure_index() method in PyMongo.

Finding All Events for a Particular Host/Date

The following example describes a more complex query for returning all events in the collection for a particular host on a particular date. This kinds analysis may be useful for investigating suspicious behavior by a specific user.

Query

Use a query that resembles the following:

>>> q_events = db.events.find({
...     'host': '127.0.0.1',
...     'time': {'$gte':datetime(2000,10,10),'$lt':datetime(2000,10,11)}
... })

This query selects documents from the events collection where the host field is 127.0.0.1 (i.e. local host), and the value of the time field represents a date that is on or after (i.e. $gte) 2000-10-10 but before (i.e. $lt) 2000-10-11.

Index Support

The indexes you use may have significant implications for the performance of these kinds of queries. For instance, you can create a compound index on the time and host field, using the following command:

>>> db.events.ensure_index([('time', 1), ('host', 1)])

To analyze the performance for the above query using this index, issue the q_events.explain() method in a Python console. This will return something that resembles:

{ ...
  u'cursor': u'BtreeCursor time_1_host_1',
  u'indexBounds': {u'host': [[u'127.0.0.1', u'127.0.0.1']],
  u'time': [
      [ datetime.datetime(2000, 10, 10, 0, 0),
        datetime.datetime(2000, 10, 11, 0, 0)]]
  },
  ...
  u'millis': 4,
  u'n': 11,
  u'nscanned': 1296,
  u'nscannedObjects': 11,
  ... }

This query had to scan 1296 items from the index to return 11 objects in 4 milliseconds. Conversely, you can test a different compound index with the host field first, followed by the time field. Create this index using the following operation:

>>> db.events.ensure_index([('host', 1), ('time', 1)])

Use the q_events.explain() operation to test the performance:

{ ...
  u'cursor': u'BtreeCursor host_1_time_1',
  u'indexBounds': {u'host': [[u'127.0.0.1', u'127.0.0.1']],
  u'time': [[datetime.datetime(2000, 10, 10, 0, 0),
      datetime.datetime(2000, 10, 11, 0, 0)]]},
  ...
  u'millis': 0,
  u'n': 11,
  ...
  u'nscanned': 11,
  u'nscannedObjects': 11,
  ...
}

Here, the query had to scan 11 items from the index before returning 11 objects in less than a millisecond. By placing the more selective element of your query first in a compound index you may be able to build more useful queries.

Note

Although the index order has an impact query performance, remember that index scans are much faster than collection scans, and depending on your other queries, it may make more sense to use the { time: 1, host: 1 } index depending on usage profile.

See also

The ~db.events.ensureIndex() JavaScript method and the db.events.ensure_index() method in PyMongo.

Counting Requests by Day and Page

The following example describes the process for using the collection of Apache access events to determine the number of request per resource (i.e. page) per day in the last month.

Aggregation

New in version 2.1.

The aggregation framework provides the capacity for queries that select, process, and aggregate results from large numbers of documents. The aggregate() offers greater flexibility, capacity with less complexity than the existing mapReduce and group aggregation commands.

Consider the following aggregation pipeline: [1]

>>> result = db.command('aggregate', 'events', pipeline=[
...         {  '$match': {
...               'time': {
...                   '$gte': datetime(2000,10,1),
...                   '$lt':  datetime(2000,11,1) } } },
...         {  '$project': {
...                 'path': 1,
...                 'date': {
...                     'y': { '$year': '$time' },
...                     'm': { '$month': '$time' },
...                     'd': { '$dayOfMonth': '$time' } } } },
...         { '$group': {
...                 '_id': {
...                     'p':'$path',
...                     'y': '$date.y',
...                     'm': '$date.m',
...                     'd': '$date.d' },
...                 'hits': { '$sum': 1 } } },
...         ])

This command aggregates documents from the events collection with a pipeline that:

  1. Uses the $match to limit the documents that the aggregation framework must process. $match is similar to a find() query.

    This operation selects all documents where the value of the time field represents a date that is on or after (i.e. $gte) 2000-10-10 but before (i.e. $lt) 2000-10-11.

  2. Uses the $project to limit the data that continues through the pipeline. This operator:

    • Selects the path field.
    • Creates a y field to hold the year, computed from the time field in the original documents.
    • Creates a m field to hold the month, computed from the time field in the original documents
    • Creates a d field to hold the day, computed from the time field in the original documents.
  3. Uses the $group to create new computed documents. This step will create a single new document for each unique path/date combination. The documents take the following form:

    • the _id field holds a sub-document with the contents path field from the original documents in the p field, with the date fields from the $project as the remaining fields.
    • the hits field use the $sum statement to increment a counter for every document in the group. In the aggregation output, this field holds the total number of documents at the beginning of the aggregation pipeline with this unique date and path.

Note

In sharded environments, the performance of aggregation operations depends on the shard key. Ideally, all the items in a particular $group operation will reside on the same server.

While this distribution of documents would occur if you chose the time field as the shard key, a field like path also has this property and is a typical choice for sharding. Also see the “sharding considerations.” of this document for additional recommendations for using sharding.

See also

Data Aggregation

[1]To translate statements from the aggregation pipeline to SQL, you can consider the $match equivalent to WHERE, $project to SELECT, and $group to GROUP BY.

Index Support

To optimize the aggregation operation, ensure that the initial $match query has an index. Use the following command to create an index on the time field in the events collection:

>>> db.events.ensure_index('time')

Note

If you have already created a compound index on the time and host (i.e. { time: 1, host, 1 },) MongoDB will use this index for range queries on just the time field. Do not create an additional index, in these situations.

Sharding

Eventually your system’s events will exceed the capacity of a single event logging database instance. In these situations you will want to use a sharded cluster, which takes advantage of MongoDB’s sharding functionality. This section introduces the unique sharding concerns for this event logging case.

Limitations

In a sharded environment the limitations on the maximum insertion rate are:

  • the number of shards in the cluster.
  • the shard key you chose.

Because MongoDB distributed data in using “ranges” (i.e. chunks) of keys, the choice of shard key can control how MongoDB distributes data and the resulting systems’ capacity for writes and queries.

Ideally, your shard key should allow insertions balance evenly among the shards [2] and for most queries to only need to access a single shard. [3] Continue reading for an analysis of a collection of shard key choices.

[2]For this reason, avoid shard keys based on the timestamp or the insertion time (i.e. the ObjectId) because all writes will end up on a single node.
[3]For this reason, avoid randomized shard keys (e.g. hash based shard keys) because any query will have to access all shards in the cluster.

Shard by Time

While using the timestamp, or the ObjectId in the _id field, [4] would distribute your data evenly among shards, these keys lead to two problems:

  1. All inserts always flow to the same shard, which means that your sharded cluster will have the same write throughput as a standalone instance.
  2. Most reads will tend to cluster on the same shard, as analytics queries.
[4]The ObjectId derives from the creation time, and is effectively a timestamp in this case.

Shard by a Semi-Random Key

To distribute data more evenly among the shards, you may consider using a more “random” piece of data, such as a hash of the _id field (i.e. the ObjectId as a shard key.

While this introduces some additional complexity into your application, to generate the key, it will distribute writes among the shards. In these deployments having 5 shards will provide 5 times the write capacity as a single instance.

Using this shard key, or any hashed value as a key presents the following downsides:

  • the shard key, and the index on the key will consume additional space in the database.
  • queries, unless they include the shard key itself, [5] must run in parallel on all shards, which may lead to degraded performance.

This might be an acceptable trade-off in some situations. The workload of event logging systems tends to be heavily skewed toward writing, read performance may not be as critical as more robust write performance.

[5]Typically, it is difficult to use these kinds of shard keys in queries.

Shard by an Evenly-Distributed Key in the Data Set

If a field in your documents has values that are evenly distributed among the documents, you may consider using this key as a shard key.

Continuing the example from above, you may consider using the path field. Which may have a couple of advantages:

  1. writes will tend to balance evenly among shards.
  2. reads will tend to be selective and local to a single shard if the query selects on the path field.

There are a few potential problems with these kinds of shard keys:

  1. If a large number of documents will have the same shard key, you run the risk of having a portion of your data collection MongoDB cannot distribute throughout the cluster.
  2. If there are a small number of possible values, there may be a limit to how much MongoDB will be able to distribute the data among the shard.

Note

Test using your existing data to ensure that the distribution is truly even, and that there is a sufficient quantity of distinct values for the shard key.

Shard by Combine a Natural and Synthetic Key

MongoDB supports compound shard keys that combine the best aspects of sharding by a evenly distributed key in the set and sharding by a random key. In these situations, the shard key would resemble { path: 1 , ssk: 1 } where, path is an often used “natural key, or value from your data and ssk is a hash of the _id field. [6]

Using this type of shard key, data is largely distributed by the natural key, or path, which makes most queries that access the path field local to a single shard or group of shards. At the same time, if there is not sufficient distribution for specific values of path, the ssk makes it possible for MongoDB to create chunks and data across the cluster.

In most situations, these kinds of keys provide the ideal balance between distributing writes across the cluster and ensuring that most queries will only need to access a select number of shards.

[6]You must still calculate the value of this synthetic key in your application when you insert documents into your collection.

Test with Your Own Data

Selecting shard keys is difficult because: there are no definitive “best-practices,” the decision has a large impact on performance, and it is difficult or impossible to change the shard key after making the selection.

The sharding options provides a good starting point for thinking about shard key selection. Nevertheless, the best way to select a shard key is to analyze the actual insertions and queries from your own application.

Managing Event Data Growth

Without some strategy for managing the size of your database, most event logging systems can grow infinitely. This is particularly important in the context of MongoDB may not relinquish data to the file system in the way you might expect. Consider the following strategies for managing data growth:

Capped Collections

Depending on your data retention requirements as well as your reporting and analytics needs, you may consider using a capped collection to store your events. Capped collections have a fixed size, and drop old data when inserting new data after reaching cap.

Note

In the current version, it is not possible to shard capped collections.

Multiple Collections, Single Database

Strategy: Periodically rename your event collection so that your data collection rotates in much the same way that you might rotate log files. When needed, you can drop the oldest collection from the database.

This approach has several advantages over the single collection approach:

  1. Collection renames are fast and atomic.
  2. MongoDB does not bring any document into memory to drop a collection.
  3. MongoDB can effectively reuse space freed by removing entire collections without leading to data fragmentation.

Nevertheless, this operation may increase some complexity for queries, if any of your analyses depend on events that may reside in the current and previous collection. For most real time data collection systems, this approach is the most ideal.

Multiple Databases

Strategy: Rotate databases rather than collections, as in the “Multiple Collections, Single Database example.

While this significantly increases application complexity for insertions and queries, when you drop old databases, MongoDB will return disk space to the file system. This approach makes the most sense in scenarios where your event insertion rates and/or your data retention rates were extremely variable.

For example, if you are performing a large backfill of event data and want to make sure that the entire set of event data for 90 days is available during the backfill, during normal operations you only need 30 days of event data, you might consider using multiple databases.