OPTIONS

Metadata and Asset Management

Overview

This document describes the design and pattern of a content management system using MongoDB modeled on the popular Drupal CMS.

Problem

You are designing a content management system (CMS) and you want to use MongoDB to store the content of your sites.

Solution

To build this system you will use MongoDB’s flexible schema to store all content “nodes” in a single collection regardless of type. This guide will provide prototype schema and describe common operations for the following primary node types:

Basic Page
Basic pages are useful for displaying infrequently-changing text such as an ‘about’ page. With a basic page, the salient information is the title and the content.
Blog entry
Blog entries record a “stream” of posts from users on the CMS and store title, author, content, and date as relevant information.
Photo
Photos participate in photo galleries, and store title, description, author, and date along with the actual photo binary data.

This solution does not describe schema or process for storing or using navigational and organizational information.

Schema

Although documents in the nodes collection contain content of different types, all documents have a similar structure and a set of common fields. Consider the following prototype document for a “basic page” node type:

{
    _id: ObjectId(…),
    nonce: ObjectId(…),
    metadata: {
        type: 'basic-page'
        section: 'my-photos',
        slug: 'about',
        title: 'About Us',
        created: ISODate(...),
        author: { _id: ObjectId(…), name: 'Rick' },
        tags: [ ... ],
        detail: { text: '# About Us\n…' }
    }
}

Most fields are descriptively titled. The section field identifies groupings of items, as in a photo gallery, or a particular blog . The slug field holds a URL-friendly unique representation of the node, usually that is unique within its section for generating URLs.

All documents also have a detail field that varies with the document type. For the basic page above, the detail field might hold the text of the page. For a blog entry, the detail field might hold a sub-document. Consider the following prototype:

{
    …
    metadata: {
        …
        type: 'blog-entry',
        section: 'my-blog',
        slug: '2012-03-noticed-the-news',
        …
        detail: {
            publish_on: ISODate(…),
            text: 'I noticed the news from Washington today…'
        }
    }
}

Photos require a different approach. Because photos can be potentially larger than these documents, it’s important to separate the binary photo storage from the nodes metadata.

GridFS provides the ability to store larger files in MongoDB. GridFS stores data in two collections, in this case, cms.assets.files, which stores metadata, and cms.assets.chunks which stores the data itself. Consider the following prototype document from the cms.assets.files collection:

{
    _id: ObjectId(…),
    length: 123...,
    chunkSize: 262144,
    uploadDate: ISODate(…),
    contentType: 'image/jpeg',
    md5: 'ba49a...',
    metadata: {
        nonce: ObjectId(…),
        slug: '2012-03-invisible-bicycle',
        type: 'photo',
        section: 'my-album',
        title: 'Kitteh',
        created: ISODate(…),
        author: { _id: ObjectId(…), name: 'Jared' },
        tags: [ … ],
        detail: {
            filename: 'kitteh_invisible_bike.jpg',
            resolution: [ 1600, 1600 ], … }
    }
}

Note

This document embeds the basic node document fields, which allows you to use the same code to manipulate nodes, regardless of type.

Operations

This section outlines a number of common operations for building and interacting with the metadata and asset layer of the cms for all node types. All examples in this document use the Python programming language and the PyMongo driver for MongoDB, but you can implement this system using any language you choose.

Create and Edit Content Nodes

Procedure

The most common operations inside of a CMS center on creating and editing content. Consider the following insert() operation:

db.cms.nodes.insert({
    'nonce': ObjectId(),
    'metadata': {
        'section': 'myblog',
        'slug': '2012-03-noticed-the-news',
        'type': 'blog-entry',
        'title': 'Noticed in the News',
        'created': datetime.utcnow(),
        'author': { 'id': user_id, 'name': 'Rick' },
        'tags': [ 'news', 'musings' ],
        'detail': {
            'publish_on': datetime.utcnow(),
            'text': 'I noticed the news from Washington today…' }
        }
     })

Once inserted, your application must have some way of preventing multiple concurrent updates. The schema uses the special nonce field to help detect concurrent edits. By using the nonce field in the query portion of the update operation, the application will generate an error if there is an editing collision. Consider the following update

def update_text(section, slug, nonce, text):
    result = db.cms.nodes.update(
        { 'metadata.section': section,
          'metadata.slug': slug,
          'nonce': nonce },
        { '$set':{'metadata.detail.text': text, 'nonce': ObjectId() } },
        w=1)
    if not result['updatedExisting']:
        raise ConflictError()

You may also want to perform metadata edits to the item such as adding tags:

db.cms.nodes.update(
    { 'metadata.section': section, 'metadata.slug': slug },
    { '$addToSet': { 'tags': { '$each': [ 'interesting', 'funny' ] } } })

In this example the $addToSet operator will only add values to the tags field if they do not already exist in the tags array, there’s no need to supply or update the nonce.

Index Support

To support updates and queries on the metadata.section, and metadata.slug, fields and to ensure that two editors don’t create two documents with the same section name or slug. Use the following operation at the Python/PyMongo console:

>>> db.cms.nodes.ensure_index([
...    ('metadata.section', 1), ('metadata.slug', 1)], unique=True)

The unique=True option prevents to documents from colliding. If you want an index to support queries on the above fields and the nonce field create the following index:

>>> db.cms.nodes.ensure_index([
...    ('metadata.section', 1), ('metadata.slug', 1), ('nonce', 1) ])

However, in most cases, the first index will be sufficient to support these operations.

Upload a Photo

Procedure

To update a photo object, use the following operation, which builds upon the basic update procedure:

def upload_new_photo(
    input_file, section, slug, title, author, tags, details):
    fs = GridFS(db, 'cms.assets')
    with fs.new_file(
        content_type='image/jpeg',
        metadata=dict(
            type='photo',
            locked=datetime.utcnow(),
            section=section,
            slug=slug,
            title=title,
            created=datetime.utcnow(),
            author=author,
            tags=tags,
            detail=detail)) as upload_file:
        while True:
            chunk = input_file.read(upload_file.chunk_size)
            if not chunk: break
            upload_file.write(chunk)
    # unlock the file
    db.assets.files.update(
        {'_id': upload_file._id},
        {'$set': { 'locked': None } } )

Because uploading the photo spans multiple documents and is a non-atomic operation, you must “lock” the file during upload by writing datetime.utcnow() in the record. This helps when there are multiple concurrent editors and lets the application detect stalled file uploads. This operation assumes that, for photo upload, the last update will succeed:

def update_photo_content(input_file, section, slug):
    fs = GridFS(db, 'cms.assets')

    # Delete the old version if it's unlocked or was locked more than 5
    #    minutes ago
    file_obj = db.cms.assets.find_one(
        { 'metadata.section': section,
          'metadata.slug': slug,
          'metadata.locked': None })
    if file_obj is None:
        threshold = datetime.utcnow() - timedelta(seconds=300)
        file_obj = db.cms.assets.find_one(
            { 'metadata.section': section,
              'metadata.slug': slug,
              'metadata.locked': { '$lt': threshold } })
    if file_obj is None: raise FileDoesNotExist()
    fs.delete(file_obj['_id'])

    # update content, keep metadata unchanged
    file_obj['locked'] = datetime.utcnow()
    with fs.new_file(**file_obj):
        while True:
            chunk = input_file.read(upload_file.chunk_size)
            if not chunk: break
            upload_file.write(chunk)
    # unlock the file
    db.assets.files.update(
        {'_id': upload_file._id},
        {'$set': { 'locked': None } } )

As with the basic operations, you can use a much more simple operation to edit the tags:

db.cms.assets.files.update(
    { 'metadata.section': section, 'metadata.slug': slug },
    { '$addToSet': { 'metadata.tags': { '$each': [ 'interesting', 'funny' ] } } })

Index Support

Create a unique index on { metadata.section: 1, metadata.slug: 1 } to support the above operations and prevent users from creating or updating the same file concurrently. Use the following operation in the Python/PyMongo console:

>>> db.cms.assets.files.ensure_index([
...    ('metadata.section', 1), ('metadata.slug', 1)], unique=True)

Locate and Render a Node

To locate a node based on the value of metadata.section and metadata.slug, use the following find_one operation.

node = db.nodes.find_one({'metadata.section': section, 'metadata.slug': slug })

Note

The index defined (section, slug) created to support the update operation, is sufficient to support this operation as well.

Locate and Render a Photo

To locate an image based on the value of metadata.section and metadata.slug, use the following find_one operation.

fs = GridFS(db, 'cms.assets')
with fs.get_version({'metadata.section': section, 'metadata.slug': slug }) as img_fpo:
     # do something with the image file

Note

The index defined (section, slug) created to support the update operation, is sufficient to support this operation as well.

Search for Nodes by Tag

Querying

To retrieve a list of nodes based on their tags, use the following query:

nodes = db.nodes.find({'metadata.tags': tag })

Indexing

Create an index on the tags field in the cms.nodes collection, to support this query:

>>> db.cms.nodes.ensure_index('tags')

Search for Images by Tag

Procedure

To retrieve a list of images based on their tags, use the following operation:

image_file_objects = db.cms.assets.files.find({'metadata.tags': tag })
fs = GridFS(db, 'cms.assets')
for image_file_object in db.cms.assets.files.find(
    {'metadata.tags': tag }):
    image_file = fs.get(image_file_object['_id'])
    # do something with the image file

Indexing

Create an index on the tags field in the cms.assets.files collection, to support this query:

>>> db.cms.assets.files.ensure_index('tags')

Generate a Feed of Recently Published Blog Articles

Querying

Use the following operation to generate a list of recent blog posts sorted in descending order by date, for use on the index page of your site, or in an .rss or .atom feed.

articles = db.nodes.find({
    'metadata.section': 'my-blog'
    'metadata.published': { '$lt': datetime.utcnow() } })
articles = articles.sort({'metadata.published': -1})

Note

In many cases you will want to limit the number of nodes returned by this query.

Indexing

Create a compound index on the { metadata.section: 1, metadata.published: 1 } fields to support this query and sort operation.

>>> db.cms.nodes.ensure_index(
...     [ ('metadata.section', 1), ('metadata.published', -1) ])

Note

For all sort or range queries, ensure that field with the sort or range operation is the final field in the index.

Sharding

In a CMS, read performance is more critical than write performance. To achieve the best read performance in a sharded cluster, ensure that the mongos can route queries to specific shards.

Also remember that MongoDB can not enforce unique indexes across shards. Using a compound shard key that consists of metadata.section and metadata.slug, will provide the same semantics as describe above.

Warning

Consider the actual use and workload of your cluster before configuring sharding for your cluster.

Use the following operation at the Python/PyMongo shell:

>>> db.command('shardCollection', 'cms.nodes', {
...     key : { 'metadata.section': 1, 'metadata.slug' : 1 } })
{ "collectionsharded": "cms.nodes", "ok": 1}
>>> db.command('shardCollection', 'cms.assets.files', {
...     key : { 'metadata.section': 1, 'metadata.slug' : 1 } })
{ "collectionsharded": "cms.assets.files", "ok": 1}

To shard the cms.assets.chunks collection, you must use the _id field as the shard key. The following operation will shard the collection

>>> db.command('shardCollection', 'cms.assets.chunks', {
...     key : { 'files_id': 1 } })
{ "collectionsharded": "cms.assets.chunks", "ok": 1}

Sharding on the files_id field ensures routable queries because all reads from GridFS must first look up the document in cms.assets.files and then look up the chunks separately.