When I started my company Attachments.me three years ago, NoSQL was a hot topic. Advocates preached impressive benefits:

  • NoSQL technologies, such as Cassandra, MongoDB, and CouchDB, offered replication and sharding out of the gate, making scaling headaches a thing of the past.
  • Schemaless modeling might make painful data-migrations a thing of the past. NoSQL allowed a developer to escape the tyranny of the DBA. These were technologies built for developers, by developers.

Advocates of SQL databases (crusty old DBAs, as I perceived them at the time) had their own arguments:

  • Technologies like MongoDB did not support transactions; furthermore, writes were deferred. How could anyone trust that data was persisted?
  • SQL databases had benefited from decades of industry deployment and academic research. Why are people so quick to disregard this?

In my career, I had witnessed a clear divide between the system administrator and the developer. I wanted to avoid this in my own company: I loved the idea of a database designed with the developer in mind, and I was biased against the counter-arguments of individuals whom, at the time, I perceived as contributing to this divide. These facts helped motivate my decision to use MongoDB as our primary database.

Several years after the fact, do I regret my choice to use MongoDB? No, but the experience hasn't been all sunshine and roses:

  • NoSQL databases are not a magic bullet: you will have scaling problems, they will fall over, and they are difficult technologies to understand.
  • SQL databases, such as PostgreSQL, Oracle, and MySQL, are incredible pieces of technology. They benefit from decades of research. You should think long and hard before disregarding these battle-hardened technologies for NoSQL solutions.
  • Scaling an application to millions of users is, and always will be, a difficult problem.

In the argument of NoSQL vs. SQL, there are no absolutes. MongoDB has its warts, but it's a great technology. I'm excited to introduce you to it!

I love hands-on examples. Throughout this article, I'll be modeling a music store. It will have entities such as albums, bands, artists, and record labels. Using this as a foundation, I'll discuss how MongoDB differs from SQL technologies, examine its powerful query DSL, and discuss scaling MongoDB. I'll look at domains in which MongoDB excels, and discuss limitations that sometimes make it a sub-optimal technology choice.

It's my goal to give a thorough, unbiased introduction to MongoDB, without pulling any punches.

NoSQL at 30,000 Feet

MongoDB is a document-oriented database. How does it differ from a table-based technology, such as PostgreSQL? Take two SQL databases, say, for example, PostgreSQL and MySQL. I could endlessly enumerate their slight variations. Cutting through the noise, I see MongoDB's schemaless data-representation and its lack of joins as the most important differentiation between it and a SQL-based technology.

MongoDB is Schemaless

In a SQL database, a table's columns are clearly defined. When a row is populated inside a table, you can be assured that each column will be present and that it will have a value that adheres to the schema definition, e.g., the age field will always contain an integer. In a MongoDB collection, two documents can have different fields, and the same field can have different datatypes. Take the following collection as an example:

Note: Throughout this article, I will use JavaScript code-samples to illustrate model structure, and the data stored within MongoDB. MongoDB's CLI is powered by JavaScript, making this a natural way to present examples.

var artists = [
    {
        first: "David",
        last: "Bowie",
        age: "sixty-six"
    },
    {
        first: "John",
        middle: "Winston",
        last: "Lennon",
        age: 30
    }
]

This data is valid in MongoDB, even though the format of age varies, and the middle name field is only present in one of the two documents. With MongoDB, the onus is ultimately on the developer to ensure data-integrity.
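
To make that responsibility concrete, here is a minimal sketch of an application-side check that could run before documents are saved. The function name validateArtist and the rules it enforces are my own assumptions for illustration; nothing here is part of MongoDB or its drivers.

```javascript
// Hypothetical validator: returns a list of problems found in an artist
// document, or an empty list when the document looks sane.
function validateArtist(doc) {
    var errors = [];
    if (typeof doc.first !== 'string') errors.push('first must be a string');
    if (typeof doc.last !== 'string') errors.push('last must be a string');
    // middle is optional, but must be a string when it is present.
    if ('middle' in doc && typeof doc.middle !== 'string') {
        errors.push('middle must be a string');
    }
    if (!Number.isInteger(doc.age)) errors.push('age must be an integer');
    return errors;
}

var artists = [
    {first: 'David', last: 'Bowie', age: 'sixty-six'},
    {first: 'John', middle: 'Winston', last: 'Lennon', age: 30}
];

// Report the outcome for each document in the example collection.
artists.forEach(function (artist) {
    var errors = validateArtist(artist);
    console.log(artist.last + ': ' + (errors.length ? errors.join(', ') : 'ok'));
});
```

With the two documents above, only the first is flagged, because its age is a string rather than an integer.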

MongoDB Lacks Joins

MongoDB is non-relational. To illustrate what this means, let's start thinking about the music database that will be modeled throughout this article. Suppose that I have two entities: albums and bands. An album can have multiple bands, and a band can produce multiple albums. In a SQL paradigm, here's how we might model this relationship:

  • Create a band table that has columns such as id, name, genre, and formation-date.
  • Create an album table that has columns such as title, label, cover-art, etc.
  • Create the relational table band_produced_album, that joins bands to albums on their respective id fields.
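
As a rough sketch of what that join does, the three tables can be imitated with plain JavaScript arrays and the join written by hand. The sample rows, the numeric ids, and the function name albumTitlesForBand are assumptions made purely for illustration.

```javascript
// Plain arrays standing in for the three SQL tables (illustrative rows only).
var bandTable = [{id: 1, name: 'Pink Floyd'}];
var albumTable = [
    {id: 10, title: 'Dark Side of the Moon'},
    {id: 11, title: 'The Wall'}
];
var bandProducedAlbum = [
    {band_id: 1, album_id: 10},
    {band_id: 1, album_id: 11}
];

// Hand-written equivalent of joining band -> band_produced_album -> album.
function albumTitlesForBand(bandName) {
    var band = bandTable.find(function (b) { return b.name === bandName; });
    if (!band) return [];
    return bandProducedAlbum
        .filter(function (row) { return row.band_id === band.id; })
        .map(function (row) {
            var album = albumTable.find(function (a) { return a.id === row.album_id; });
            return album.title;
        });
}

console.log(albumTitlesForBand('Pink Floyd'));
```

Every lookup has to touch all three structures, which is the work a relational database performs for you when it executes a join.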

Because MongoDB does not support joins, I instead model albums in what is called de-normalized form. This refers to storing fields from one entity in your domain, on another entity in the domain, e.g., storing a copy of a document representing a band, within a document representing an album:

var albums = [{
    title: 'Hunky Dory',
    label: 'RCA',
    band: {
        name: 'Ziggy Stardust',
        genre: ['glam-rock']
    }
}]

Rather than joining the data from two separate collections, I store a representation of the band inside the album document. A de-normalized form has the benefit of reducing the computational overhead associated with performing a join, but it can make maintaining data-integrity difficult. If a modification is made to the canonical band document, e.g., a member leaves the band, I must ensure that every inner-band-object is updated.

Now that you've explored the conceptual differences between modeling for NoSQL databases, such as MongoDB, vs. modeling for relational-databases, let's dive into a real modeling problem.

Modeling in MongoDB

MongoDB supports the primitive datatypes that you might expect: floats, doubles, integers, dates, and strings. It also supports several special datatypes, including regular expressions, symbols, code, binary data, and ObjectIDs.

In the music database, the models will consist of primitive datatypes, inner-arrays, and inner-objects. I will, however, use ObjectIDs as the primary keys. An ObjectID is a 96-bit value, which is guaranteed to be unique, even when generated concurrently on different nodes in a distributed system. This becomes important later on, if it becomes necessary to scale MongoDB into a distributed cluster. Arrays and objects are important building blocks for modeling in MongoDB because they allow us to nest inner-documents de-normalized within parent documents.
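
Those 96 bits are not opaque: the leading four bytes of an ObjectID hold its creation time in seconds since the Unix epoch. The sketch below reads that timestamp from the 24-character hex form used in this article's examples. The helper name objectIdToDate is hypothetical, and the code deliberately ignores the remaining eight bytes.

```javascript
// Reads the creation time out of an ObjectID's 24-character hex form.
// The first 4 bytes (8 hex characters) are seconds since the Unix epoch.
function objectIdToDate(hex) {
    if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
        throw new Error('expected 24 hex characters');
    }
    var seconds = parseInt(hex.slice(0, 8), 16);
    return new Date(seconds * 1000);
}

// The album _id used in the examples throughout this article.
console.log(objectIdToDate('51c26eaa862917d810e13700').toISOString());
```

Because the timestamp leads the value, ObjectIDs generated later sort after earlier ones, which becomes relevant when choosing a shard key.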

Suppose that you'd like to build an online music store using MongoDB. For the MVP, let's assume that there are two primary entities in the database: bands and albums. Here's a beginner's first pass at modeling both entities:

// a first pass at a band model.
var band = {
    name: 'Pink Floyd',
    years_active: [
        new Date('Wed Jun 19 1965'),
        new Date('Wed Jun 19 1995')
    ],
    members: [
        'Roger Waters',
        'David Gilmour',
        'Nick Mason'
    ]
};

// a first pass at an album model.

var album = {
    title: 'Dark Side of the Moon',
    release_date: new Date('March 1 1973'),
    label: 'Capitol',
    downloads: 36,
    genres: ['progressive', 'rock']
};

How should you store these entities? Do you need a collection for both entities? These questions are dependent on the system's requirements. Let's assume the following requirements:

  • When a user visits the music store, they will be presented with a list of popular albums.
  • A user can search for albums by title, band, and genre.
  • A user should be able to look up albums by a particular era, e.g., the 1970s.

Because data is being presented in an album-centric way, the band model lends itself well to being stored as an inner-object on an album document. Here's a first pass at the model structure, given the requirements outlined:

var albums = [{
    _id: ObjectId("51c26eaa862917d810e13700"),
    title: 'Dark Side of the Moon',
    release_date: new Date('March 1 1973'),
    label: 'Capitol',
    downloads: 36,
    bands: [{
        name: 'Pink Floyd',
        years_active: [
            new Date('Wed Jun 19 1965'),
            new Date('Wed Jun 19 1995')
        ],
        members: [
            'Roger Waters',
            'David Gilmour',
            'Nick Mason'
        ],
        genres: ['progressive', 'rock']
    }]
}];

Notice that rather than having their own collections, bands are stored as inner-objects on the album collection.

Suppose that I extended the initial problem definition by adding this requirement:

  • A user should be able to view a list of bands, sorted by name or formation date.

This requirement suggests that you will also need a band collection; it would be difficult otherwise to query for a sorted list of bands to present to a user. Listing 1 shows the form that the modified database structure might take on.

Listing 1: Modeling The Music Store in MongoDB

var albums = [{
    _id: ObjectId('51c26eaa862917d810e13700'),
    title: 'Dark Side of the Moon',
    release_date: new Date('March 1 1973'),
    label: 'Capitol',
    downloads: 36,
    // de-normalized copy of the canonical band document.
    bands: [{
        _id: ObjectId('51c27315694b1a8d60e0e3ad'),
        name: 'Pink Floyd',
        years_active: [
            new Date('Wed Jun 19 1965'),
            new Date('Wed Jun 19 1995')
        ],
        members: [
            'Roger Waters',
            'David Gilmour',
            'Nick Mason'
        ],
        genres: ['progressive', 'rock']
    }]
}];

// the canonical band documents live in their own collection.
var bands = [{
    _id: ObjectId('51c27315694b1a8d60e0e3ad'),
    name: 'Pink Floyd',
    years_active: [
        new Date('Wed Jun 19 1965'),
        new Date('Wed Jun 19 1995')
    ],
    members: [
        'Roger Waters',
        'David Gilmour',
        'Nick Mason'
    ],
    genres: ['progressive', 'rock']
}];

Some things worth noting:

  • You still store a band's inner-object on the album document. Given that there are no joins in MongoDB, this allows you to run queries such as: give me all of the albums produced by bands within the rock genre.
  • The band's inner-object now has an ObjectID associated with it. If the canonical band object is changed, e.g., a member leaves Pink Floyd, this ObjectID can be used to make sure that all dependent documents are updated.
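
Here is a minimal in-memory sketch of that fan-out, assuming plain strings in place of real ObjectIDs and a hypothetical helper named propagateBandChange. Against a live database, the same idea would be expressed as a multi-document update that matches on 'bands._id'.

```javascript
// Hypothetical helper: copy changed fields from the canonical band document
// into every embedded copy that shares its _id. Returns the number of copies
// touched. String ids stand in for ObjectIDs.
function propagateBandChange(albums, band) {
    var updated = 0;
    albums.forEach(function (album) {
        album.bands.forEach(function (inner) {
            if (inner._id === band._id) {
                inner.name = band.name;
                inner.members = band.members.slice();
                updated += 1;
            }
        });
    });
    return updated;
}

var band = {
    _id: '51c27315694b1a8d60e0e3ad',
    name: 'Pink Floyd',
    members: ['Roger Waters', 'David Gilmour', 'Nick Mason']
};

var albums = [{
    title: 'Dark Side of the Moon',
    bands: [{
        _id: '51c27315694b1a8d60e0e3ad',
        name: 'Pink Floyd',
        members: ['Roger Waters', 'David Gilmour', 'Nick Mason']
    }]
}];

// A member leaves the band: change the canonical document, then fan out.
band.members = ['David Gilmour', 'Nick Mason'];
console.log('embedded copies updated:', propagateBandChange(albums, band));
```

The important detail is that the match is made on _id rather than on name, so a band can be renamed without orphaning its embedded copies.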

Having taken a stab at the model for the music store, let's take a look at how you would perform queries on it. This is a great way to figure out whether the design actually holds water.

MongoDB's Query Language

MongoDB has a powerful query language and it would definitely be to your benefit to give this section of the manual a read. It's located here: https://www.mongodb.com/docs/manual/reference/operator/

An important thing to note when performing queries in MongoDB is that you can query within inner documents and inner arrays. For instance, take this collection:

var docs = [{
    _id: ObjectId('51c276d8694b1a8d60e0e3ae'),
    outer: [{'hello': 'world'}]
}];

You can retrieve the document stored in this collection with the query db.docs.find({'outer.hello': 'world'}). The ability to perform queries on inner objects allows you to retrieve data in a similar way to performing a join on two SQL tables.
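
To build some intuition for how a dotted path reaches through inner arrays, here is a simplified sketch written in plain JavaScript. It is my own approximation, not MongoDB's implementation: the names matchesPath and findByPath are hypothetical, only equality is supported, and numeric array positions such as 'years_active.0' are not handled.

```javascript
// Simplified model of dot-notation matching: walk the path one segment at a
// time, and fan out across arrays wherever one is encountered.
function matchesPath(value, segments, expected) {
    if (Array.isArray(value)) {
        return value.some(function (item) {
            return matchesPath(item, segments, expected);
        });
    }
    if (segments.length === 0) return value === expected;
    if (value === null || typeof value !== 'object') return false;
    return matchesPath(value[segments[0]], segments.slice(1), expected);
}

// Returns every document whose dotted path contains the expected value.
function findByPath(docs, path, expected) {
    return docs.filter(function (doc) {
        return matchesPath(doc, path.split('.'), expected);
    });
}

var docs = [
    {_id: 1, outer: [{hello: 'world'}]},
    {_id: 2, outer: [{hello: 'moon'}]}
];

console.log(findByPath(docs, 'outer.hello', 'world'));
```

The same traversal explains why a path like 'bands.genres' can match a single string inside an array that is itself nested in an array of band objects.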

Let's go back to the music store example. Given the requirements I've outlined, let's look at the queries you would need to write:

  • When a user visits the music store, they will be presented with a list of the most popular albums:
// this query returns a list of all albums, in
// descending order based on the number of downloads.
db.albums.find({}).sort({downloads: -1});
  • A user can search for albums by title, band, and genre:
// return albums with title "Dark Side of the Moon".
db.albums.find({title: 'Dark Side of the Moon'});

// return albums by the band Pink Floyd.
db.albums.find({'bands.name': 'Pink Floyd'});

// return albums by genre.
db.albums.find({'bands.genres': 'rock'});
  • A user should be able to look up albums by a particular era, e.g., the 1970s:
// returns a list of albums from the 1970s.
// the range is half-open, so that albums released
// during 1979 are included.
db.albums.find({
    $and: [
        {release_date: {$gte: new Date('1970-01-01')}},
        {release_date: {$lt: new Date('1980-01-01')}}
    ]
});
  • A user should be able to view a list of bands, sorted by name or formation date:
// returns a list of bands sorted in ascending
// alphabetical order on name.
db.bands.find({}).sort({name: 1});

// returns a list of bands sorted in descending
// order by the year they were formed.
db.bands.find({}).sort({'years_active.0': -1});
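
Date ranges are easy to get subtly wrong, so it can help to build them with a small function. The sketch below is one possible approach, using a hypothetical helper named decadeQuery that produces a query object covering a whole decade as a half-open interval. It only constructs the object; passing it to db.albums.find is left to the caller.

```javascript
// Hypothetical helper: build a half-open date range [start, start + 10 years)
// so that every day of the decade, including its final year, is matched.
function decadeQuery(startYear) {
    return {
        release_date: {
            $gte: new Date(Date.UTC(startYear, 0, 1)),
            $lt: new Date(Date.UTC(startYear + 10, 0, 1))
        }
    };
}

var seventies = decadeQuery(1970);
console.log(JSON.stringify(seventies));

// An album from the last day of 1979 falls inside the range.
var lastDay = new Date(Date.UTC(1979, 11, 31));
console.log(lastDay >= seventies.release_date.$gte &&
            lastDay < seventies.release_date.$lt);
```

Using an inclusive lower bound and an exclusive upper bound means adjacent decades never overlap and never leave a gap.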

None of those queries were too painful to write, and this helps to validate that you've created a model that fits the application's requirements. Suppose, however, that the music store is taking off like a rocket ship. What concerns might you run into when it comes time to scale the dataset to millions of documents and the user-base to millions of users?

Scaling MongoDB

MongoDB's built-in sharding support is one of its most touted features. Going back to the music store example, here's a hypothetical scenario:

  • You've been adding a back catalog of hundreds of thousands of albums to the store, and searching for albums by genre is becoming painfully slow.

At this point, sharding should be the last thing on your mind. Until a dataset reaches truly massive scales, a missing index is the most likely culprit for slow queries. Much like adding an index to a column on a SQL table, MongoDB allows you to index fields within documents (including being able to add indexes to the fields of inner objects and arrays). To hunt down slow queries in the music application, you first turn on profiling:

// this command will turn on slow query logging for
// queries longer than 100ms.
db.setProfilingLevel(1, 100);

Once you've isolated a slow query, you can create an index on the offending field. In the album model, it's become apparent that you might need to add an index on the genres field:

// create an ascending index on the genres array
// on an album's inner band object.
db.albums.ensureIndex({'bands.genres': 1});

// explain demonstrates that we're now hitting a
// BTree index, rather than scanning every document.
db.albums.find({'bands.genres': 'rock'}).explain();

// Outputs:

{
    "cursor": "BtreeCursor bands.genres_1"
}
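
If it's unclear why an index helps, the following conceptual sketch may be useful. It uses a JavaScript Map from genre to album titles as a stand-in for an index on 'bands.genres'. This is only an analogy under my own assumptions: MongoDB's indexes are B-trees, and buildGenreIndex is a hypothetical name.

```javascript
// Conceptual sketch only: a Map from genre to album titles stands in for an
// index on 'bands.genres'. Each array element gets its own index entry.
function buildGenreIndex(albums) {
    var index = new Map();
    albums.forEach(function (album) {
        album.bands.forEach(function (band) {
            band.genres.forEach(function (genre) {
                if (!index.has(genre)) index.set(genre, new Set());
                index.get(genre).add(album.title);
            });
        });
    });
    return index;
}

var albums = [
    {title: 'Dark Side of the Moon', bands: [{genres: ['progressive', 'rock']}]},
    {title: 'Hunky Dory', bands: [{genres: ['glam-rock']}]},
    {title: 'The Wall', bands: [{genres: ['progressive', 'rock']}]}
];

var genreIndex = buildGenreIndex(albums);

// A lookup goes straight to the matching entries instead of visiting
// every album in the collection.
console.log(Array.from(genreIndex.get('rock') || []));
```

The cost of building and maintaining the structure is paid on writes, which is why indexes should be added for the queries you actually run rather than for every field.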

As you scale the music store to millions of users, you ascertain that missing indexes are no longer the cause of the scaling headaches. There's an important question that you need to ask next: Is the application write-heavy or read-heavy?

Replication

If the application is read-heavy, replication should be the next thing you consider. Replication refers to having a single master MongoDB server, with one or more slave servers (replicas) associated with it. All writes must be directed at the master server, but reads can be distributed across the replicas. A caveat is that there will be lag as data is propagated from the master server to the slave servers. This can, however, be acceptable for many types of applications. It's not the end of the world if it takes a few hundred milliseconds for a new album to propagate across all of the replicas in the music store's MongoDB cluster, for instance.
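
Replication lag is easier to reason about with a toy model. In the sketch below, which is my own simplification and not MongoDB's replication protocol, the master appends writes to a log and a replica only sees them after an explicit sync. The names createMaster and createReplica are hypothetical.

```javascript
// Toy model: the master records every write in an ordered log.
function createMaster() {
    return {
        log: [],
        write: function (doc) { this.log.push(doc); }
    };
}

// A replica applies the master's log only when sync() is called, so reads
// made before that point return stale data.
function createReplica(master) {
    return {
        applied: 0,
        docs: [],
        sync: function () {
            while (this.applied < master.log.length) {
                this.docs.push(master.log[this.applied]);
                this.applied += 1;
            }
        },
        read: function () { return this.docs.slice(); }
    };
}

var master = createMaster();
var replica = createReplica(master);

master.write({title: 'Dark Side of the Moon'});
console.log('before sync:', replica.read().length);

replica.sync();
console.log('after sync:', replica.read().length);
```

An application that reads from replicas has to tolerate the window between the write and the sync, which is the lag described above.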

Sharding

At the conceptual level, sharding refers to dividing your dataset up over multiple MongoDB servers. You pick a field to act as a division point in your dataset, referred to as a shard key. Based on this shard key, data is distributed across multiple servers.
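
As a minimal sketch of the idea, assume that album title is the shard key and that each shard owns a contiguous range of titles. The chunk boundaries, shard names, and the shardFor function are all hypothetical; a real cluster manages these ranges for you.

```javascript
// Hypothetical range-based routing table: each shard owns the titles that
// sort below its upper bound. A null bound means "everything that remains".
var chunks = [
    {shard: 'shard0', upperBound: 'h'},
    {shard: 'shard1', upperBound: 'p'},
    {shard: 'shard2', upperBound: null}
];

// Returns the name of the shard responsible for a given shard key value.
function shardFor(shardKeyValue) {
    var key = shardKeyValue.toLowerCase();
    for (var i = 0; i < chunks.length; i++) {
        if (chunks[i].upperBound === null || key < chunks[i].upperBound) {
            return chunks[i].shard;
        }
    }
}

['Dark Side of the Moon', 'Hunky Dory', 'The Wall'].forEach(function (title) {
    console.log(title + ' -> ' + shardFor(title));
});
```

A query that includes the shard key can be sent to a single shard, whereas a query that omits it has to be broadcast to all of them.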

There are two scenarios that might ultimately motivate you to shard MongoDB: your dataset has become truly massive and indexes cannot fit in memory any longer, or your application is write-heavy and you need to distribute writes over multiple servers.

For performance reasons, it's important that a database's frequently accessed indices fit into memory. This is especially true for a virtualized environment like Amazon Web Services, where there's more overhead associated with paging data from disk into memory. As a dataset grows, so will the size of its indices. Sharding allows you to divide up a dataset across multiple servers, reducing the amount of memory required on each.

Writes can become a bottleneck in a production MongoDB deployment: in a replica set, only the master accepts writes, which can eventually cause problems in a write-heavy system. Furthermore, MongoDB has an aggressive locking strategy. When data is written into the database, a global read/write lock is obtained. If new data is being rapidly populated, reads slow down significantly, resulting in a bad user experience. Sharding helps address this problem. Multiple shards can be written to concurrently, which improves MongoDB's overall write throughput.

In practice, setting up sharding with MongoDB is non-trivial:

  • Replica sets are created for each shard. A replica set consists of a single master server, and zero or more replica servers.
  • Redundant config servers are deployed. Config servers are servers that maintain meta-information about how data is distributed amongst shards.
  • Mongo's routing processes (mongos) are installed on each of the application servers. Rather than connecting directly to a shard, an application connects through these Mongo daemons.
  • A shard key must be selected for the collection you wish to shard. There are a lot of considerations that need to be taken into account when choosing a shard key: How well does it distribute data? How frequently will queries need to span multiple shards? Will newly populated data be written to a single shard a disproportionate amount of the time?
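
The last of those questions deserves a closer look. The sketch below compares two routing strategies for a steadily increasing key, such as an auto-incrementing id. Everything here is an assumption for illustration: the four-shard cluster, the chunk boundaries, and the multiplicative hash, which is not the hash function MongoDB uses.

```javascript
// Range routing: chunks cover [0, 1000), [1000, 2000), [2000, 3000), and
// [3000, infinity), so ever-increasing ids end up in the last chunk.
function rangeShard(id) {
    return Math.min(Math.floor(id / 1000), 3);
}

// Hashed routing: scramble the id with a multiplicative hash and use the
// top two bits of the 32-bit result to pick one of four shards.
function hashedShard(id) {
    return Math.imul(id, 2654435761) >>> 30;
}

// Counts how many of `count` consecutive ids land on each of the four shards.
function distribution(route, firstId, count) {
    var perShard = [0, 0, 0, 0];
    for (var id = firstId; id < firstId + count; id++) {
        perShard[route(id)] += 1;
    }
    return perShard;
}

console.log('range :', distribution(rangeShard, 5000, 1000));
console.log('hashed:', distribution(hashedShard, 5000, 1000));
```

With range routing, every new document lands on the same shard, so adding servers does nothing for write throughput. Hashing spreads the writes out, at the cost of turning range queries on the key into queries that span every shard.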

Phew, sharding is a complex problem. Luckily, it's something that can (and should) be avoided until your application is gaining some real traction. But before you get to this point, is MongoDB a good technology choice in the first place?

Some Final Thoughts on MongoDB

I'd like to conclude this article with some final thoughts about MongoDB, centered around the question of “when is MongoDB a good fit?”

Big Data

MongoDB excels when applied to large datasets that require queries to span the entire corpus. MongoDB's sharding functionality can be leveraged to handle this class of problem. Craigslist, Foursquare, and Bit.ly are great examples of MongoDB being used to perform queries on a single monolithic dataset.

Multi-Tenant Systems

I would argue that SQL databases lend themselves better to multi-tenant systems, that is, domains where queries tend to be restricted to a single user. Great examples of domains that fit this description are e-commerce sites like PayPal, or productivity tools like Asana. Both of these use SQL-based technologies to store their users' data. Why is it that MongoDB does not lend itself well to this class of problem?

  • Its clustering capabilities are overkill for the scaling requirements that arise from this type of domain.
  • It has not been designed with this class of problem in mind: global write locks mean that a write for one user blocks writes for all other users. There's also a high resource overhead for every collection created, which makes partitioning users across multiple collections difficult.

Developer Friendliness

MongoDB achieves the goal of being developer friendly.

  • It has great documentation and well-written clients for most major languages.
  • Its JavaScript-based query DSL is easy for Web developers to pick up and master.
  • It's easy to install, so you can get up and running with it within the hour.

For these reasons, I think MongoDB is a great technology choice for rapidly prototyping an application. If you're competing in a hackathon, I highly endorse using MongoDB.

In Conclusion

MongoDB is a powerful technology. It has a great query DSL, wonderful developer documentation and libraries, and built-in support for scaling and replication. When it comes time to choose your production database, however, as with any technology, you should do your research and find out whether MongoDB fits your requirements. There's no such thing as a free lunch.