Introducing a huMONGOus Database

Archiving, searching and processing the explosion of data that applications generate today means coming up with nontraditional ways of dealing with that data. NoSQL solutions offer intriguing and unique ways of handling the volumes of data available to us. One of them is MongoDB, an open source, distributed, document-oriented database from 10gen.

MongoDB straddles the NoSQL space nicely. A low barrier to entry and great performance help MongoDB continue to gain followers. However, like all database solutions, MongoDB will not solve all of your problems. You need to know when and how to use it properly and more importantly, when not to use it.

Oddly Familiar

MongoDB stores your data in JSON-style documents, which it serializes in a binary format known as BSON (Binary JSON). This makes it part of the document-oriented class of NoSQL solutions.

Document-oriented solutions can take some getting used to. Traditional RDBMS systems house their data in well-defined schemas, which are represented as tables. Each table definition comprises various columns, which effectively define the data model. Each time data is inserted into a table, a new row is created. This data can be queried, updated, deleted and inserted using Structured Query Language (SQL).

MongoDB, on the other hand, stores its data in collections rather than tables. Collections are made up of JSON-style documents instead of rows, and each document consists of key/value pairs, essentially a JSON hash. Unlike a traditional RDBMS, MongoDB does not enforce a strict data schema: it doesn’t care if a key/value pair appears in one document but not in another, so each document in the same collection can have its own structure if needed. See Listing 1 for an example of this flexibility in action.

An example of schemaless flexibility, Listing 1 has four MongoDB documents that reside in the same collection. Each document has an _id and name key, but the similarities stop there. A couple of documents contain a platform key. One document has a Twitter key and some have operating_system and homepage keys.
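To make that description concrete, here is a hypothetical stand-in for such a collection, written as a plain JavaScript array. It is not the article's Listing 1: every name and value is invented, and the _id values are simplified to integers. Only the pattern of shared and differing keys follows the description above.

```javascript
// Hypothetical documents: all values are made up for illustration.
const documents = [
  { _id: 1, name: "Document A", platform: "Windows" },
  { _id: 2, name: "Document B", platform: "Linux",
    operating_system: "Ubuntu", homepage: "http://example.com/b" },
  { _id: 3, name: "Document C", twitter: "@example_c" },
  { _id: 4, name: "Document D",
    operating_system: "OS X", homepage: "http://example.com/d" },
];

// Keys that appear in every document.
const sharedKeys = Object.keys(documents[0]).filter((key) =>
  documents.every((doc) => key in doc)
);

// How many documents carry each key.
const keyCounts = {};
for (const doc of documents) {
  for (const key of Object.keys(doc)) {
    keyCounts[key] = (keyCounts[key] || 0) + 1;
  }
}

console.log(sharedKeys); // only _id and name are shared
console.log(keyCounts);
```

Nothing in the array declares which keys are allowed, which is the same freedom a MongoDB collection gives you.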

A powerful command-line shell comes bundled with MongoDB. You use the MongoDB shell for managing the server, setting up authentication and everything else you need. Third-party graphical administration tools do exist for MongoDB and you’ll find them at http://www.mongodb.org/display/DOCS/Admin+UIs. These tools provide quick lookup and management capabilities for your MongoDB system.

MongoDB uses JavaScript in the MongoDB shell, so everything you do in the shell is JavaScript. When you enter a command without its parentheses, the shell displays the source code for that command instead of running it. Having JavaScript available can be really powerful, whether you are writing MapReduce queries or creating custom functions you want to use in the shell.
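Seeing the source of a command is ordinary JavaScript behavior: a function named without parentheses is just a value, and its string form is its source code. A minimal sketch in plain JavaScript, where countRightHanded is a hypothetical helper and not part of MongoDB:

```javascript
// Hypothetical helper, invented for this illustration.
function countRightHanded(players) {
  return players.filter((player) => player.bats === "right").length;
}

// Without parentheses the function is a value; converting it to a string
// gives back its source code.
const source = String(countRightHanded);
console.log(source);

// With parentheses the function is actually called.
const total = countRightHanded([{ bats: "right" }, { bats: "left" }]);
console.log(total);
```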

Installation

This article uses MongoDB 2.0.2 for all of the examples, but version 2.0 or later should work. You can download the latest version from http://www.mongodb.org/downloads. The MongoDB shell that comes bundled in the download will be sufficient for all examples.

After downloading and unzipping the binary you will need to create a path for the data to be stored. The default location is /data/db. MongoDB does not create this directory, so you need to create it yourself. After you’ve created the directory you can start MongoDB by simply calling:

/path/to/mongofiles/bin/mongod

If the installation went as it should, MongoDB is running and the real fun begins.

One Document to Rule Them All

The examples will use Major League Baseball players and their statistics. The database will be named mlb and contains a collection called players. The documents in players hold the individual player details and statistics.

All of the examples in this article will use the MongoDB shell that comes bundled with the download. The shell uses JavaScript for all commands and functions.

Now you’re going to start the Mongo shell so that you can interact with the MongoDB instance. To start the MongoDB shell, open a terminal/command prompt and navigate to the location in which you installed MongoDB. Then type the following commands:

/path/to/mongofiles/bin/mongo
use mlb;

This launched the Mongo shell and switched to a database named mlb. MongoDB does not actually create the database until you store data in it. In the shell, the db variable refers to the current database, so prefixing each method call with db tells MongoDB to run it against mlb.

You will now create a new collection and insert a new JSON document into that collection. In your terminal/command window, enter the following:

db.players.insert({
   name: "Harry Blanco",
   age: 40,
   bats: "right",
   throws: "left",
   years_played: [
      {year: 2011, team: "Arizona Diamondbacks"},
      {year: 2010, team: "New York Mets"}
   ]}
);
db.players.find();

You have now created a collection named players and inserted a document with the details of Harry Blanco. MongoDB does not require a predefined schema so your collection was created for free just by referencing it in the command. Similar to the flexibility with key/value pairs in documents, you can just as easily create new collections. The flexibility of schemaless design comes with a caveat; typos will create new collections or keys if you are not careful. You can see your new collection by entering the show collections command.

Retrieving Documents

db.players.find() returns all the documents in the collection. In our case there is only one document to return. If you look at the output you will see a key we did not insert, labeled _id. _id holds the MongoDB ObjectId that is assigned to each document when you do not supply an _id yourself. Think of it as playing the role of an auto-incrementing primary key in a SQL system, although the value is not a simple counter. The ObjectId is created using a combination of items which go beyond the scope of this article, but have to do with MongoDB’s clustering and sharding capabilities.
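One part of that combination is easy to inspect: the first four bytes of an ObjectId hold its creation time in seconds since the Unix epoch. The sketch below decodes that prefix in plain JavaScript. The 24-character hex string is a made-up value, not one produced by the examples in this article, and objectIdTimestamp is a hypothetical helper name.

```javascript
// Hypothetical helper: reads the leading 4 bytes (8 hex characters) of an
// ObjectId string as seconds since the Unix epoch.
function objectIdTimestamp(hexId) {
  const seconds = parseInt(hexId.substring(0, 8), 16);
  return new Date(seconds * 1000);
}

// Made-up ObjectId; only the first 8 characters matter here.
const exampleId = "4f000000a1b2c3d4e5f60718";
const createdAt = objectIdTimestamp(exampleId);

console.log(createdAt.toISOString()); // 2012-01-01T06:41:04.000Z
```

Because the timestamp comes first, ObjectIds generated later sort after earlier ones to within a second, even though they are not sequential integers.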

MongoDB does not support joins like a typical SQL solution. Instead MongoDB uses embedded documents. Our insert command contained an array of embedded documents called years_played. Embedded documents allow you to make a single query and return all the data back without joining to other tables. Even better, you can query against those embedded documents and create indexes on keys contained within them.

Diving for Data

Go add a couple more players (Listing 2) to the players collection. After inserting the additional documents, run db.players.count() in the console/terminal to return the total number of documents in the collection.

db.players.count();

The shell simply shows 3. Now how about all of the players that bat right-handed?

db.players.find({ bats: "right"});

The query returns two documents containing Harry and Konrad. What about players that played in the year 2011?

db.players.find({ "years_played.year": 2011});

As mentioned earlier in the article, you can query by keys contained in embedded documents. In this instance you use years_played.year to dig down into the embedded documents located at years_played key and look for the child key year. What if you wanted to know who played on the Diamondbacks in the year 2010? You may be tempted to write this:

db.players.find({
   "years_played.year": 2010,
   "years_played.team": "Arizona Diamondbacks"
});

If you ran the query above you would notice it returned all three players. Unfortunately this is not what you want: you know from the document you inserted earlier that Harry Blanco played for the Mets in 2010. The query returns all three players because each condition is checked against the years_played array as a whole. A player matches as long as some entry has the year 2010 and some entry, not necessarily the same one, has the Arizona Diamondbacks as the team. Getting the correct results requires use of a special operator.

MongoDB offers several features for more advanced queries, including the $elemMatch operator. $elemMatch ensures that a single element of the array matches all of our criteria at once. Now try that query again using the $elemMatch operator.

db.players.find({
   "years_played":{
      $elemMatch: {
         year: 2010,
         team: "Arizona Diamondbacks"
      }
   }
});

That’s better. The query now returns only Konrad and Miguel as expected. You can find a list of the features offered for more advanced queries at http://www.mongodb.org/display/DOCS/Advanced+Queries.
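The difference between the two queries can be sketched in plain JavaScript. This mimics the matching rules only; it is not how MongoDB implements them. Harry Blanco's seasons come from the insert earlier in the article, while Miguel Montero's single season is an assumption made for the example, and dotNotationMatch and elemMatch are hypothetical helper names.

```javascript
const players = [
  { name: "Harry Blanco", years_played: [
    { year: 2011, team: "Arizona Diamondbacks" },
    { year: 2010, team: "New York Mets" },
  ] },
  // Assumed data for illustration only.
  { name: "Miguel Montero", years_played: [
    { year: 2010, team: "Arizona Diamondbacks" },
  ] },
];

// Dot-notation style: each condition may be satisfied by a different element.
function dotNotationMatch(player) {
  return player.years_played.some((s) => s.year === 2010) &&
         player.years_played.some((s) => s.team === "Arizona Diamondbacks");
}

// $elemMatch style: one element must satisfy every condition.
function elemMatch(player) {
  return player.years_played.some(
    (s) => s.year === 2010 && s.team === "Arizona Diamondbacks"
  );
}

const dotNames = players.filter(dotNotationMatch).map((p) => p.name);
const elemNames = players.filter(elemMatch).map((p) => p.name);

console.log(dotNames);  // [ 'Harry Blanco', 'Miguel Montero' ]
console.log(elemNames); // [ 'Miguel Montero' ]
```

Harry Blanco passes the first check because his 2010 season and his Diamondbacks season are two different array entries.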

Updates, Upserts and Document Replacement

Updating documents in your collection is not as straightforward as in SQL. For example, if you wanted to correct Miguel Montero’s age to 28 and record that he throws right-handed, you can’t simply do this:

db.players.update({"name": "Miguel Montero"},
            {age:28, throws: "right"}
);

The document-oriented nature of MongoDB causes this update to replace the entire current document with the new values, removing any keys you did not provide in the update. MongoDB replaces the document because it does not track the key/values in the document the way an RDBMS does for columns. Using another special operator, you can avoid replacing the whole document.

db.players.update({"name": "Miguel Montero"},
            {$set: {age:28, throws: "right"}}
);

The addition of $set in the update function informs MongoDB to update the keys in the document and not to remove the rest. Look up Miguel and you will see the age and throws values updated while leaving the existing data intact.
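The two outcomes can be compared side by side with plain JavaScript objects. This is only a sketch of the resulting documents, not of MongoDB's update machinery, and the starting values for Miguel Montero (including the integer _id) are assumptions made for the example.

```javascript
// Assumed starting document, for illustration only.
const original = {
  _id: 1,
  name: "Miguel Montero",
  age: 27,
  bats: "left",
  throws: "left",
};

const changes = { age: 28, throws: "right" };

// Update without $set: the document becomes the new values (the _id is kept).
const replaced = { _id: original._id, ...changes };

// Update with $set: the new values are merged into the existing document.
const patched = { ...original, ...changes };

console.log(replaced); // name and bats are gone
console.log(patched);  // name and bats survive, age and throws are updated
```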

Passing true as the third argument to the update function tells MongoDB to do something called an upsert. As the name suggests, an upsert performs an update if a matching document is found and an insert if it is not. Upserts are atomic and useful in a system where processes can be modifying and creating items at the same time.

db.players.update(
   {name: "Geoff Blum"},
   {name: "Geoff Blum",
   age: 38,
   bats: "right",
   throws: "right",
   years_played: [
      {year: 2011, team: "Arizona Diamondbacks"},
      {year: 2010, team: "Houston Astros"}
   ]},
   true
);

Run the code above and run db.players.find(). A new document containing Geoff Blum and his details will be returned. Assume that you made a mistake in your initial insert and realized that Geoff Blum actually bats switch so you’ll want to run that command again but change him to a switch hitter.

db.players.update(
   {name: "Geoff Blum"},
   {name: "Geoff Blum",
   age: 38,
   bats: "switch",
   throws: "right",
   years_played: [
      {year: 2011, team: "Arizona Diamondbacks"},
      {year: 2010, team: "Houston Astros"}
   ]},
   true
);

Query your collection again and you will see he was not added a second time but his record was updated to reflect his ability to switch hit.
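The update-or-insert decision can be sketched over a plain JavaScript array. The upsert function below is a hypothetical helper written for this illustration, and unlike MongoDB's upsert it makes no attempt to be atomic.

```javascript
// Hypothetical helper: replace the first matching document, or add the
// document when nothing matches. Returns which action was taken.
function upsert(collection, matches, doc) {
  const index = collection.findIndex(matches);
  if (index === -1) {
    collection.push(doc);
    return "inserted";
  }
  collection[index] = doc;
  return "updated";
}

const players = [];
const isGeoffBlum = (player) => player.name === "Geoff Blum";

// First call: no match, so the document is inserted.
const first = upsert(players, isGeoffBlum, { name: "Geoff Blum", bats: "right" });

// Second call: the existing document is replaced, not duplicated.
const second = upsert(players, isGeoffBlum, { name: "Geoff Blum", bats: "switch" });

console.log(first, second, players.length); // inserted updated 1
```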

House Cleaning

For most developers and DBAs, learning MongoDB requires a paradigm shift. Learning to think about things differently can cause you to make mistakes, but they are easy enough to clean up in MongoDB. The remove command lets you delete documents based on the criteria you provide. For example, you can remove Miguel Montero with the following command:

db.players.remove({"name": "Miguel Montero"});

Calling remove without any criteria removes every document in the collection.

db.players.remove();

Or you can just drop the whole collection.

db.players.drop();

All Star Lineup of Advanced Features

MongoDB goes well beyond being a document-oriented data store with special tricks for accessing and updating your data. For example, MongoDB has built-in clustering called Replica Sets. MongoDB can also shard across multiple servers and allows you to write MapReduce functions.

Replica Sets are a solution in which three or more servers operate with a single primary and multiple secondaries/arbiters. If you lose your primary for any reason, a vote will take place and one of the secondaries will be promoted, minimizing downtime and data loss. While you can always run MongoDB in a standalone solution, you should always use a Replica Set in your production environments.

When your app collects more data than you know what to do with, you will find yourself needing to shard across multiple servers. MongoDB gives you that option out of the box. Its sharding capabilities let you scale with your growth by setting up multiple Replica Sets and spreading your data across them. Sharding can be useful when your inserts are coming faster than your disk can handle, when your collection reaches MongoDB’s 10TB collection limit, or when you need more read performance and want to share the load across servers.

MongoDB’s MapReduce functionality takes input from one collection and dumps the results out in another. You can use MapReduce for batch processing, data aggregation and performing analysis across large data sets. Write your MapReduce functions in JavaScript and you can have them included by default when launching the Mongo Shell.
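The shape of a MapReduce job can be sketched in plain JavaScript over an in-memory array. This shows only the map, group and reduce steps; it does not use MongoDB's mapReduce command, where the map function reads the document from this and calls a global emit. The two players are taken from the examples earlier in the article, and the job counts player-seasons per team.

```javascript
const players = [
  { name: "Harry Blanco", years_played: [
    { year: 2011, team: "Arizona Diamondbacks" },
    { year: 2010, team: "New York Mets" },
  ] },
  { name: "Geoff Blum", years_played: [
    { year: 2011, team: "Arizona Diamondbacks" },
    { year: 2010, team: "Houston Astros" },
  ] },
];

// Map step: emit one (team, 1) pair for every season a player logged.
function map(player, emit) {
  for (const season of player.years_played) {
    emit(season.team, 1);
  }
}

// Reduce step: collapse all values emitted for one key into a single total.
function reduce(team, counts) {
  return counts.reduce((sum, n) => sum + n, 0);
}

// Group emitted values by key, then reduce each group.
const grouped = new Map();
for (const player of players) {
  map(player, (key, value) => {
    if (!grouped.has(key)) grouped.set(key, []);
    grouped.get(key).push(value);
  });
}

const seasonsPerTeam = {};
for (const [team, counts] of grouped) {
  seasonsPerTeam[team] = reduce(team, counts);
}

console.log(seasonsPerTeam);
```

In MongoDB the totals would be written to an output collection rather than held in a local object.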

You can find more information on all of these features and more at http://www.mongodb.org/display/DOCS/Home

Where to Next?

I encourage you to look through the operators linked to earlier and see what other queries you could run. MongoDB allows JavaScript, so be creative. It takes time to fully appreciate the possibilities MongoDB opens up. Look at your existing apps and imagine how you could structure your data in MongoDB.

Listing 1: Sample Mongo documents

Listing 2: Inserting additional players