Big Data for the Hopelessly Relational
I've been working with relational databases for a long time. In fact, my very first job as a software engineer waaaaay back when was converting an MS Access database from one very old version to another very old version (I think it was the shiny new Access 2000). I can rattle off the difference between an inner, right, left and full join, I can write stored procedures and functions and triggers and constraints, and on a good day I can even (maybe) remember the difference between first, second and third normal form. So as you can probably imagine, all this buzz around 'Big Data', 'NoSQL' and 'non-relational databases' had me quite confused, not to mention a little worried, about the future of data architecture. I imagined these stores of information just kind of globbed into a big pot, all mixing around with no relationship and no (gasp!) table structure and probably some other junk that fell in along the way, like clumps of dirt and dried leaves and such. In short, without a nice relational data structure, how could you possibly keep anything organized? And don't even get me started on duplication of data! If you've been picturing data relationally for as long as I have, I'm sure you can appreciate my consternation about the future of data storage. I know there are a lot of you out there. But rest easy, for I have finally made some sense of the NoSQL phenomenon and can clear a few things up.
First, let's discuss some misconceptions:
- Misconception: NoSQL is replacing relational databases: Uh, nope. It's just another option in your data storage toolbox. Would you throw out your favorite wooden spatula just because you got one of those fancy new silicone spatulas? No, you'd now have two awesome spatulas instead of just one. (I made some eggs this morning. Spatulas are high up in my mind right now.)
- Misconception: NoSQL is a New Thing©: It's actually been around since the mid 1960s, when the MultiValue database was designed. IBM got into the game with IMS (Information Management System) back in 1966.
- Misconception: NoSQL is better than relational databases: Again, nope. Just like I mentioned in the first point, it's just another tool, and you need to pick the best tool for the job at hand. If you're trying to drive a screw, you wouldn't use your trusty old hammer, would you? No, you'd use your slick power screwdriver. (Unless it was a sledgehammer, then all bets are off.)
- Misconception: If I'm using [insert language here], I have to/can't use a NoSQL database: Regardless of the language you're using, any data backend will do. It just depends on what your needs are.
So let's get into some of the key differences between relational databases and NoSQL databases. A relational database has one or more tables filled with rows of data organized by columns. You can create various kinds of relationships between those tables, as well as apply data constraints and triggers (functions that happen when you perform a CRUD action on a table or view). Let's take the always-popular (heh) database used in pretty much every single MS SQL code example ever, the tried and true Northwind database as an example:
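To make the relational side concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than MS SQL. It assumes a two-table slice shaped like Northwind (Customers and Orders); the column list is trimmed and chosen for illustration, not copied from the real schema.

```python
import sqlite3

# In-memory database, so nothing touches disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default

# Two tables with a relationship: every order must point at an existing customer.
conn.execute(
    "CREATE TABLE Customers ("
    " CustomerID TEXT PRIMARY KEY,"
    " CompanyName TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE Orders ("
    " OrderID INTEGER PRIMARY KEY,"
    " CustomerID TEXT NOT NULL REFERENCES Customers(CustomerID),"
    " ShipCountry TEXT)"
)
conn.execute("INSERT INTO Customers VALUES ('VINET', 'Vins et alcools Chevalier')")
conn.execute("INSERT INTO Orders VALUES (10248, 'VINET', 'France')")

# An inner join follows the relationship from order to customer.
rows = conn.execute(
    "SELECT o.OrderID, c.CompanyName FROM Orders o"
    " INNER JOIN Customers c ON c.CustomerID = o.CustomerID"
).fetchall()
print(rows)

# The constraint rejects an order for a customer that does not exist.
try:
    conn.execute("INSERT INTO Orders VALUES (10249, 'NOPE', 'Germany')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

The point is that the structure and the rules live in the database itself: the schema says what an order looks like, and the foreign key refuses data that breaks the relationship.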
There are many types of NoSQL databases, such as key-value, document, column and graph. The most popular NoSQL data store according to DB-Engines (as of October '16) is MongoDB, which is a document store, so I'll base the rest of this comparison on the document store model. A MongoDB database is made up of collections, which are akin to tables in an RDB. However, a collection is more like a group of JSON files, each file acting as a record. Here's a look at the Orders collection in a MongoDB version of the Northwind database:
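If you don't have a MongoDB server handy, the idea can be sketched in plain Python by treating a collection as a list of dicts. The find helper and the three sample orders are stand-ins I've made up for illustration; in the mongo shell the rough equivalent would be db.Orders.find().limit(2), assuming the collection is named Orders.

```python
# A collection modeled as a list of dicts; each dict plays the role of one document.
# The orders are a small sample shaped like Northwind data.
orders = [
    {"_id": 1, "OrderID": 10248, "CustomerID": "VINET", "ShipCountry": "France"},
    {"_id": 2, "OrderID": 10249, "CustomerID": "TOMSP", "ShipCountry": "Germany"},
    {"_id": 3, "OrderID": 10250, "CustomerID": "HANAR", "ShipCountry": "Brazil"},
]


def find(collection, limit=None):
    """Return documents in stored order, optionally capped at `limit`."""
    return list(collection) if limit is None else collection[:limit]


top_two = find(orders, limit=2)
print(top_two)
```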
That's all there is to finding the top two documents in the collection? Well that's pretty easy. Let's see if we can make that look any better:
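As a rough stand-in for pretty-printed shell output (the mongo shell has a pretty() helper on cursors), here is a minimal Python sketch that indents a single document with the standard json module. The sample order is illustrative only.

```python
import json

# One document from a hypothetical Orders collection.
order = {
    "OrderID": 10248,
    "CustomerID": "VINET",
    "ShipCountry": "France",
    "Freight": 32.38,
}

# indent=2 puts each field on its own line, which is much easier to scan.
pretty = json.dumps(order, indent=2)
print(pretty)
```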
Looks just like JSON, doesn't it? Now, I often use select distinct when I'm searching around in a database. How would I do that in MongoDB?
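In the mongo shell this is a one-liner along the lines of db.Orders.distinct("ShipCountry"). Here is a minimal Python sketch of what such a call does, assuming a collection modeled as a list of dicts; the distinct helper and the sample data are mine, not MongoDB's.

```python
orders = [
    {"OrderID": 10248, "ShipCountry": "France"},
    {"OrderID": 10249, "ShipCountry": "Germany"},
    {"OrderID": 10250, "ShipCountry": "Brazil"},
    {"OrderID": 10251, "ShipCountry": "France"},
]


def distinct(collection, field):
    """Collect each value of `field` once, in first-seen order."""
    seen = []
    for doc in collection:
        # Documents without the field are simply skipped.
        if field in doc and doc[field] not in seen:
            seen.append(doc[field])
    return seen


countries = distinct(orders, "ShipCountry")
print(countries)
```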
An array of the ShipCountry values. Ok, this is just too easy. Let's try some filtering:
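Filtering in a document store is usually expressed as a query document: a set of field/value pairs that a record must match, as in db.Orders.find({ShipCountry: "France"}). A minimal Python sketch of equality matching, with a made-up find helper and sample data:

```python
orders = [
    {"OrderID": 10248, "CustomerID": "VINET", "ShipCountry": "France"},
    {"OrderID": 10249, "CustomerID": "TOMSP", "ShipCountry": "Germany"},
    {"OrderID": 10251, "CustomerID": "VICTE", "ShipCountry": "France"},
]


def find(collection, query):
    """Return documents whose fields equal every key/value pair in `query`."""
    return [
        doc
        for doc in collection
        if all(key in doc and doc[key] == value for key, value in query.items())
    ]


french_orders = find(orders, {"ShipCountry": "France"})
print([doc["OrderID"] for doc in french_orders])
```

An empty query matches everything, which mirrors how a WHERE-less SELECT behaves.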
More filtering, without all the prettifying:
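Queries are not limited to equality. MongoDB has comparison operators such as $gt and $lt that go inside the query document. The sketch below imitates a small subset of them in plain Python; the matches helper is hypothetical and handles only the four operators listed.

```python
import operator

# Only these four comparison operators are supported in this sketch.
OPS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}

orders = [
    {"OrderID": 10248, "ShipCountry": "France", "Freight": 32.38},
    {"OrderID": 10249, "ShipCountry": "Germany", "Freight": 11.61},
    {"OrderID": 10250, "ShipCountry": "Brazil", "Freight": 65.83},
]


def matches(doc, query):
    """True if `doc` satisfies every condition in `query`."""
    for field, condition in query.items():
        if field not in doc:
            return False
        if isinstance(condition, dict):
            # e.g. {"$gt": 30}: every operator in the dict must hold.
            if not all(OPS[op](doc[field], bound) for op, bound in condition.items()):
                return False
        elif doc[field] != condition:
            return False
    return True


heavy = [d["OrderID"] for d in orders if matches(d, {"Freight": {"$gt": 30}})]
heavy_brazil = [
    d["OrderID"]
    for d in orders
    if matches(d, {"ShipCountry": "Brazil", "Freight": {"$gt": 30}})
]
print(heavy, heavy_brazil)
```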
Now that we've seen some basic select and filtering, how about the rest of the CRUD? Here's a select, insert, select, remove, select in the same number of lines:
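The same select, insert, select, remove, select sequence can be sketched in Python against a list of dicts. The insert, find and remove helpers are illustrative stand-ins for the shell's collection methods, and the order number 99999 is invented.

```python
orders = [{"_id": 1, "OrderID": 10248, "ShipCountry": "France"}]


def insert(collection, doc):
    """Add one document to the collection."""
    collection.append(doc)


def find(collection, query):
    """Return documents matching every key/value pair in `query`."""
    return [d for d in collection if all(d.get(k) == v for k, v in query.items())]


def remove(collection, query):
    """Delete matching documents in place and return how many were removed."""
    doomed = find(collection, query)
    collection[:] = [d for d in collection if d not in doomed]
    return len(doomed)


before = find(orders, {"OrderID": 99999})                               # select
insert(orders, {"_id": 2, "OrderID": 99999, "ShipCountry": "Canada"})   # insert
during = find(orders, {"OrderID": 99999})                               # select
removed = remove(orders, {"OrderID": 99999})                            # remove
after = find(orders, {"OrderID": 99999})                                # select
print(len(before), len(during), removed, len(after))
```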
Now check this out. Say I wanted to add a new record with a new field that doesn't exist:
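A minimal Python sketch of the idea, again treating the collection as a list of dicts. The existing order and the value stored in ATotallyNewField are made up; the takeaway is only that nothing stops one document from carrying a field its neighbors lack.

```python
orders = [
    {"_id": 1, "OrderID": 10248, "ShipCountry": "France"},
]

# No ALTER TABLE step: the new document just shows up with an extra field.
orders.append({"_id": 2, "OrderID": 99999, "ATotallyNewField": "Hello!"})

# Each document carries its own set of fields.
fields_per_doc = [sorted(doc) for doc in orders]
with_new_field = [doc["OrderID"] for doc in orders if "ATotallyNewField" in doc]
print(fields_per_doc)
print(with_new_field)
```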
Wait, what? There's no such field as ATotallyNewField in the Northwind Orders collection!! But that is both the beauty and the craziness of NoSQL: the schema is completely flexible and dynamic. You can add whatever data you want to a single record; there are no set fields a record can or cannot contain, with the exception of the _id field. Every document has one, and if you don't supply an _id yourself, MongoDB generates an ObjectId for it; it serves as the primary key. Yes, there are primary keys...NoSQL isn't completely unruly!
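Here is a minimal sketch of that primary-key behavior in Python. It uses a random UUID as a stand-in for MongoDB's ObjectId (the real thing is generated differently), and the insert helper is hypothetical.

```python
import uuid


def insert(collection, doc):
    """Store a copy of `doc`, generating an _id when none is supplied."""
    stored = dict(doc)
    # uuid4 is only a stand-in here; MongoDB would generate an ObjectId.
    stored.setdefault("_id", uuid.uuid4().hex)
    if any(existing["_id"] == stored["_id"] for existing in collection):
        raise ValueError("duplicate _id: " + repr(stored["_id"]))
    collection.append(stored)
    return stored["_id"]


orders = []
generated_id = insert(orders, {"OrderID": 10248})            # _id is filled in
chosen_id = insert(orders, {"_id": 42, "OrderID": 10249})    # caller's _id is kept

try:
    insert(orders, {"_id": 42, "OrderID": 10250})            # same key again
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(chosen_id, duplicate_rejected)
```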
Let's not forget ordering:
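In the mongo shell, ordering is done by chaining sort onto a query, for example find().sort({Freight: -1}) for descending order. A rough Python equivalent over a list of dicts, with sample freight values for illustration:

```python
orders = [
    {"OrderID": 10248, "ShipCountry": "France", "Freight": 32.38},
    {"OrderID": 10249, "ShipCountry": "Germany", "Freight": 11.61},
    {"OrderID": 10250, "ShipCountry": "Brazil", "Freight": 65.83},
]

# Descending by Freight, like ORDER BY Freight DESC.
by_freight_desc = sorted(orders, key=lambda doc: doc["Freight"], reverse=True)

# Ascending by OrderID, like ORDER BY OrderID.
by_order_id = sorted(orders, key=lambda doc: doc["OrderID"])

print([doc["OrderID"] for doc in by_freight_desc])
print([doc["OrderID"] for doc in by_order_id])
```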
I can tell your fear of NoSQL databases is starting to recede. But what about data normalization, you ask?
Remember, NoSQL databases are best suited to situations where you need a highly scalable data source with excellent performance that can support dynamic schemas: in other words, agile, fast-changing data. Consistency across the entire database matters less in those situations, so data normalization matters less too. This is why NoSQL databases work well when there is a large amount of data, low latency is a priority, and strict database-wide integrity is not. A good example is a large article aggregation site with thousands of comments across hundreds of articles. In MongoDB, data normalization is generally given up in favor of embedded documents, looking something like this:
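A minimal Python sketch of the trade-off, using the article-and-comments example. Both layouts and both loader functions are invented for illustration; the names are not from any real schema.

```python
# Normalized, relational style: comments live apart and point back at the article.
articles = [{"article_id": 1, "title": "Big Data for the Hopelessly Relational"}]
comments = [
    {"comment_id": 1, "article_id": 1, "author": "alice", "text": "Great read!"},
    {"comment_id": 2, "article_id": 1, "author": "bob", "text": "Spatulas?"},
]


def load_article_normalized(article_id):
    """Two lookups plus a join-like merge."""
    article = next(a for a in articles if a["article_id"] == article_id)
    related = [c for c in comments if c["article_id"] == article_id]
    return {**article, "comments": related}


# Embedded, document style: the comments travel inside the article document.
article_docs = [
    {
        "_id": 1,
        "title": "Big Data for the Hopelessly Relational",
        "comments": [
            {"author": "alice", "text": "Great read!"},
            {"author": "bob", "text": "Spatulas?"},
        ],
    }
]


def load_article_embedded(article_id):
    """One lookup returns the article and its comments together."""
    return next(d for d in article_docs if d["_id"] == article_id)


normalized = load_article_normalized(1)
embedded = load_article_embedded(1)
print(len(normalized["comments"]), len(embedded["comments"]))
```

The embedded version needs no join, but nothing enforces that a comment's author exists anywhere else, and any data copied into many documents has to be updated in each of them.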
What you lose in consistency you gain in simplified queries and lower latency.
I hope this gentle exposure to a NoSQL database in a familiar context takes the edge off any fears of a relational database version of the end of days. Just remember: pick the right tool for the job!