The last few years have seen the emergence of new database solutions such as MongoDB, Cassandra, Neo4j, DynamoDB, and HBase, as prominent, mainstream platforms of choice for many enterprise applications and large web based software services.
- The Unique Identification Authority of India (UIDAI) uses MongoDB for Aadhaar. Metlife, Expedia, City of Chicago, Cisco, eBay, Royal Bank of Scotland also use MongoDB.
- eBay, Netflix, Target, Cisco, British Gas, Coursera, ING Vysya Bank use Cassandra
- Walmart, eBay, Telenor, Adidas, UBS, LInkedIn, Cambridge University Press use Neo4j
- Amazon, Duolingo, Lyft, Salesforce, CA Technologies use DynamoDB
Collectively these relatively newer database solutions are known as NoSQL (no SQL). The term NoSQL may be a slight misnomer, as some of these do use a bit of SQL (leading to an alternative, contrived interpretation of the term NoSQL as Not only SQL). But it serves to highlight the fact that most of these database solutions have significantly different architecture, philosophy and characteristics compared to the relational databases, such as Oracle, SQL Server and DB2.
The Relational Database Management Systems (RDBMS) have dominated the data storage side of the software applications landscape for more than four decades. These are stable and robust solutions and provide a standardized language for database access (SQL), though the extensions (such as PL/SQL and T-SQL) for complex tasks such as stored procedures are platform-specific and non-standard.
The RDBMS platforms handle transactions with great reliability and ease. Common RDBMS concepts such as normalization, indexes, views, constraints, triggers are pretty well understood by most developers, and are a natural part of their vocabulary.
What, then, has led to the rise of these alternative database platforms known as NoSQL, and how do these differ from the venerable RDBMS platforms?
RDBMS: The Data Volume Challenge
The last decade has seen an unprecedented and rapid growth in the amount of data that need to be stored in various applications. For example:
- In 2013, Twitter shared the statistics that about 500 million tweets were generated every day. Surely, this volume must be much higher than that now in 2017 (most of these companies don’t often share the statistics about their data volumes)
- As of April 2014, Facebook had about 300 petabytes of data (1 petabyte = 1000 terabytes), with a daily addition of about 600 terabytes of data. As of December 2016, Facebook had about 1.15 billion mobile, daily active users.
- Walmart has about one million customer transactions per hour.
- And the granddaddy of all: Google. A third-party project, in 2014, estimated the data stored by Google to be about 10-15 exabytes (1 exabyte = 1 million terabytes).
How do you store such humongous volumes of data. The relational databases were originally architected for a single server which can respond to a large number of client connections. With a single server, you have a limit upto which you can add storage capacity and computing power to scale it up.
Even though many of the RDBMS platforms have evolved to work with (small) clusters of servers, that does not come naturally to the RDBMS platforms, and requires a lot of complex software engineering effort.
NoSQL Solution: Distributed Architecture
NoSQL platforms don’t have the burden of old architecture that RDBMS platforms carry. They have been architected from the ground up, to deal with the challenges of large data volumes (and many others, as we’ll see later in this article).
The approach taken by NoSQL platforms is to replace scaling up with scaling out. Instead of increasing the capacity of a single server (or a small number of them in a cluster), these platforms use a distributed architecture, in which you can have hundreds or thousands of commodity servers in a geographically distributed cluster.
For example, Cassandra uses a cluster of peer-to-peer servers, in which there is no central or master server. All servers (called nodes) are equal peers. Data is automatically distributed across the cluster using an internal algorithm, ensuring that the workload is equally shared among the servers. Or proportionately if they don’t have the same capacity.
And yet, a client connecting to the database cluster, does not need not know which node stores the data that it is trying to update or query. A client may get connected to any node in the cluster, that node acts as a coordinator to send the data to the appropriate node to store (or send the request to retrieve the data from).
As the data volume or workload increases, you can easily add new servers to the cluster, without any downtime.
RDBMS: The Performance Challenge
With huge volume of data, and millions of simultaneous users (as in the case of Facebook, WhatsApp, Google, etc) comes the challenge of delivering high performance. Users are impatient. And many times the applications are only useful if they are fast enough (consider Google Maps, WhatsApp, etc). We expect Google to display search results in the blink of an eye. We expect our Facebook page to appear within a second or so when we log in. The systems must perform well despite high volume of data and millions of concurrent users.
Now if you have worked with relational databases, you know that the recommended practices of normalization can result in a large number of tables. Try to visualize the number of relational, normalized tables for a service like Facebook. With your personal details, list of friends, your hobbies, the pages you like, your posts, the likes on your posts, comments on your posts (and comments on comments, etc.), posts of others that you have liked or commented on, pages you follow, pictures, and loads of other details about you, the number of tables may run into many dozens.
So imagine many dozens of tables, each having hundreds of billions of records. And a join being required across dozens of such large tables, in order to present the home page for your Facebook login. Even if all this data could be stored on a single server (it cannot!), the cost of the joins would be an application-killer. We know that joins can be expensive. Prohibitively expensive when they involve more than a few million records. Specially if you want a sub-second response.
NoSQL Solution: Denormalized Structures, No Joins
NoSQL databases address this problem by deliberately giving up on some of the key features that have traditionally been considered holy grails of relational databases. For example:
- NoSQL database structures are typically denormalized.
- They encourage complex data structures to be stored as a single record or document.
- Even if you want to keep your database structure normalized, join operations are not supported by most NoSQL platforms.
- If you really want to join data from multiple tables, you will have to do so in your application code.
For example, the following is a perfectly normal document (record in RDBMS parlance) to store in MongoDB (along with the syntax to store it in the database):
This is a JSON document. As you can see, you can have collections and nested structures within a document. With all the data for a person stored as one single document (record) in a collection (table), there is no need for joins, resulting in very high performance of retrieval or manipulation operations (you will need to make good use of indexes for large datasets).
Similarly, in Cassandra, a record can have one or more collection fields, such as an ordered list of family members of a person, or a set of email ids.
For the developers who have grown up on a staple diet of RDBMS concepts, such denormalized structures, and the lack of join operations appear unusual, and raise a lot of questions about the suitability of NoSQL platforms for serious enterprise applications. The mindset change can be challenging for some.
RDBMS: The Availability Challenge
When did you last see Facebook reporting that the site is down? Our life may come to a standstill if we cannot connect to it whenever we want!
Clients today expect 100% uptime with no single point of failure. Gone are the days of 99.99%.
Relational database platforms typically have a master-slave architecture. There is a server that runs the database instance, and there are clients that connect to it. This potentially makes the server a single point of failure. Of course, with high availability solutions (provided by the database platform and / or the operating system), the risk can be almost completely eliminated.
Besides, the RDBMS platforms were not originally designed for multi-datacenter scenarios. Nor for geo clusters, or the cloud. Though the RDBMS vendors keep adding such features from time to time, but the patchwork sometimes tends to be clumsier than the original.
NoSQL Solution: Replication of Data
In a cluster consisting of large number of servers, it is likely that some server will fail from time to time. But that should not result in loss of data, even temporarily. NoSQL databases handle this with automatic (but configurable) replication of data. You can configure your data to be replicated to multiple servers, so that in case of a few nodes failing from time to time, all data can still be found on some other node.
Distribution and replication can be done across geo clusters. In case of disasters, even if an entire data center goes down, data can be accessed from other locations.
RDBMS: The Challenge of Changing Structure
With relational databases, the table structures and data types need to be fixed in advance. In today’s world of agile software development, with increasing focus on rapid and incremental delivery of working software, the database structure may need to evolve. This is usually challenging, because adding a new piece of data item, or changing the data type requires an entire table to be altered, and the access to data gets hampered during this change.
NoSQL Solution: No Schema or Flexible Schema
NoSQL databases don’t care too much about defining a structure or a schema. Many of the NoSQL databases, such as MongoDB and Neo4j, just don’t have any notion of a predefined structure.
In MongoDB, for example, you can just start inserting documents in a collection (table) in MongoDB, without first creating its structure. The “db.persons.insert (…)” statement given earlier in this article, may well be the very first command you issue in a brand new MongoDB database. You can have different fields in different documents.
You want to have a new field in your database? No problem. Just start providing the new field for the new documents, or while modifying the existing ones. Does not matter if the older documents don’t have the new field. The NoSQL databases thus enable one to adapt the backend to the emergent needs of a project in an easy manner.
Cassandra, on the other hand, is slightly more like relational databases on this count. You need to issue a “create table …” statement first. But you can still add a new column instantaneously without worrying about the existing records. You can even change the data type for an existing column, which has an immediate effect on the new records, but no effect on the existing ones.
Complexity or Ease of Installation and Administration
Relational databases are complex to administer. Ask any DBA for Oracle or SQL Server. You need to learn so much, you need to monitor a lot, and fine-tune from time to time.
NoSQL databases, in contrast, are much easier to install and administer. Most of them have a very small footprint, and can be installed in less than 5 minutes for a simple configuration. Even for more complex scenarios, they take much less time to setup and require significantly less planning. They are usually much easier to administer and fine-tune.
Limitations of NoSQL
While NoSQL database platforms offer the benefits described above (and many more), there are many limitations (by design) or challenges that the developers need to be aware of. For example:
- As already mentioned, NoSQL databases normally don’t have the concept of joins across tables. If joins are really required, those must be handled in your application.
- Relational databases provide strong transaction support. However, the notion of ACID transactions is usually absent or limited in most NoSQL databases. Due to the distributed and replicated nature of the NoSQL databases, ensuring the same level of consistency as in relational databases is not natural, and requires some tuning or additional work on part of the developers or the administrators.
- There is no standard query language across the NoSQL database platforms.
NoSQL databases compromise on some features of relational databases such as normalization, joins, transactions, and consistency. These are conscious design choices, to support significantly higher volume of data, and yet provide higher performance and availability. They either don’t have any schema, or are flexible with their schema. They are also easier to install, configure and administer.