To SQL or NoSQL? There is no clear cut answer
Poke around the infrastructure of any startup website or mobile app these days, and you’re bound to find something other than a relational database doing much of the heavy lifting. Take, for example, the Boston-based startup Wanderu. This bus- and train-focused travel deal site launched about three years ago. And fed by a Web-generated glut of unstructured data (bus schedules on PDFs, anyone?), Wanderu is powered by MongoDB, a “NoSQL” database—not by Structured Query Language (SQL) calls against traditional tables and rows.
But why is that? Is the equation really as simple as “Web-focused business = choose NoSQL?” Why do companies like Wanderu choose a NoSQL database? (In this case, it was MongoDB.) Under what circumstances would a SQL database have been a better choice?
Today, the database landscape continues to become increasingly complicated. The usual SQL suspects—SQL Server-Oracle-DB2-Postgres, et al.—aren’t handling this new world on their own, and some say they can’t. But the division between SQL and NoSQL is increasingly fuzzy, especially as database developers integrate the technologies together and add bits of one to the other.
The genesis of NoSQL
In the beginning—about 12 years ago—there was structured data, and it was good. Usually consisting of things like numbers, dates, and groups of words and numbers called strings, structured data could be displayed in titled columns and rows that were easy to order. Financial companies loved it: you could put customers’ names and account balances into rows with titled columns, and you could put the data into tables and do other things with it, like join tables and run queries in a language, SQL, that was pretty close to English.
But that data was live, stored in operational systems like Enterprise Resource Planning (ERP) setups. People wanted to squeeze intelligence out of it, to answer questions like how sales were going or how various regions were doing. The IT department was loathe to do it, though, wary of letting people hit on operational systems. So data was divided into operational and analytical, and thus data warehouses were born. As the questions analysts wanted to ask data became more complex and the amount of data stored became increasingly vast, the databases that held all this began to look less and less like SQL.
SQL-based relational servers are built to handle the demands of financial transactions, designed around the tenets of ACID: Atomicity, Consistency, Isolation, and Durability. These characteristics ensure that only one change can be written to a data field at a time, so there are no conflicting transactions made. If you withdraw $50 from an ATM in Boston, and your spouse simultaneously withdraws $100 from the same account at an ATM in Tokyo, the balance change has to reflect both—not to mention, it must cancel out one of them if the account is overdrawn. Consistency means that no matter which database server you ask, you’ll get the same value no matter who’s asking or when.
ACID, though, doesn’t matter much when you’re just reading data for analysis. And the database locks that SQL databases use to protect database consistency in transactions can get in the way. The Internet ushered in what VoltDB Director of Product Marketing Dennis Duckworth calls “Web-scale attacks” on databases: as in, up to hundreds or even millions of people wanting access to the same data sources at the same time. Think of the “not available” messages you get when you try to access a Web service or site. The database is simply staggering under demand. Too many people trying to change data at the same time gave rise to locks as the databases struggled to maintain consistency.
How do you scale an Internet business to handle that? It used to be that you’d buy a bigger server—an HP Superdome, say, or a huge mainframe that could scale up. But that got expensive fast. Businesses turned to buying cheaper, commodity boxes to scale out instead of up, distributing the database out over hundreds or even thousands of servers.
But outside of financial transactions, you don’t always need the most up-to-the-second abilities to write data. “Pretty close” can be good enough, such as when the database is just overwriting old data and it’s OK to get the results a little wrong for a few minutes. Think of Google’s indexing. Its search engine indexes the entire visible Internet every day. It can’t write the data while people are simultaneously Googling the same search terms, so it doesn’t give us the most up-to-date result when we search. It gives us the last, best answer it can.
That setup is a little sloppy, and it wouldn’t work for financial transactions. But that ability is just fine for developers who need drop-dead-fast results, not pinpoint perfect.
NoSQL databases are often associated with “big data” tasks, handling large volumes of data in various forms:
- Columnar databases, for dealing with massive collections of simple structured data, such as log files. Google, for its part, has Bigtable: a distributed, column-oriented data store to handle structured data associated with the company’s Internet search and Web services operations, including Web indexing, MapReduce, Google Maps, Google Book Search, “My Search History,” Google Earth, Blogger.com, Google Code hosting, YouTube, and Gmail.
- Key-value and other unstructured databases such as MongoDB, which use rules defined by programmers to comb through large amounts of unstructured data, including documents and websites.
- High-speed databases for processing large streams of data with extremely low latency. Those diverging tasks have driven the evolution of some very different technologies that often get lumped together as NoSQL, and some that are labeled as “NewSQL.” These provide some of the things that SQL-based relational databases do but are tuned heavily for handling the demands of big data. VoltDB falls into this category: it’s a high-speed SQL database. Couchbase is another high-speed, more general-purpose, NoSQL database.
Hadoop isn’t even really considered a database, but when you look at databases, you’ll no doubt come across it. It was designed as a cheap way to store data and process it someday, in some way. Currently, it’s huge. Hadoop is everywhere online: Facebook, eBay, Etsy, Yelp, Twitter, Salesforce. It’s in the physical world, too, as companies in entertainment, energy management, and satellite imagery look to analyze the unique types of data they’re collecting. Eye-popping forecasts predict that the Hadoop market is on track to hit $84.6 billion by 2021.
Hadoop is enabled by a technology Google created called MapReduce, a way to process and generate large data sets with a parallel, distributed algorithm on a cluster. Google wrote a few papers on it, and then it got picked up by Yahoo programmers who brought it into the open source Apache environment. MapReduce evolved into what Yahoo hoped would be an answer to its search engine woes: an open source platform called Hadoop that collects data from sources such as social media, customers, and financials, storing it in a data warehouse to undergo the MapReduce process. It has made it easier and cheaper than ever to analyze the data being churned out by the Internet. Fun database fact—Hadoop was named after a toy elephant.
But Hadoop isn’t for everyone. It’s the opposite of plug and play. Forget the user-friendly language of SQL; Hadoop needs trained, talented experts who know how to manage big data. Developers have released software solutions like Cassandra, HBase, Hive, Pig, and ZooKeeper to interact with this SQL-unfriendly stuffed elephant, but Hadoop also needs a front end. The list of what it needs goes on, as you can see on a chart put together by Gigaom three years ago that displays as many layers as a pallet of Sacher Tortes.
Patrick Lie, a big-data engineer, encountered Hadoop when he was an engineering manager at TokBox, a WebRTC platform for embedding live video, voice, and messaging into websites and mobile apps. The company used Hadoop to mine data in server logs and client communications. That included things like video bit rate. It was a typical use case for Hadoop, he said. The work involved “a lot of data ingestion.”
“It was a couple million rows [of data] a day,” Lie told Ars. “[We’d] pull it in, process it, and make it available for analysts… [We didn’t] care how fast it was processed. It’s moved in one direction. You put it through the pipeline, and you put it out in a cleaner, more usable fashion. It’s not used again by customers unless it’s manipulated heavily.”
Here’s a typical piece of analysis his team would get out of Hadoop: TokBox had videoconferencing technology happening inside of a browser. Different browsers would give different video performance. Lie’s team would get information such as what type of browser was in use, what the video bit rates were, and where the users were located, perhaps based on IP address. Aggregating that data across many users, TokBox could come away with a realization such as, “Hey, Chrome works better than Internet Explorer in Sweden.”
It’s theoretically possible to do something like that with a SQL database, Lie said, but in practical terms, “you’d be spending so much for a traditional SQL database, or it would take forever to execute. Before Hadoop, somebody would write custom software to do this kind of processing.” It’s not so much a black and white, SQL vs. NoSQL situation, of course: TokBox still kept SQL databases around. That’s where his team fed portions of Hadoop output for analysts to query more easily.
There are some places where you don’t want to use Hadoop, at least on its own—anywhere you need real-time data “in any way,” Lie said. If you’re looking for real-time work, there are solutions that “sort of work on top of Hadoop,” but on its own, Hadoop isn’t meant for it according to the engineer.
Another potential problem can arise from writing inefficient queries or trying to do jobs without a basic understanding of how Hadoop works. One mistake that’s easy to make, Lie said, is to choose the wrong key to reduce on in key-value pairs. If you choose the wrong key in the map phase, the worst possible case is that the key is a random number between, say, 1 and 1 million. That would create 1 million keys. “It’s easy to create inefficient jobs,” Lie said. “It can take an hour, or it can take a day. It’s easy to make those mistakes.”
Scott Gnau, CTO at Hadoop distributor Hortonworks, says Hadoop use cases cut across many industries, all of which are gathering Web data. The list includes retailers, financial services (and yes, Hadoop takes in SQL aggregates, “believe it or not,” he said), Internet of Things data from resources such as connected cars or healthcare. Think of all the data, both unstructured and structured, that’s created when you visit your doctor. Who’s the patient? Who’s the doctor? What was the prognosis? X-rays, CAT scans, clinical trials, tests, drug interaction matrices can all be images, all unstructured data. Health insurance payers, providers, and manufacturers and many more are taking advantage of Hadoop to handle all that data, which was once buried in silos in an all-SQL world.
“For 30 years in the IT industry, we’ve had traditional development models,” Gnau said. “We grab users, put them in a room, get their requirements, build out the design, look for data to find structures and reporting, implement and hand it back to users. It was a traditional waterfall approach. Even agile development starts with requirements and winds up with results.
“In Hadoop, you need to challenge that and turn it around. Data is not structured. Users may not know what the requirements are. It’s an inverse process—you land the data, find data scientists to find relationships that are interesting and appealing, and turn that into requirements that yield a system. It’s the opposite approach to a traditional process.”
Beyond the not-really-a-database Hadoop, there are an awful lot of actual databases to choose from. DB-Engines ranks 264 of them by popularity. MongoDB is at No. 4 of all databases, including both SQL and NoSQL, and it’s the most popular NoSQL database.
Warren said he can’t remember what SQL database Wanderu was using before going NoSQL. But he does know that the architect of what it uses now came in, looked at the SQL database, and said that it wasn’t going to work.
“He recognized that the original solution was trying to do too much,” Warren said. “It was trying to save a bunch of data and routing info. And the way it was constructed, it didn’t make a lot of sense for what we needed. When you take those things and flatten them out, turns out you have a nice set of documents, and all relations and routing requirements fit really well in a graph database. So we use MongoDB as a primary data store for all single trips, and if we have to route you between multiple places, we’ll query [graph database] NEO4J and then return to MongoDB to… stuff [the journey’s legs] together and return them on the site to the users.”
The Web generation’s developers love this stuff. MongoDB started, actually, at hackathons. Hackers loved it for the same reasons that 30 of the Fortune 100 now use it. Namely, MongoDB VP of Strategy Kelly Stirman says that it was the first database to be designed from the onset to make developers “maximally productive.”
“Today, developers work in object-oriented programming languages, and the way they think about data and managing data is massively different than the way a relational database works,” Stirman says. “If you look at a hackathon, there’s enormous value in [production] and speed. MongoDB lets you go faster in terms of developing apps than any other database. That appeal for hackers is extremely appealing to business. If I can build apps fast, I can be faster than the competition. I can adapt to market changes very, very quickly.”
In the days when relational databases held sway, the biggest costs of building apps had to do with hardware and storage, Stirman said. How long it took to develop them didn’t really matter—at least, it paled in comparison to hardware costs. But now, you can run servers on Amazon for pennies an hour. The major cost now is the cost of the developers cranking out the apps.
One of the hackathons that Stirman says people like best had to do with MetLife and the years it spent struggling to build a centralized customer call center on a relational database that could handle all its insurance products (auto, life, and homeowners). At an internal hackathon, somebody put together something interesting with MongoDB within hours. It went from zero to production in 90 days.
Another MongoDB loyalist is YouGov, an international, Internet-based market research firm. CTO Jason Coombs says that YouGov has seen a 70-percent reduction in storage capacity after migrating to the latest release, MongoDB 3.0. That means that by turning on compression and replicating data by WiredTiger nodes—WiredTiger being the default storage engine starting in MongoDB 3.2—the company saved 70 percent of the space it formerly used.
That’s a big gain, Coombs said, given that YouGov is collecting 5 gigabits of data an hour, and they don’t want to just shove it on a shelf. They want to use that data indefinitely to query across the historic data set.
YouGov had considered migrating to Microsoft SQL Server but found it couldn’t do a number of things. The company had two competing survey systems: one on Microsoft .NET and one based on using Python with MongoDB. The Python group was able to create features much faster, Coombs said. YouGov was also using SQLite to store text chats, but that database was hitting a wall when it came to search response times.
Then, too, there was Postgres: a relational database that’s still one of YouGov’s default databases along with MongoDB. YouGov uses Postgres to keep track of analysts and their contact information, to know who they are by tracking their preferences and opinions and to get a sense of how they’re going to feel about something by extrapolating from population. YouGov could have taken that to MongoDB, Coombs said, but it fits well into Postgres. It’s one long column of key-value pairs.
Fun fact: Couch is an acronym for cluster of unreliable commodity hardware. Couchbase Server is an open source, NoSQL, distributed, JSON-based document store.
It’s considered to be a CP type system—meaning it gives consistency and partition tolerance—when viewed through the lens of what’s known as the CAP, or Brewer’s, theorem. That acronym stands for consistency, availability (the system’s always there to write to and, hopefully, read from), and partitionability (you can run your database across multiple servers). Take your pick, the theorem states, since a distributed computer system can only guarantee two out of three.
“There are lots of companies that predict they’re going to have hundreds of users, [such as in] a high Christmas rush,” Katz says. Couchbase fills that need: it can add in capacity, and users are spared for paying for it when that capacity’s no longer needed. Online gaming is this kind of hits-based business. They have loads of customers and want something that can grow with them but won’t force them to spend huge money on infrastructure for a game that might bomb.
For an example of a Couchbase Server-based company that didn’t bomb, consider “Draw Something.” It’s a mobile app in which players take turns drawing a picture to prompt their partner to hit on the guessword they’re trying to get across. It was a huge hit. Within five weeks of its launch, it had been downloaded 20 million times.
Couchbase CEO Bob Wiederhold says the scalability has more facets besides stretching to meet that kind of demand. It’s about scaling to increase the number of database nodes to meet Black Friday demands while also being able to scale up or down in one datacenter to support varying levels of users. This is what’s called Cross Datacenter Replication, or XDCR—the ability to replicate data from one cluster to another cluster, primarily for disaster recovery. Wiederhold says Couchbase is also good at filtering to respect some countries’ laws against moving personal data from one datacenter to another region. That’s an issue cropping up in European countries after Edward Snowden’s revelations about US surveillance.
Couchbase has come a long way, Wiederhold says, and is moving toward becoming a more general-use database that can be used in any facet of a business, more like an Oracle. “Five years ago, we were initially a cache. We supported cache-based use cases in 2011,” he said. “We expanded so we were a key value database. That allowed us to support more technical use cases. Then we expanded it further and became a document database. We could begin indexing and building queries on JSON documents we stored.”
But in an agile development area, Couchbase wasn’t so agile. Wiederhold says the 4.0 release fixed that. That’s when Couchbase released N1QL: a SQL-compliant query language, pronounced Nickel, for JSON to marry the power of SQL with the flexibility of JSON.
“You start out trying to focus on a narrow set of use cases,” Wiederhold says. “As you build out the product, you expand. We feel with the 4.0 release, we have a truly general-purpose database that covers a broad set of use cases.”
VoltDB and the return of SQL
In 2008, with SQL-confounding Web data flooding the world and object-oriented programming all the rage as the tool to handle it, SQL was sneered at as if the venerable, powerful query language were a paisley polyester shirt. But SQL isn’t going anywhere—it’s evolving, thanks to advances in data science. By 2013, SQL was experiencing a renaissance, and its power became manifest in NewSQL. This is a reworking of relational database management systems that promised extreme scalability for online transaction processing (OLTP), just like NoSQL systems, and the same valuable guarantees of ACID compliancy that traditional RDBMSes had always promised.
VoltDB is one example—a distributed, in-memory, massively parallel NewSQL relational database that clearly shows that SQL can be made to scream. The company was co-founded by Mike Stonebraker, winner of the 2014 ACM Turing Award and the computer scientist behind many of the relational databases in the market today, including Ingres and PostgreSQL.
Recall Brewer’s theorem? How you can have two out of three when it comes to database consistency, availability, and partitionability? Some of the NoSQL database creators decided that availability is critical. Consistency can wait. Others decided that consistency was critical—what you write into Boston has to match what gets written in Japan, so availability will have to take a hit.
The developers of Volt had other ideas. “Volt decided there was a need to look at [data] immediately,” says VoltDB’s Dennis Duckworth, director of product marketing. “[It’s important to have the] ability to act right away. Volt takes the view that data’s more useful when it’s fresh. It gets stale. We like to sit right on the stream of data coming in and react to it. Think of sensors, like those used with heavy machinery. Suddenly, the temperature spikes. You want to react quickly to address that. [Otherwise,] a week later, the machine’s already burnt out. With Volt, you can sense the temperature spike and have an alert sent out, and have a signal sent out that shuts down the system right away. We see fast data as different from big data.”
Financial services are another area where transactions—knowing exactly the same amount in an account and having all systems agree—are critically important. A NoSQL database promises they’ll “eventually” be consistent? Forget it. Such a business needs ACID compliancy all the time, Duckworth says.
“NoSQL guys can’t do it,” he says. “If they try, it’s up to an application developer to write lots and lots [of code] to enforce consistency.”
It’s increasingly apparent that for many, it’s no longer an issue of SQL vs. NoSQL. Instead, it’s SQL and NoSQL, with both having their own clear places—and increasingly being integrated into each other. Microsoft, Oracle, and Teradata, for example, are now all selling some form of Hadoop integration to connect SQL-based analysis to the world of unstructured big data.
As Teradata General Manager of Enterprise Systems Dan Graham tells it, the move to embrace Hadoop was pushed largely by one big customer. Netflix was beating Teradata over the head until the company broke down and dragged everything into the cloud—Amazon Cloud, where most of your movies are now sitting, along with a bunch of other data. The result was a hybrid environment of Amazon Hadoop plus Teradata Cloud for Hadoop, which aims to take care of the heavy lifting with Hadoop monitoring, maintenance, and installation. Teradata is now actually selling Hadoop itself and using its own SQL query optimizer technology to serve it up to business customers’ tools.
“If you know what you’re doing, [Hadoop is] really not competing with us,” Graham said. “It’s the difference between a big Ford truck and a BMW sedan. Do they compete? A little, yes. But if you do best-pick engineering, you’ll get what you need. If you’re carrying refrigerators, or bags of cement, you don’t want to put them in a BMW. You put it in a Ford F150.”