You cannot open a business or IT trade magazine these days without seeing a major article about the impact of “Big Data” on the enterprise. What actually constitutes Big Data is rarely discussed; the unstated assumption is that any large mass of enterprise data qualifies. The “Three V’s” of Big Data are often cited as its defining characteristics: Volume, Velocity, and Variety. But Big Data systems should also be examined, and compared to RDBMSs and other systems, in light of another three-letter framework: the CAP theorem, proposed in 2000 by Eric Brewer. The “CAP” of Brewer’s theorem stands for Consistency, Availability, and Partition Tolerance. Brewer proposed that a networked system can guarantee at most two of these three properties at once; trade-offs prevent it from providing all three.
Consistency in the CAP theorem roughly corresponds to the “C” in the “ACID” properties (Atomicity, Consistency, Isolation, and Durability) expected of traditional RDBMSs. It means that when a transaction finishes, the data ‘looks the same’ everywhere: all inserts, updates, and deletions to all affected tables are visible identically across the system. Availability refers to the system’s ability to respond to every request. For example, if you hit “Submit” on a form, does the system respond in some way, or are you left wondering what happened? Partition Tolerance means the system keeps operating despite a network split or the failure of a node, whether a single server or an entire data center. “Fail-over” procedures and mechanisms are what allow systems to be Partition Tolerant.
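The ACID guarantee described above can be seen in miniature with Python’s standard-library sqlite3 module. This is an illustrative sketch, not a production pattern; the table, account names, and the simulated mid-transaction crash are all invented for the example. The point is that when a transfer fails halfway, a rollback leaves the data exactly as it was, so every reader sees a consistent state:

```python
import sqlite3

def transfer(conn, frm, to, amount, fail_midway=False):
    """Move money between accounts as one atomic transaction."""
    try:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, frm))
        if fail_midway:
            # Stand-in for a crash between the two updates.
            raise RuntimeError("simulated failure mid-transaction")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, to))
        conn.commit()  # both updates become visible together
    except RuntimeError:
        conn.rollback()  # undo the half-finished transfer

# Set up a throwaway in-memory database with two accounts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

transfer(conn, "alice", "bob", 30, fail_midway=True)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 50} -- the failed transfer left no trace
```

Without the transaction, the failed transfer would have debited one account without crediting the other, which is precisely the inconsistency ACID compliance rules out.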
Because Big Data systems are distributed across tens, hundreds, or even thousands of servers, and often globally across data centers on different continents (e.g., Akamai’s content delivery network), they excel at providing Availability and Partition Tolerance. However, they rarely target transactional Consistency as their primary benchmark. Most Big Data systems instead strive for “Eventual Consistency”: within some timeframe (from a few seconds up to many minutes), the data will become consistent across all replicas. This is perfectly acceptable for most web content systems, but it would not be acceptable for banking and trading systems, or for real-time monitoring in operational settings such as manufacturing or medicine. In other words, it may not be of utmost importance that you instantly see your sister’s latest comment on the picture you uploaded to Facebook from your phone, or the latest trending pop-culture tweets with embedded links to a YouTube video. But it is very important, to both you and your bank, that the available balance in your checking account is more than the amount of your current debit card purchase, whether you are at the local grocery store or at Harrods department store in London.
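Eventual consistency can be sketched as a toy model: writes land on one replica immediately and are queued for the rest, so a read from a different replica may briefly return stale data until a background sync catches up. Everything here (class, method names, the replica count) is invented purely for illustration; real stores like Cassandra or MongoDB handle replication far more subtly:

```python
class EventuallyConsistentStore:
    """Toy key-value store: writes propagate to replicas asynchronously."""

    def __init__(self, n_replicas=3):
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = []  # (replica_index, key, value) not yet applied

    def write(self, key, value):
        # The write lands on replica 0 right away...
        self.replicas[0][key] = value
        # ...and is queued for the others, to be applied "later".
        for i in range(1, len(self.replicas)):
            self.pending.append((i, key, value))

    def read(self, replica_index, key):
        return self.replicas[replica_index].get(key)

    def sync(self):
        # Stand-in for background replication catching up.
        for i, key, value in self.pending:
            self.replicas[i][key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("balance", 100)
print(store.read(0, "balance"))  # 100  -- the replica that took the write
print(store.read(1, "balance"))  # None -- a stale replica, before the sync
store.sync()
print(store.read(1, "balance"))  # 100  -- all replicas have converged
```

The window between the write and the sync is exactly the “given timeframe” described above: harmless for a social-media comment, unacceptable for a debit card balance.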
Traditional RDBMSs are subject to these same CAP restrictions, but RDBMS vendors have had decades to build tools and methods that minimize the impact of the trade-offs system architects must make. For instance, transaction logs and rollback mechanisms, mirroring, and fail-over capabilities let database administrators recover from system failures, both locally and globally, to a given point in time. Large companies that deal with global transactions have sophisticated procedures for real-time communication and for mirroring whole systems across continents while still providing disaster recovery.
For this reason, most massive web-based systems today combine Big Data systems and RDBMSs. For instance, when you purchase a book online, logging in and browsing titles and reviews means interacting with Big Data; entering your credit card information and submitting your order means interacting with a traditional RDBMS. The vendor’s ability to suggest ‘Other Titles You Might Enjoy’ comes from analyzing the browsing and purchasing patterns, and the web traffic, of millions of customers who share your characteristics. Doing that kind of analysis in a traditional RDBMS would be extremely time-consuming and computationally expensive; in a Big Data system it is much faster and easier.
Many vendors and open source projects are working both to ease system manageability and to narrow the gap in security and assurance between traditional RDBMSs and Big Data systems. Newer Big Data based data management systems, often called “NoSQL” (for “Not Only SQL”) systems, such as HBase, Cassandra, and MongoDB, attempt to address many of these CAP theorem trade-offs. Vendors in this space include 10gen, Couchbase, and DataStax. Vendors such as Cloudera, Hortonworks, and MapR offer tools that integrate the most commonly used Hadoop-based stacks for Big Data management and analysis. The large traditional RDBMS vendors, such as Oracle, IBM, SAP, and Microsoft, all offer either their own NoSQL databases and Big Data management tools or extensions to their existing products, aiming to integrate traditional RDBMSs with Big Data systems. These products are evolving rapidly, with each new release offering more functionality and integration.
This is an exciting time for systems integrators and data management professionals, as Big Data systems, both in-house and “in the Cloud”, provide previously unavailable flexibility and scalability. But IT management must keep in mind the inherent trade-offs that distributed systems face, whether they are new Big Data systems or traditional RDBMSs.
Plaster Group’s Data & Analytics consultants can help you manage your Big Data solutions.