DISCLAIMER: As with all other material on this blog, these are my thoughts and do NOT reflect the opinions of my employer.
I really like the tagline on our logo: big data. fast insights.
But leaving the marketing aside, what does it mean really? What is all the hoopla about big data analytics?
The way I look at things, a few key observations here are:
- Data is increasing. This is almost self-evident, so I won’t bother with presenting any evidence.
- Data is driving businesses more than ever. Whether it is search, advertising, insurance, finance, health care, governance — data is becoming an integral part of more and more business processes.
- Finally, data movement is slow. And I mean really really slow, compared to our processing and memory speeds. Once you go into the range of hundreds of terabytes of petabytes of data, you really don’t want to keep moving around that data into isolated silos for doing analytics.
Clearly, none of these observations is particularly new or insightful. However, I do think some of the implications of these observations are quite powerful and were new at least for me. For instance, (3) implies that once you have accumulated a lot of data in one place (imagine hundreds of TB or more), it is extremely difficult and time consuming to move that data around. This, in turn, means that more often than not, data is likely to reside in a single place.
Traditionally, it was not uncommon to have a large data warehouse that would be the repository of all data. Then smaller data sets could be carved out from this master data set (also known as data marts) as required. This approach is becoming increasingly unfeasible. Carving out 100TB data marts from a 1PB data warehouse is simply not going to scale.
At the same time, it is clear that a one-size-fits-all approach to data storage and analysis is not practical either. Some data sets naturally lend themselves to a relational data model, while others might be more suited to unstructured processing (Hadoop) or document oriented processing (CouchDB or MarkLogic) or graph analysis (Neo4J) and so on. Forcing a single model or access mechanism down all customers’ throat is not tenable.
So what would the ideal platform for big data analytics look like? One that allows you to store and access data in various ways, seamlessly.