This article covers a set of data terms: data lake, data swamp, data mart, data warehouse, data cube, data stream, and data virtualization. Its main focus is how these terms relate to one another.
These information technology (IT) terms have been around for a while. Some are much older than others, and often one is derived from another or defined as a subset of another. Their definitions are confusing and, more importantly, the terms are often used or applied incorrectly. This article synthesizes a common understanding of the definitions of these terms and the relationships between them.
What is a data lake?
A data lake is both an architectural strategy and an architectural destination (O’Brien, 2015).
Data lake as an architectural strategy
As an architectural strategy, a data lake is intended to provide the following capabilities (TeraData, 2014):
1. Capture and store raw data
2. Low cost scalability
3. Store both structured and unstructured data
4. Transform data
5. Structure definition on demand – schema on read/demand
6. Low schema/no schema on write
8. Handle big data, including but not limited to:
Machine and sensor data
9. Design at time of need
As an architectural strategy, a data lake also aims to (Khanna, 2016; Rivera, 2014):
1. Provide agility and accessibility for data analysis
2. Remove data silos by consolidation
3. Allow data input in a variety of formats
4. Reduce constraints on data structure and limits imposed by schema
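The "no schema on write, schema on read" capability listed above can be sketched in a few lines. This is only a minimal illustration, not a description of any particular data lake product: the raw events, the field names, and the `read_with_schema` helper are all hypothetical. Records are stored exactly as they arrive, and each reader applies its own structure when it reads.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (no schema on write).
RAW_EVENTS = [
    '{"device": "sensor-1", "temp_c": "21.5", "ts": "2017-01-11T10:00:00"}',
    '{"device": "sensor-2", "temp_c": "19.0", "battery": 0.8}',
    '{"user": "alice", "action": "login"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time (hypothetical helper).

    Keeps only records that contain every field named in the schema and
    casts each field to the requested type.
    """
    for line in raw_lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            yield {field: cast(record[field]) for field, cast in schema.items()}


# Two readers impose two different schemas on the same raw data.
temperature_schema = {"device": str, "temp_c": float}
battery_schema = {"device": str, "battery": float}

temperatures = list(read_with_schema(RAW_EVENTS, temperature_schema))
batteries = list(read_with_schema(RAW_EVENTS, battery_schema))

print(temperatures)
print(batteries)
```

The same stored lines serve both readers, and the login event is simply ignored by both because it does not match either schema.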
Data lake as an architectural destination
As an architectural destination, a data lake is a system or an ecosystem (Elliott, 2015; Intersog, 2016; Khanna, 2016; O’Brien, 2015; Rivera, 2014; TeraData, 2014), where an ecosystem is a community or environment of multiple systems that work together as a unit. A data lake is often described in terms of the Apache Hadoop ecosystem, a combination of multiple closely integrated systems that serve the same purpose (TeraData, 2014).
Data lake misconceptions
A data lake does NOT have to be Hadoop; it can be built on other systems or different data stores (TeraData, 2014). A data lake is often thought of as enterprise data management, but it is not there yet (Rivera, 2014). Key components that enterprise data management usually requires, including data quality, data lineage, and data governance, are purposefully removed or limited for the sake of flexibility (Rivera, 2014). A data lake does NOT exist to replace the data warehouse (Intersog, 2016; TeraData, 2014). Compared with a data warehouse, a data lake is a more fluid approach to writing data (any schema or structure) and to reading data (multiple schemas, as different users require) (Intersog, 2016).
What is a data swamp?
A data swamp is described as a data dump: storage where data arrives without context, metadata, or any sort of control over it. A data dump makes data useless because the data cannot be analyzed (Intersog, 2016). A data lake is most likely to turn into a data swamp when it has (1) no data governance (Rivera, 2014) and (2) no proper metadata or data quality assurance (Khanna, 2016).
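One way to picture the difference between a lake and a swamp is a simple metadata check over a catalog of stored files. The sketch below is illustrative only: the catalog entries, the paths, and the list of required metadata fields are assumptions made for the example, not something prescribed by the cited sources.

```python
# Hypothetical minimum metadata an entry needs before it can be analyzed.
REQUIRED_METADATA = ("owner", "source", "description")

# Hypothetical catalog of files sitting in the lake.
catalog = [
    {
        "path": "raw/sensors/2017-01-11.json",
        "owner": "iot-team",
        "source": "sensor gateway",
        "description": "temperature readings",
    },
    {"path": "raw/dump_final_v2.csv"},  # dumped with no context at all
    {
        "path": "raw/clicks.log",
        "owner": "web-team",
        "source": "",  # present but empty counts as missing
        "description": "clickstream",
    },
]


def missing_metadata(entry):
    """Return the required metadata fields that are absent or empty."""
    return [key for key in REQUIRED_METADATA if not entry.get(key)]


# Entries with gaps are the ones pushing the lake toward a swamp.
swamp_candidates = {}
for entry in catalog:
    gaps = missing_metadata(entry)
    if gaps:
        swamp_candidates[entry["path"]] = gaps

print(swamp_candidates)
```

A real governance process would check far more than three fields, but the principle is the same: data without context is flagged before it piles up.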
What is a data mart?
A data mart is a staging area for data that serves a particular segment of readers or a particular business unit (Intersog, 2016; Rouse, 2014A). According to Rouse (2014A), a data mart is often used tactically: it serves an immediate need and is a subset of a data warehouse.
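The idea of a data mart as a subset that serves one business unit can be sketched with an in-memory SQLite database. The table, the view, and the department names below are hypothetical, and a database view is just one simple way to model a mart for illustration; real data marts are often separate physical stores.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical organization-wide table standing in for the larger store.
conn.execute(
    "CREATE TABLE warehouse_sales (region TEXT, department TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [
        ("north", "marketing", 100.0),
        ("south", "finance", 250.0),
        ("north", "marketing", 40.0),
        ("east", "finance", 75.0),
    ],
)

# The "mart": only the rows and columns the marketing unit needs.
conn.execute(
    """
    CREATE VIEW marketing_mart AS
    SELECT region, amount
    FROM warehouse_sales
    WHERE department = 'marketing'
    """
)

mart_rows = conn.execute(
    "SELECT region, amount FROM marketing_mart ORDER BY amount"
).fetchall()
print(mart_rows)
```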
What is a data warehouse?
A data warehouse is a structured repository that houses a consolidation of all of an organization's data, based on a single common framework (Intersog, 2016; Rouse, 2014A). It serves the entire population of enterprise users through a single, common schema (Rouse, 2014A). A data warehouse comes with strategies and capabilities such as (TeraData, 2014):
1. Create a single version of the truth from multiple data sources
2. Support batch jobs
3. Support hundreds to thousands of concurrent users who are performing reporting or analytics tasks
4. Most data access is powered by SQL
5. A highly designed and architected system
   1. Highly designed data store
   2. Canonical form of data – schema on write
6. Serve multiple purposes
7. Used by hundreds or thousands of applications
8. Scale at moderate cost
9. Clean, safe, secure data
10. Transform once, use many
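The "canonical form of data, schema on write" point above is the mirror image of a data lake's schema on read: the structure is enforced when data is loaded, so bad rows never get in. The following is a minimal sketch using an in-memory SQLite database as a stand-in for a warehouse; the `sales` table, its constraints, and the `load` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is fixed up front; every write must conform to it.
conn.execute(
    """
    CREATE TABLE sales (
        sale_id INTEGER PRIMARY KEY,
        region  TEXT NOT NULL,
        amount  REAL NOT NULL CHECK (amount >= 0)
    )
    """
)


def load(rows):
    """Insert rows one by one, counting those the schema rejects."""
    accepted, rejected = 0, 0
    for row in rows:
        try:
            conn.execute(
                "INSERT INTO sales (sale_id, region, amount) VALUES (?, ?, ?)",
                row,
            )
            accepted += 1
        except sqlite3.IntegrityError:
            # Raised for NOT NULL and CHECK constraint violations.
            rejected += 1
    return accepted, rejected


result = load(
    [
        (1, "north", 120.0),
        (2, None, 50.0),     # rejected: region is required
        (3, "south", -5.0),  # rejected: amount must be non-negative
        (4, "south", 80.0),
    ]
)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(result, total)
```

Because validation happens once at load time, every later SQL query can rely on the data being clean, which is the "transform once, use many" trade-off.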
What is data virtualization?
According to Rouse (2014B), data virtualization is an approach to data management that allows applications to retrieve and manipulate data without requiring technical details about the data; in other words, it is an abstraction layer that eases data access. This approach hides technical details and makes data access agnostic of technology, pattern, and format (Rouse, 2014B).
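The abstraction-layer idea can be sketched as a thin class that exposes one query method regardless of where the data physically lives. Everything here is hypothetical and greatly simplified: the `CsvSource`, `SqlSource`, and `VirtualLayer` classes, the sample data, and the string-based filter matching are assumptions for illustration, not how any real virtualization product works.

```python
import csv
import io
import sqlite3


class CsvSource:
    """Hypothetical adapter for data held as CSV text."""

    def __init__(self, text):
        self._text = text

    def rows(self):
        return [dict(row) for row in csv.DictReader(io.StringIO(self._text))]


class SqlSource:
    """Hypothetical adapter for data held in a SQLite table."""

    def __init__(self, conn, table):
        self._conn = conn
        self._table = table  # trusted name; never pass user input here

    def rows(self):
        cursor = self._conn.execute(f"SELECT * FROM {self._table}")
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]


class VirtualLayer:
    """Callers ask for a dataset by name and never see its storage format."""

    def __init__(self):
        self._sources = {}

    def register(self, name, source):
        self._sources[name] = source

    def query(self, name, **filters):
        # Values are compared as strings so CSV and SQL rows filter alike.
        return [
            row
            for row in self._sources[name].rows()
            if all(str(row.get(k)) == str(v) for k, v in filters.items())
        ]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

layer = VirtualLayer()
layer.register("customers", SqlSource(conn, "customers"))
layer.register("orders", CsvSource("id,customer_id,total\n10,1,99.5\n11,2,15.0\n"))

# Same call shape for both, whatever the underlying technology.
print(layer.query("customers", id=2))
print(layer.query("orders", customer_id=1))
```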
What is a data cube?
A data cube is data broken down into dimensions, where each dimension is a table or source and the dimensions are joined at a reference point to increase the detail of the data (Intersog, 2016). Data cubes are often used for reporting and presentation, which allows business consultants to easily add detail to a data composition and get a holistic picture of a subject or object.
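To make the dimension idea concrete, here is a small sketch in which two dimension tables are joined to sales records through their IDs and totals are precomputed for every combination of dimension values. The products, regions, and amounts are made up, and using "ALL" to mark a rolled-up dimension is just one common convention chosen for this example.

```python
from collections import defaultdict

# Hypothetical dimension tables, keyed by ID (the "reference point").
products = {1: "laptop", 2: "phone"}
regions = {10: "north", 20: "south"}

# Hypothetical fact records: (product_id, region_id, sales amount).
facts = [
    (1, 10, 1200.0),
    (2, 10, 600.0),
    (1, 20, 1100.0),
    (2, 20, 650.0),
    (2, 10, 580.0),
]


def build_cube(fact_rows):
    """Aggregate amounts for each (product, region) cell and each roll-up."""
    cube = defaultdict(float)
    for product_id, region_id, amount in fact_rows:
        product = products[product_id]  # join fact to product dimension
        region = regions[region_id]     # join fact to region dimension
        for key in (
            (product, region),  # most detailed cell
            (product, "ALL"),   # rolled up over regions
            ("ALL", region),    # rolled up over products
            ("ALL", "ALL"),     # grand total
        ):
            cube[key] += amount
    return dict(cube)


cube = build_cube(facts)

print(cube[("phone", "north")])  # one detailed cell
print(cube[("laptop", "ALL")])   # one product across all regions
print(cube[("ALL", "ALL")])      # everything
```

A report can then move between the summary and the detail simply by looking up a different key, which is the "add details to get a holistic picture" use described above.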
What is a data stream?
A data stream is a data pipeline, process, or thread that moves data from a source system to a destination.
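A very small way to picture this is a Python generator that hands records from a source to a destination one at a time, rather than loading everything at once. The source records, the `read_source` and `stream` functions, and the in-memory list used as the destination are all hypothetical stand-ins for real systems.

```python
def read_source(rows):
    """Yield records one at a time, as a source system would emit them."""
    for row in rows:
        yield row


def stream(records, destination):
    """Move each record to the destination as soon as it arrives."""
    moved = 0
    for record in records:
        destination.append(record)
        moved += 1
    return moved


# Hypothetical source data and an in-memory list as the destination.
source_rows = [
    {"device": "sensor-1", "temp_c": 21.5},
    {"device": "sensor-2", "temp_c": 19.0},
    {"device": "sensor-1", "temp_c": 22.3},
]
destination = []

moved = stream(read_source(source_rows), destination)
print(moved, destination[0])
```

Because the generator is lazy, no record is read from the source until the pipeline asks for it, which is the property that lets a stream keep running over data that never ends.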
Overall Picture of All These Systems
References
Bhardwaj, A., Deshpande, A., Elmore, A. J., Karger, D., Madden, S., Parameswaran, A., … Zhang, R. (2015). Collaborative Data Analytics with DataHub. Proc. VLDB Endow., 8(12), 1916–1919. https://doi.org/10.14778/2824032.2824100
Cloudera. (2016). Neil-Raden-Data-Discovery-Analytics-and-the-Enterprise-Data-Hub-Volume-2.pdf. Retrieved January 11, 2017, from https://www.cloudera.com/content/dam/cloudera/Resources/PDF/whitepaper/Neil-Raden-Data-Discovery-Analytics-and-the-Enterprise-Data-Hub-Volume-2.pdf
Elliott, T. (2015, February 10). From Data Lakes to Data Swamps. Retrieved January 11, 2017, from http://www.zdnet.com/article/from-data-lakes-to-data-swamps/
Intersog. (2016, October 5). What Is The Difference Between Data Lakes, Data Marts, Data Swamps, And Data Cubes? Retrieved from http://intersog.com/blog/what-is-the-difference-between-data-lakes-data-marts-data-swamps-and-data-cubes/
Khanna, A. (2016, April 15). How to Keep Your Data Lake From Becoming a Data Swamp. Retrieved January 11, 2017, from http://www.reltio.com/about/news/2016/4/how-to-keep-your-data-lake-from-becoming-a-data-swamp
O’Brien, J. (2015). Data-Lake. Retrieved from https://hortonworks.com/wp-content/uploads/2012/06/Data-Lake-Report-Final-1.pdf
Rivera, J. (2014, July 28). Gartner Says Beware of the Data Lake Fallacy. Retrieved January 11, 2017, from http://www.gartner.com/newsroom/id/2809117
Rouse, M. (2014A). What is data mart (datamart)? – Definition from WhatIs.com. Retrieved January 12, 2017, from http://searchsqlserver.techtarget.com/definition/data-mart
Rouse, M. (2014B). What is data virtualization? – Definition from WhatIs.com. Retrieved January 12, 2017, from http://searchdatamanagement.techtarget.com/definition/data-virtualization
TeraData. (2014). TeradataHortonworks_Datalake_White-Paper_20140410.pdf. Retrieved from https://hortonworks.com/wp-content/uploads/2014/05/TeradataHortonworks_Datalake_White-Paper_20140410.pdf