This is an introductory article on the Big Data topic. It begins with the definition of Big Data – the basic
Big Data is a big topic today, but how big is it? That is a moving target and will continue to be explored for many years to come. Regardless, it is important to start with the basics, and that means the definition as understood by those who study it. A concrete definition is the foundation of knowledge, and the same applies to this topic: Big Data.
What is Big Data?
Big Data refers to datasets whose size exceeds the ability of typical databases and systems to collect, maintain, manage, and extract value from them (Manyika, Chui, Brown, Bughin, Dobbs, Roxburgh, & Byers, 2011). "Big" does not imply a fixed threshold in data volume or size; it is relative to the capability and storage limits of the systems and databases currently used to draw maximum value from the data (Manyika, Chui, Brown, Bughin, Dobbs, Roxburgh, & Byers, 2011). In other words, data volume is considered big when it forces us to look beyond the methods and solutions prevalent at the time of evaluation (Jacobs, 2009).
The "big" in Big Data is about data management challenges that go beyond data volume (a storage size challenge); it also involves the high velocity of incoming data (a data pipeline/network challenge) and data variety (challenges of format, standards, structured versus unstructured data, creation, and parsing) (Gantz & Reinsel, 2012). Data can arrive quickly from multiple sources, in high volume and in different forms and formats, and is expected to be processed or turned into insight within seconds. Doug Laney (2001) anticipated this nearly twenty years ago and called it 3D data management, where the three dimensions are volume, velocity, and variety: the triple V's, or 3V's, that define Big Data today. Data veracity and data value are the fourth and fifth V's; they were always present but were added and emphasized later to complete the Big Data picture (Boyd & Crawford, 2011; Zikopoulos, deRoos, Parasuraman, Deutsch, Corrigan & Giles, 2013).
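The idea that "big" is relative to the current system, and that it spans volume, velocity, and variety, can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the class names `Workload` and `SystemLimits`, the function `strained_dimensions`, and all the numbers are hypothetical and do not come from the cited sources.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Workload:
    """A dataset described along the three V's (illustrative fields only)."""
    volume_bytes: int          # volume: total size at rest
    ingest_bytes_per_sec: int  # velocity: how fast new data arrives
    formats: Set[str]          # variety: distinct formats to be handled


@dataclass
class SystemLimits:
    """What the current system can handle (hypothetical numbers)."""
    max_volume_bytes: int
    max_ingest_bytes_per_sec: int
    supported_formats: Set[str]


def strained_dimensions(w: Workload, s: SystemLimits) -> List[str]:
    """Return the V's on which the workload exceeds the system's limits."""
    strained = []
    if w.volume_bytes > s.max_volume_bytes:
        strained.append("volume")
    if w.ingest_bytes_per_sec > s.max_ingest_bytes_per_sec:
        strained.append("velocity")
    if not w.formats <= s.supported_formats:  # some format is unsupported
        strained.append("variety")
    return strained


TB = 10**12
MB = 10**6

workload = Workload(5 * TB, 200 * MB, {"csv", "xml", "image"})
single_server = SystemLimits(2 * TB, 50 * MB, {"csv"})
cluster = SystemLimits(500 * TB, 1000 * MB, {"csv", "xml", "image", "log"})

print(strained_dimensions(workload, single_server))  # ['volume', 'velocity', 'variety']
print(strained_dimensions(workload, cluster))        # []
```

The same workload counts as "big" for the single server on all three dimensions, yet not for the larger cluster, which mirrors the relative definition given above.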
Summary of the 3V’s (or 5V’s) of Big Data:
- Volume
  - Relative to the limitations of the current system/database
  - Today we are looking at exabytes, and even zettabytes
  - (Gantz & Reinsel, 2012; Mayer-Schönberger & Cukier, 2013)
- Velocity
  - Rate of data input/change: the data acceptance challenge
  - High throughput of small data movements
  - Rate of data analysis as data flows in: the challenge of timely data processing
  - Acting upon real-time/near-real-time data before the information becomes worthless
  - (Kraska, 2013; Stonebraker, Madden & Dubey, 2013)
- Variety
  - Diverse data sources
  - Differences in data format and structure: XML, plain text, log files, forum posts, images, text messages, geolocation coordinates, etc.
  - Integration challenge between structured data (with a schema/standard), semi-structured data, and unstructured data (without any standard or schema)
  - Diverse data requires a high degree of metadata to provide data context, such as data source, data recording method, data semantics, data processing method, and application method
  - Metadata = data validity
  - (Agrawal, Bernstein, Bertino, Davidson, Dayal, Franklin, Gehrke, Haas, Halevy, Han, Jagadish, Labrinidis, Madden, Papakonstantinou, Patel, Ramakrishnan, Ross, Shahabi, Suciu, Vaithyanathan & Widom, 2012)
- Veracity
  - “Conformity with truth or fact”, according to an online dictionary
  - The truthfulness, precision, and certainty of data
  - Those in the Big Data domain often scrutinize their data for the opposite: uncertainty, imprecision, and absence of truth or fact
  - Focus on mistakes made during data generation, hidden or incomplete data, and the level of bias during data collection or aggregation
  - (Zikopoulos, deRoos, Parasuraman, Deutsch, Corrigan & Giles, 2013)
- Value
  - Focus on the usefulness of data at the insight-building stage
  - Timeliness, completeness, correctness, interconnectedness, and holism for decision making
  - (Boyd & Crawford, 2011)
- Big Data sources are divided into five key categories (Soares, 2012):
  - Web and social media
  - Machine-to-machine data
  - Big transaction data
  - Biometrics
  - Human-generated data
  - (Chen, Chen, Du, Li, Lu, Zhao, & Zhou, 2013; Soares, 2012)
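The variety and veracity points in the summary can be made concrete with a minimal Python sketch. It parses records from three formats into one common shape, attaches metadata about where each record came from, and then computes a simple completeness ratio as a rough stand-in for one aspect of veracity. The field names, the log line layout, the sample records, and the helper names (`normalize`, `completeness`) are all assumptions chosen for this example, not something prescribed by the cited authors.

```python
import json
import xml.etree.ElementTree as ET

REQUIRED_FIELDS = ("user", "message")  # hypothetical schema for this example


def from_json(raw: str) -> dict:
    return json.loads(raw)


def from_xml(raw: str) -> dict:
    # Flatten the direct children of the root element into a dict.
    root = ET.fromstring(raw)
    return {child.tag: child.text for child in root}


def from_log_line(raw: str) -> dict:
    # Assumed log layout: "<user> <free-text message>"
    user, _, message = raw.partition(" ")
    return {"user": user, "message": message}


PARSERS = {"json": from_json, "xml": from_xml, "log": from_log_line}


def normalize(raw: str, fmt: str, source: str) -> dict:
    """Parse one raw record and attach metadata describing its origin."""
    return {"data": PARSERS[fmt](raw), "meta": {"source": source, "format": fmt}}


def completeness(records: list) -> float:
    """Share of records in which every required field is present and non-empty."""
    if not records:
        return 0.0
    ok = sum(all(r["data"].get(f) for f in REQUIRED_FIELDS) for r in records)
    return ok / len(records)


records = [
    normalize('{"user": "ann", "message": "hello"}', "json", "web"),
    normalize("<post><user>bob</user><message>hi</message></post>", "xml", "forum"),
    normalize("carl disk almost full", "log", "server"),
    normalize('{"user": "dee"}', "json", "web"),  # "message" is missing
]

print(completeness(records))  # 0.75
```

Keeping the metadata next to each parsed record is one way to preserve the context (source and format) that the variety bullets call for. Real veracity checks would also need to look at bias and precision, which this sketch does not attempt.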
References
Agrawal, D., Bernstein, P., Bertino, E., Davidson, S., Dayal, U., Franklin, M., Gehrke, J., Haas, L., Halevy, A., Han, J., Jagadish, H., Labrinidis, A., Madden, S., Papakonstantinou, Y., Patel, J., Ramakrishnan, R., Ross, K., Shahabi, C., Suciu, D., Vaithyanathan, S. & Widom, J. (2012, March). Challenges and Opportunities with Big Data: A community white paper developed by leading researchers across the United States. Whitepaper, Computing Community Consortium. Retrieved from http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf
Boyd, D. & Crawford, K. (2011). Six Provocations for Big Data. In A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society. doi: 10.2139/ssrn.1926431.
Chen, J., Chen, Y., Du, X., Li, C., Lu, J., Zhao, S. & Zhou, S. (2013). Big data challenge: a data management perspective. Frontiers of Computer Science, 7(2):157–164. doi: 10.1007/s11704-013-3903-7.
Gantz, J. & Reinsel, D. (2012). The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East. Sponsored by EMC. Retrieved April 6, 2017, from http://emc2.com.co/leadership/digital-universe/2012iview/index.htm
Jacobs, A. (2009). The Pathologies of Big Data. Communications of the ACM, 52(8):36–44.
Kraska, T. (2013). Finding the Needle in the Big Data Systems Haystack. IEEE Internet Computing, 17(1):84–86. ISSN 1089-7801. doi: 10.1109/MIC.10.
Laney, D. (2001). 3D Data Management: Controlling Data Volume, Velocity, and Variety. Retrieved April 6, 2017, from https://www.bibsonomy.org/bibtex/263868097d6e1998de3d88fcbb7670ca6/sb3000
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C. & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey & Company. Retrieved April 6, 2017, from http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/big-data-the-next-frontier-for-innovation
Mayer-Schönberger, V. & Cukier, K. (2013). Big Data – A Revolution That Will Transform How We Live, Work and Think. John Murray (Publishers).
McAfee, A. & Brynjolfsson, E. (2012). Big Data: The Management Revolution. Harvard Business Review, October 2012:60–68.
Soares, S. (2012). Big Data Governance – An Emerging Imperative. MC Press Online, LLC, 1st edition.
Stonebraker, M., Madden, S. & Dubey, P. (2013). Intel “Big Data” Science and Technology Center Vision and Execution Plan. SIGMOD Record, 42(1):44–49.
Zikopoulos, P., deRoos, D., Parasuraman, K., Deutsch, T., Corrigan, D. & Giles, S. (2013). Harness the Power of Big Data: The IBM Big Data Platform. McGraw-Hill.