What Is Big Data and Hadoop?
Synopsis
Have you ever wondered how all of this data is being managed? We live in the data age. It is not easy to quantify the total volume of data stored electronically, but an IDC estimate put the size of the "digital universe" at 4.4 zettabytes in 2013 and forecast a tenfold growth to 44 zettabytes by 2020. A zettabyte is 10²¹ bytes, or equivalently one million petabytes, or one billion terabytes. That is more than one disk drive's worth for every person in the world. Creating data is easy, whereas storing and processing it is quite tricky. With today's ever-growing Internet, data is accumulating many times faster than ever before. Researchers estimate that, on average, roughly 6 TB of unstructured data is uploaded every minute, and storing and processing it is a tough job. That voluminous data is what we call "Big Data".
Big Data Generators
- The Internet Archive stores around 2 petabytes of data and is growing at a rate of about 20 terabytes per month.
- Ancestry.com, the genealogy site, stores around 2.5 petabytes of data.
- The Large Hadron Collider near Geneva, Switzerland, produces about 30 petabytes of data per year.
- The New York Stock Exchange generates about one terabyte of new trade data per day.
- Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
- Sensors, online shopping, NCDC weather records, CCTV cameras, airline data, hospitality data, and much more. If I went on listing sources, the list itself would turn into Big Data, ha ha! Kidding. In short, anything in this Internet world that grows beyond the capacity of conventional systems can be treated as Big Data.
- Did you know that roughly 90 percent of all the data in the world was generated in the last four years, while the remaining 10 percent accumulated over all the time before that?
History to Today
In the 1990s the maximum hard disk capacity was 1 GB to 20 GB, RAM was 64 MB to 128 MB, and disk read speeds were in the region of 10 KBps. In contrast, today we have 1 TB hard disks, 4 GB to 16 GB of RAM, and read speeds of at least 100 MBps. As you can see, hard disk capacity has grown roughly a thousandfold, and RAM and read speeds have grown along with it. Even so, reading a full modern disk end to end still takes hours, as the rough sketch below illustrates.
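As a back-of-the-envelope illustration of those figures (the capacities and read speeds are the approximate numbers quoted above, not exact specifications), here is a minimal Java sketch that estimates how long it takes to read an entire disk from end to end:

```java
public class DiskReadTime {

    // Rough figures from the article: capacities in bytes, read speeds in bytes per second.
    static final long KB = 1024L, MB = 1024L * KB, GB = 1024L * MB, TB = 1024L * GB;

    static double hoursToRead(long capacityBytes, long bytesPerSecond) {
        return capacityBytes / (double) bytesPerSecond / 3600.0;
    }

    public static void main(String[] args) {
        // 1990s: ~1 GB disk read at ~10 KBps
        System.out.printf("1990s disk  (1 GB @ 10 KBps) : ~%.0f hours%n",
                hoursToRead(1 * GB, 10 * KB));
        // Today: ~1 TB disk read at ~100 MBps
        System.out.printf("Modern disk (1 TB @ 100 MBps): ~%.1f hours%n",
                hoursToRead(1 * TB, 100 * MB));
    }
}
```

Even at 100 MBps, scanning a single 1 TB drive takes close to three hours, which is exactly the kind of delay Hadoop later attacks by reading from many disks in parallel.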
Brief Interpretation
- Example 1: The capacity of a room is 10 tables; if the count goes beyond 10, we have to find another place to store them.
- Example 2: A farmer works hard and harvests 10 bags of rice, which is easy enough to store at home. The following season he produces 100 bags, which he cannot stock in his house, yet he cannot throw them away either. So he chooses an alternative, say a godown (warehouse).
- As the examples above portray, data that keeps being generated should be stored, not discarded. The same idea applies on the technology side. 1 TB of your data can be stored on external hard disks, but what if it grows to 2 TB or 3 TB, or keeps doubling and tripling day by day? You then turn to data maintenance centers (servers), where enterprises and governments store and audit their data, such as IBM, EMC, or NetApp servers, also called SAN boxes. Anyone can approach such a center and preserve their data there.
- Let's assume I have the movie Avatar in my file system. I don't want to watch it right now, so why should I keep it in my local file system? But some day or other I will want to watch it, so it is better to store it in the cloud (an interconnection of servers). Anything you store, you will one day want to process.
- The processing itself can be done in any language, say Java, Oracle, Python scripts, and so on. Other common challenges that arise include the significant cost of resources, handling communication latencies, handling heterogeneous compute resources, synchronization across nodes, and load balancing.
- Shipping 100 KB of code to process the 2 TB of data sitting in those centers is better than pulling 2 TB of data into your own file system. However, before "Hadoop" (the solution for Big Data) was introduced, the computation could not be done on the data maintenance servers themselves, because computation is processor bound: wherever you run the program, that one system first has to fetch the data and store it before processing. A small sketch of that fetch-then-process style follows this list.
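To make the pre-Hadoop picture concrete, here is a minimal Java sketch of the fetch-then-process style described above: the whole data set has to travel to the one machine running the program, which then reads it sequentially from its local file system. The file name is only a placeholder for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Pre-Hadoop style: copy the whole data set to one machine, then scan it there.
// A single process and a single disk do all the work, so the run time grows
// linearly with the size of the data.
public class SingleMachineScan {
    public static void main(String[] args) throws IOException {
        long records = 0;
        // "trades.csv" is a hypothetical local copy of data fetched from a server.
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("trades.csv"))) {
            while (reader.readLine() != null) {
                records++;   // stand-in for whatever per-record processing is needed
            }
        }
        System.out.println("Records processed on this one machine: " + records);
    }
}
```

At a read speed of around 100 MBps, scanning 2 TB this way takes on the order of five to six hours, and that is before counting the time spent copying the data over the network in the first place.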
Arising Consequences
- Volume: the brisk growth of data from gigabytes to terabytes to petabytes to zettabytes and beyond.
- Velocity: the speed at which data keeps arriving; every transmission and interaction continuously adds to it.
- Variety: an RDBMS knows how to store structured data and process it by writing queries, but it cannot handle unstructured and semi-structured data, which together make up roughly 70 to 80 percent of all data. Semi-structured data is what we get, for example, from log files. If you log in to Gmail, a corresponding log entry is stored on the Gmail server, and there is no restriction that a person holds only one Gmail or Yahoo account; there may be three, four, or many. Say a user has 4 Gmail accounts and signs in to each 5 times a day: every sign-in generates and stores a log entry, giving 4 x 5 = 20 log entries per day. Likewise, 3 Yahoo accounts used 4 times a day give 3 x 4 = 12, and 2 Facebook accounts used 5 times a day give another 10, for a total of 42 log entries per day from a single user alone (the arithmetic is spelled out in the short sketch after this list). As per a 2015 survey, approximately 400 crore (4 billion) people are Internet users, and on average most of them hold at least two Gmail accounts along with other accounts, so the daily total amounts to a huge volume of data, all of which has to be stored and processed. Moreover, at that scale, failures such as disk failures, compute node failures, and network failures become a common occurrence, making fault tolerance a very important aspect of such systems.
- Volume, Velocity, and Variety together are the three V's that IBM uses to define Big Data.
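Here is a tiny Java sketch of the log-entry arithmetic above. The account counts, sign-in frequencies, and the 4-billion-user figure are just the illustrative numbers quoted in the text, and the final extrapolation assumes, purely for illustration, that every user generates logs at that rate:

```java
public class LogFileEstimate {
    public static void main(String[] args) {
        // Per-user figures from the example above.
        long gmail    = 4 * 5;  // 4 Gmail accounts, 5 sign-ins each per day
        long yahoo    = 3 * 4;  // 3 Yahoo accounts, 4 sign-ins each per day
        long facebook = 2 * 5;  // 2 Facebook accounts, 5 sign-ins each per day

        long perUserPerDay = gmail + yahoo + facebook;  // 42 log entries per day
        long internetUsers = 4_000_000_000L;            // ~400 crore users (2015 survey figure)

        System.out.println("Log entries per user per day  : " + perUserPerDay);
        System.out.println("Log entries per day, all users: " + perUserPerDay * internetUsers);
    }
}
```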
Introduction of Hadoop
- It is feasible to store 200 GB of data on a 500 GB hard disk, but it is not so easy to process that data whenever required; it consumes a lot of time. "Hadoop" was introduced to bring that retrieval and processing time down.
- Put simply: suppose the construction of your house takes 12 months with only one person working. If the same work is shared among 12 people, it can be completed in about one month. The same idea applies to Hadoop. Storing, processing, and analyzing petabytes of data in a meaningful and timely manner requires many compute nodes (systems) with thousands of disks and thousands of processors, together with the ability to efficiently communicate massive amounts of data among them. As you can infer, developing and maintaining distributed parallel applications to process massive amounts of data while handling all these issues is not an easy task. This is where "Apache Hadoop" comes to our rescue.
- NOTE: Google was one of the first organizations to face the problem of processing massive amounts of data. Google built a framework for large-scale data processing, borrowing the map and reduce paradigms from the functional programming world, and named it MapReduce. At the foundation of Google MapReduce was the Google File System, a high-throughput parallel file system that enables the reliable storage of massive amounts of data on commodity computers. The seminal research publications that introduced Google MapReduce and the Google File System can be found at http://research.google.com/archive/mapreduce.html and http://research.google.com/archive/gfs.html.
- Apache Hadoop MapReduce is the most widely known and widely used open source implementation of the Google MapReduce paradigm, and the Hadoop Distributed File System (HDFS) provides an open source implementation of the Google File System. A small word-count job written against the Hadoop MapReduce API is sketched below to give a flavor of what such programs look like.
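The following sketch closely follows the standard word-count example from the Apache Hadoop MapReduce tutorial rather than anything specific to this article. Mappers run in parallel on the nodes holding the HDFS blocks of the input and emit (word, 1) pairs, and reducers sum the counts for each word, which is the "send the program to the data" idea described earlier:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: each mapper processes one split of the input (an HDFS block,
  // read on the node where it is stored) and emits (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: all counts for the same word arrive at one reducer and are summed.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Such a job would typically be packaged into a jar and submitted with something like `hadoop jar wordcount.jar WordCount <input dir> <output dir>`, where the input and output paths are directories in HDFS.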