Big Data
Big data is data that cannot be processed using traditional processing techniques. A good rule of thumb is that if we need more than one machine to process our data, chances are we're dealing with big data.
To process big data in a reasonable amount of time, we need programs that span several physical/virtual machines, running in a coordinated fashion.
Big data has the following 3 characteristics (also known as the 3 V's):
Volume: big data is voluminous, i.e. it is produced in large quantities
Velocity: big data is produced at an extremely quick rate
Variety: big data doesn't only refer to data that can be stored in a spreadsheet or a structured database. Most big data is unstructured and consists of different kinds of data such as text, images, videos, audio, etc.
Common examples of big data include the search queries that Google receives, the images/videos/posts that Facebook receives, and the data produced by smartphones and smart wearable devices.
We need proper tools and techniques to make the most of this seemingly endless data.
The underlying principle for analyzing big data is the Divide and Conquer methodology. The following figure explains how a workload can be partitioned among several machines, processed in parallel, and how the partial results can be combined to obtain a solution for the original problem.
It is important to note that speeding up the processing is not the goal here. We use Divide and Conquer only because we can't possibly process such large quantities of data on a single machine.
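As an illustration, here is a minimal sketch of Divide and Conquer in Python, assuming the workload is a simple word count and the "dataset" is already partitioned into chunks; the chunk contents, worker count, and function names are purely illustrative:

```python
from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    # Worker: compute a partial result for one partition of the data.
    return Counter(chunk.split())

def merge(partials):
    # Combine the partial results into the answer for the original problem.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    # Partition the workload (a toy dataset split into three chunks).
    chunks = [
        "big data needs many machines",
        "many machines process data in parallel",
        "partial results are combined at the end",
    ]
    with Pool(processes=3) as pool:       # one worker per chunk
        partial_results = pool.map(count_words, chunks)
    print(merge(partial_results))
```

In a real system the partitions would live on different machines rather than in a local process pool, but the structure is the same: partition, process in parallel, combine.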
However, parallelization comes with its own share of issues, such as communication between the workers, sharing of resources, etc. Also, not all workers will complete their share of work at the same time. Therefore, we need a proper synchronization mechanism.
Some questions that might arise when parallelization is implemented:
How do we assign work units to workers?
What if we have more work units than workers?
What if workers need to share partial results?
How do we aggregate partial results?
How do we know all the workers have finished?
What if workers die?
Managing multiple workers is challenging because we don't know the order in which they run, how they share resources, or the order in which they complete execution. Operating Systems concepts such as semaphores, mutex locks, barriers, deadlocks etc become important to ensure synchronization.
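To make the synchronization point concrete, here is a minimal sketch in Python, with threads standing in for workers; the shared counter, the increment count, and the number of workers are illustrative:

```python
import threading

counter = 0
lock = threading.Lock()             # mutex: protects the shared counter
barrier = threading.Barrier(4)      # all 4 workers must reach this point

def worker(n_increments):
    global counter
    for _ in range(n_increments):
        with lock:                  # only one worker updates the counter at a time
            counter += 1
    barrier.wait()                  # wait until every worker has finished its share

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                      # 40000, regardless of scheduling order
```

Without the lock, the workers could interleave their updates and lose increments; the barrier is one way of knowing that every worker has finished.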
Currently, we have programming models like:
shared memory: wherein workers share memory and communicate via the shared memory
message passing: wherein workers communicate with each other by passing messages (a sketch contrasting the two models follows this list)
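A minimal sketch contrasting the two models in Python, using processes as workers; the values being exchanged and the worker count are illustrative:

```python
from multiprocessing import Process, Queue, Value

def shared_memory_worker(total):
    # Shared memory: workers read/write memory that all of them can see.
    with total.get_lock():
        total.value += 10

def message_passing_worker(outbox):
    # Message passing: workers communicate only by sending messages.
    outbox.put(10)

if __name__ == "__main__":
    # Shared-memory style: a counter visible to every worker.
    total = Value("i", 0)
    workers = [Process(target=shared_memory_worker, args=(total,)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("shared memory:", total.value)          # 40

    # Message-passing style: workers send their results over a queue.
    outbox = Queue()
    workers = [Process(target=message_passing_worker, args=(outbox,)) for _ in range(4)]
    for w in workers:
        w.start()
    results = [outbox.get() for _ in range(4)]
    for w in workers:
        w.join()
    print("message passing:", sum(results))       # 40
```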
Some common design patterns in use today include:
master-slaves: one machine allocates work to several other machines and supervises their work
producer-consumer: a set of producer machines produce an output that becomes the input for a set of consumer machines and this workflow continues
shared work queues: jobs to be processed are placed in a work queue, and workers pick up these jobs one at a time as and when they become free (a sketch of this pattern follows this list)
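For instance, a minimal sketch of the shared work queue pattern in Python, with threads as workers and squaring numbers standing in for the "jobs"; the job contents and worker count are illustrative:

```python
import queue
import threading

work_queue = queue.Queue()

def worker(worker_id, results):
    # Each worker repeatedly pulls the next available job from the shared queue.
    while True:
        try:
            job = work_queue.get_nowait()
        except queue.Empty:
            return                                   # no jobs left: the worker exits
        results.append((worker_id, job, job * job))  # "process" the job
        work_queue.task_done()

if __name__ == "__main__":
    for job in range(10):                            # jobs waiting to be processed
        work_queue.put(job)

    results = []
    workers = [threading.Thread(target=worker, args=(i, results)) for i in range(3)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(results)   # each job was handled by whichever worker happened to be free
```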
Some key principles for processing big data at scale:
Scale OUT, not UP: instead of investing in a single high-end machine, invest in many low-end (commodity) machines and process the workload using Divide and Conquer
Move processing to the data: instead of moving large quantities of data across the network, move code to where the appropriate data is located; this will reduce network costs and latency
Process data sequentially; avoid random access: disk seeks are expensive, while sequential disk throughput is reasonable (a small sketch of sequential streaming follows this list)
Seamless scalability: adding more machines should yield roughly proportional gains in processing capacity, without changes to the program
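As a small illustration of sequential processing, here is a sketch in Python that streams a file line by line instead of loading it whole or seeking around in it; the file path is hypothetical:

```python
def stream_word_count(path):
    # Scan the file sequentially, one line at a time: the disk-friendly access
    # pattern, since sequential throughput is far better than repeated seeks.
    counts = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            for word in line.split():
                counts[word] = counts.get(word, 0) + 1
    return counts

# Hypothetical usage: counts = stream_word_count("/data/partition-0001.txt")
```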
The need for a technique to process big data gave rise to Hadoop.
The next chapter gives an overview of Hadoop.