Big Data

What is Big Data?

Big data is data that cannot be processed using traditional processing techniques. A good rule of thumb is that if we need more than one machine to process our data, chances are we're dealing with big data.

To process big data in a reasonable amount of time, we need programs that span several physical or virtual machines and run in a coordinated fashion.

Big data has the following 3 characteristics (also known as the 3 V's):

  • Volume: big data is, by definition, voluminous, i.e. it is produced and stored in very large quantities

  • Velocity: big data is produced at a very high rate

  • Variety: big data is not limited to data that fits in a spreadsheet or a structured database. Most big data is unstructured and consists of different kinds of data such as text, images, videos, audio, etc.

Common examples of big data include the search queries that Google receives, the images/videos/posts that Facebook receives, data produced by smartphones and wearable devices, etc.

We need proper tools and techniques to make the most of this seemingly endless data.

How do we analyze Big Data?

The underlying principle behind analyzing big data is Divide and Conquer. The following figure shows how a workload can be partitioned among several machines, processed in parallel, and how the partial results can be combined to obtain a solution to the original problem.

It is important to note that speeding up the processing is not the goal here. We use Divide and Conquer only because we can't possibly process such large quantities of data on a single machine.
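As a rough illustration of this divide-and-conquer flow, the sketch below uses Python's multiprocessing module and a toy word-count workload (both chosen purely for illustration, not taken from the text above): the input is partitioned, each worker processes one partition, and the partial counts are merged at the end.

```python
from collections import Counter
from multiprocessing import Pool

def count_words(partition):
    """Worker: count word occurrences in one partition of the input."""
    return Counter(partition.split())

def merge(partial_results):
    """Combine the workers' partial counts into the final answer."""
    total = Counter()
    for partial in partial_results:
        total.update(partial)
    return total

if __name__ == "__main__":
    # A toy "big" dataset, already split into partitions (one per worker).
    partitions = [
        "big data needs many machines",
        "many machines process data in parallel",
        "partial results are combined at the end",
    ]
    with Pool(processes=3) as pool:
        partial_results = pool.map(count_words, partitions)  # process in parallel
    print(merge(partial_results))                            # combine partial results
```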

However, parallelization comes with its own share of issues, such as communication between workers and sharing of resources. Moreover, not all workers will complete their share of the work at the same time, so we need a proper synchronization mechanism.

Some questions that might arise when parallelization is implemented:

  • How do we assign work units to workers?

  • What if we have more work units than workers?

  • What if workers need to share partial results?

  • How do we aggregate partial results?

  • How do we know all the workers have finished?

  • What if workers die?

Managing multiple workers is challenging because we do not know the order in which they run, how they share resources, or the order in which they finish. Operating-systems concepts such as semaphores, mutex locks, and barriers (and hazards such as deadlock) become important for ensuring correct synchronization.
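As a minimal sketch of why synchronization matters, the example below uses plain Python threads and a mutex as stand-ins for distributed workers and a coordination service (an illustrative simplification, not something described above): the lock keeps concurrent updates to a shared counter from interfering, and join() answers the question of when all the workers have finished.

```python
import threading

counter = 0
lock = threading.Lock()              # mutex protecting the shared counter

def worker(increments):
    global counter
    for _ in range(increments):
        with lock:                   # only one worker may update the counter at a time
            counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()                         # "how do we know all the workers have finished?"

print(counter)                       # always 400000, because updates are synchronized
```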

Two programming models are in common use today (a toy sketch of both follows this list):

  • shared memory: workers communicate by reading and writing a common region of memory

  • message passing: workers communicate by sending messages to one another
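The sketch below contrasts the two models using Python's multiprocessing primitives: a Value living in shared memory versus a Queue of messages. The three-worker setup and the trivial "add up some numbers" task are illustrative assumptions.

```python
from multiprocessing import Process, Queue, Value

def add_shared(total, amount):
    # Shared memory: workers read and write the same memory location.
    with total.get_lock():           # the shared value carries its own lock
        total.value += amount

def add_message(mailbox, amount):
    # Message passing: workers only send messages; no state is shared.
    mailbox.put(amount)

if __name__ == "__main__":
    total = Value("i", 0)            # an integer living in shared memory
    workers = [Process(target=add_shared, args=(total, n)) for n in (1, 2, 3)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("shared memory total:", total.value)        # 6

    mailbox = Queue()
    workers = [Process(target=add_message, args=(mailbox, n)) for n in (1, 2, 3)]
    for w in workers:
        w.start()
    results = [mailbox.get() for _ in workers]         # receive one message per worker
    for w in workers:
        w.join()
    print("message passing total:", sum(results))      # 6
```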

Some common design patterns in use today include:

  • master-slaves: one machine allocates work to several other machines and supervises their work

  • producer-consumer: a set of producer machines produce an output that becomes the input for a set of consumer machines and this workflow continues

  • shared work queues: jobs to be processed are placed in a common work queue, and workers pull jobs off the queue one at a time as they become free (see the sketch after this list)
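The shared work queue pattern can be sketched as follows; the squaring "job", the three workers, and the None sentinel used to signal "no more work" are all illustrative choices, not part of the text above.

```python
from multiprocessing import Process, Queue

def worker(jobs, results):
    """Pull jobs off the shared queue until a sentinel (None) arrives."""
    while True:
        job = jobs.get()
        if job is None:              # sentinel: no more work for this worker
            break
        results.put(job * job)       # "process" the job (squaring stands in for real work)

if __name__ == "__main__":
    jobs, results = Queue(), Queue()
    workers = [Process(target=worker, args=(jobs, results)) for _ in range(3)]
    for w in workers:
        w.start()

    for job in range(10):            # more work units than workers
        jobs.put(job)
    for _ in workers:                # one sentinel per worker so every worker stops
        jobs.put(None)

    collected = [results.get() for _ in range(10)]     # aggregate partial results
    for w in workers:
        w.join()
    print(sorted(collected))
```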

Big Ideas

  • Scale OUT, not UP: instead of investing in a single high-end machine, invest in many commodity machines and process the workload using Divide and Conquer

  • Move processing to the data: instead of moving large quantities of data across the network, move code to where the appropriate data is located; this will reduce network costs and latency

  • Process data sequentially; avoid random access: disk seeks are expensive, but sequential disk throughput is reasonable (see the sketch after this list)

  • Seamless scalability: ideally, the same program should work unchanged whether the dataset and the cluster are small or large
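The sequential-versus-random-access point can be checked with the small experiment below; the 64 MiB temporary file and 4 KiB block size are arbitrary choices, and the gap it shows depends heavily on hardware (dramatic on a spinning disk, much smaller on an SSD or when the file fits in the page cache).

```python
import os
import random
import tempfile
import time

BLOCK = 4096
BLOCKS = 16 * 1024                   # 16K blocks of 4 KiB = 64 MiB

# Create a throwaway file so the comparison is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(BLOCK * BLOCKS))
    path = f.name

start = time.perf_counter()
with open(path, "rb") as f:
    while f.read(BLOCK):             # sequential scan, block by block
        pass
print("sequential scan:", time.perf_counter() - start)

start = time.perf_counter()
with open(path, "rb") as f:
    for _ in range(BLOCKS):
        f.seek(random.randrange(BLOCKS) * BLOCK)   # jump to a random block
        f.read(BLOCK)
print("random access: ", time.perf_counter() - start)

os.remove(path)
```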

The need for a technique to process big data gave rise to Hadoop.

The next chapter gives an overview of Hadoop.
