Data Types in Hadoop
While programming for a distributed system, we cannot use standard data types. This is because they do not know how to read from/write to the disk i.e. they are not serializable.
Hadoop provides Writable interface based data types for serialization and de-serialization of data storage in HDFS and MapReduce computations.
Serialization is the process of converting object data into byte stream data for transmission over a network across different nodes in a cluster or for persistent data storage.
De-serialization is the reverse process of serialization and converts byte stream data into object data for reading data from HDFS.
All types in Hadoop must implement the Writable interface. Hadoop provides Writable wrappers for almost all Java primitive types and some other types. However, we might sometimes need to create our own wrappers for custom objects.
All the Writable wrapper classes have a get() and a set() method for retrieving and storing the wrapped value.
Hadoop also provides another interface called WritableComparable. The WritableComparable interface is a sub-interface of Hadoop’s Writable and Java’s Comparable interfaces.
Note that the Writable and WritableComparable interfaces are provided in the org.apache.hadoop.io package.
As we know, data flows from mappers to reducers in the form of (key, value) pairs. It is important to note that any data type used for the key must implement the WritableComparable interface along with Writable interface to compare the keys of this type with each other for sorting purposes, and any data type used for the value must implement the Writable interface.
Writable Classes
Primitive Writable Classes
These are Writable wrappers for Java primitive data types and they hold a single primitive value. Below is the list of primitive writable data types available in Hadoop:
BooleanWritable
ByteWritable
IntWritable
VIntWritable
FloatWritable
LongWritable
VLongWritable
DoubleWritable
Note that the serialized sizes of the above primitive Writable data types are same as the size of the actual Java data types.
Array Writable Classes
Hadoop provides two types of array Writable classes: one for single-dimensional and another for two-dimensional arrays:
ArrayWritable
TwoDArrayWritable
The elements of these arrays must be other Writable objects like IntWritable or FLoatWritable only; not the Java native data types like int or float.
Map Writable Classes
Hadoop provides the following MapWritable data types which implement the java.util.Map interface:
AbstractMapWritable: this is the abstract or base class for other MapWritable classes
MapWritable: this is a general purpose map, mapping Writable keys to Writable values
SortedMapWritable: this is a specialization of the MapWritable class that also implements the SortedMap interface
Other Writable Classes
NullWritable: it is a special type of Writable representing a null value. No bytes are read or written when a data type is specified as NullWritable. So, in Mapreduce, a key or a value can be declared as a NullWritable when we don’t need to use that field
ObjectWritable: this is a general-purpose generic object wrapper which can store any objects like Java primitives, String, Enum, Writable, null, or arrays
Text: it can be used as the Writable equivalent of java.lang.String and its max size is 2 GB. Unlike Java’s String data type, Text is mutable in Hadoop
BytesWritable: it is a wrapper for an array of binary data
GenericWritable: it is similar to ObjectWritable but supports only a few types. The user needs to subclass this GenericWritable class and needs to specify the types to support
How to Create Complex Data Types in Hadoop?
As discussed earlier, Hadoop already has Writable wrappers for most primitive Java data types. However, if we need to use a data type for which Hadoop doesn't already have a Writable wrapper, we can do one of the following:
Encode it as Text, Ex. (a, b) = “a:b”
Use regular expressions to parse and extract data
Define a custom implementation of WritableComparable and define its methods (readFields(), write(), compareTo()), etc.
The next section gives an overview of programming in Hadoop.
Last updated