MapReduce

MapReduce in NCache allows developers to write programs that process huge amounts of unstructured data in parallel across an NCache cluster. To distribute input data and analyze it in parallel, MapReduce operates in parallel on all nodes in a cluster of any size. The term “MapReduce” refers to two distinct phases. The first phase is ‘Map’ phase, which takes a set of data and converts it into another set of data, where individual items are broken down into key-value pairs. The second phase is ‘Reduce’ phase, which takes output from ‘Map’ as an input and reduces that data set into a smaller and more meaningful data set.

MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on an NCache cluster. A user defined Map function processes a key-value pair to generate a set of intermediate key-value pairs. Reduce function processes all those intermediate key-value pairs (having same intermediate key) to aggregate, perform calculations or any other operation on the pairs. Another optional component, called Combiner, performs merging of the intermediate key-value pairs generated by Mapper before these key-value pairs can be sent over to the Reducer.

How does MapReduce Work?

Generally, MapReduce consists of two (sometimes three) phases: i.e. Mapping, Combining (optional) and Reducing.

1. Mapping phase: filters and prepares the input for the next phase that may be Combining or Reducing.

2. Reduction phase: takes care of the aggregation and compilation of the final result.

3. Combining phase: responsible for reduction local to the node, before sending the input to the Reducers. Combine phase optimizes performance as it minimizes the network traffic between Mapper and Reducers by sending the output to the Reducer in chunks.

Similarly, NCache MapReduce has three phases: Map, Combine, Reduce. Only the Mapper is necessary to implement – Reducer implementation is optional.

NCache MapReduce will execute its default reduction during the task if the Reducer is not implemented by the user.

The Mapper, Combiner and Reducer are executed simultaneously during a NCache MapReduce task on the NCache cluster. Mapper output is individually sent to the Combiner. When Combiner’s output reaches the specified chunk size, it is then sent to the Reducer, which finalizes and persists the output.

In order to monitor the submitted task, a trackable object is provided to the user.

A typical MapReduce task has the following components:

Mapper: Processes input into key-value pairs to generate a set of intermediate key-value pairs to be sent for further refining and extraction of the data.

Combiner Factory: Assigns a unique Combiner to each key provided to it. User can implement it to provide different combiners for different keys.

Combiner: Works as local reducer to the node where Mapper’s output has been combined to minimize traffic between Mapper and Reducer. The tasks are processed and stored in bulk before being sent to the Reducer, meaning the data from Mapper is processed in chunks, the size of which is configurable. Once the combined results reach the specified chunk size, the elements are forwarded to the Reducer.

Reducer Factory: Assigns a unique Reducer to each key provided to it. User can implement it to provide different reducers for different keys.

Reducer: Processes all those intermediate key-value pairs generated by Mapper or combined by Combiner to aggregate, perform calculations or apply different operations to produce the reduced output.

Key Filter: Key Filter, as the name indicates, allows the user to filter cache data based on its keys before being provided to the Mapper. The KeyFilter is called during Mapper’s execution. If it returns true, the Map will be executed on the key. If it returns false, Mapper will skip the key and move to next one from the Cache.

TrackerTask: This component lets you keep track of the progress of the task and its status as the task is executed. And lets you fetch the output of the task and enumerate it.

Output: The output is stored in-memory, on the server side. It can be enumerated using the TrackableTask instance on the client application.

In This Section

Sample Implementation of Interfaces
Explains, with samples, how to implement the required interfaces for MapReduce.

Sample Usage of MapReduce
Explains the steps to execute the MapReduce task.