MapReduce

You can configure MapReduce to process and generate large data sets with a parallel, distributed algorithm on a cluster. MapReduce distributes the input data across the nodes of a cluster of any size and analyzes it on all of them in parallel. The term “MapReduce” refers to two distinct phases. The first is the ‘Map’ phase, which takes a set of data and converts it into another set of data in which individual items are broken down into key-value pairs. The second is the ‘Reduce’ phase, which takes the output of ‘Map’ as its input and reduces it to a smaller and more meaningful data set.
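The two phases can be illustrated with a small word count. This is a single-process sketch of the idea only, not the NCache API: the function names `map_phase` and `reduce_phase` and the sample input are hypothetical, and nothing here is distributed across nodes.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: break each input line into (word, 1) key-value pairs."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: collapse all pairs that share a key into one total per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Hypothetical input; in a cluster each node would map its own share of the data.
lines = ["the cache is fast", "the cluster is large"]
word_counts = reduce_phase(map_phase(lines))
print(word_counts)
```

The Map output is larger than the input (one pair per word), while the Reduce output is the smaller, summarized data set.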

For a more detailed explanation of MapReduce, see the MapReduce section.

For NCache Professional Edition, you can configure MapReduce using Windows PowerShell.

The Add-MapReduce cmdlet configures MapReduce tasks for processing and generating large data sets with a parallel, distributed algorithm on a clustered cache.

The following example configures MapReduce execution on demoCache with a maximum of 10 tasks executed in parallel, chunks of 100 elements each, a queue of up to 30 waiting tasks, and a maximum of 10 avoidable exceptions.

Add-MapReduce demoCache -MaxTasks 10 -ChunkSize 100 -QueueSize 30 -MaxExceptions 10

The configuration tags that correspond to these settings are explained below.

max-tasks: To provide maximum scalability, a MapReduce job performed on a data set is divided into smaller tasks that can be executed simultaneously. max-tasks is the maximum number of MapReduce tasks executed simultaneously. You can change it according to your requirements. The default value is 10.
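The effect of a cap on simultaneous tasks can be sketched with a fixed-size worker pool. This is an analogy under stated assumptions, not how NCache schedules tasks internally: `MAX_TASKS`, `subtask`, and the `state` counter are hypothetical names used only to show that no more than the configured number of tasks run at once.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_TASKS = 3  # plays the role of max-tasks (hypothetical value)
state = {"active": 0, "peak": 0}
lock = threading.Lock()

def subtask(n):
    """Simulate one small unit of work and record how many run at once."""
    with lock:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
    time.sleep(0.01)  # stand-in for real processing
    with lock:
        state["active"] -= 1
    return n * n

# The pool never runs more than MAX_TASKS subtasks at the same time.
with ThreadPoolExecutor(max_workers=MAX_TASKS) as pool:
    squares = list(pool.map(subtask, range(10)))

print(squares, "peak concurrency:", state["peak"])
```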

chunk-size: Output is processed and stored in bulk before being sent to the Reducer, meaning that data from the Mapper is handled in chunks whose size is configurable. chunk-size is the number of elements processed in the Mapper or Combiner before they are transmitted to the Combiner or Reducer. When the Combiner’s output reaches the specified chunk size, it is sent to the Reducer, which finalizes and persists the output. The default value for chunk-size is 100.
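Chunked hand-off can be sketched as a buffer that is flushed whenever it reaches the chunk size. The helper `send_in_chunks` and the `send` callback are hypothetical, and flushing the final, partially filled chunk at the end is an assumption made for this illustration rather than documented NCache behavior.

```python
def send_in_chunks(mapper_output, chunk_size, send):
    """Buffer pairs and call send() each time chunk_size items accumulate."""
    buffer = []
    for pair in mapper_output:
        buffer.append(pair)
        if len(buffer) == chunk_size:
            send(buffer)
            buffer = []
    if buffer:  # assumption: the last partial chunk is still delivered
        send(buffer)

received = []  # stands in for the Combiner or Reducer input
pairs = [(f"key{i}", i) for i in range(250)]
send_in_chunks(pairs, chunk_size=100, send=received.append)
chunk_lengths = [len(chunk) for chunk in received]
print(chunk_lengths)
```

A larger chunk size means fewer, bigger transfers; a smaller one means the next stage starts receiving data sooner.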

queue-size: Tasks that cannot be executed immediately are lined up in a queue, where they wait for earlier tasks to finish. queue-size is the maximum number of tasks that can wait in the queue while other tasks are being processed. The default value is 10.
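A bounded queue illustrates what a limit on waiting tasks means. The sketch below assumes that a submission is rejected when the queue is full; the text above does not say what NCache does in that case, so treat the rejection behavior and the `submit` helper as hypothetical.

```python
import queue

def submit(task_queue, task):
    """Try to enqueue a task; return False if the queue is already full."""
    try:
        task_queue.put_nowait(task)
        return True
    except queue.Full:
        return False

task_queue = queue.Queue(maxsize=3)  # plays the role of queue-size
results = [submit(task_queue, f"task-{i}") for i in range(5)]
print(results, "waiting:", task_queue.qsize())
```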

communicate-stats: Specifies whether MapReduce tasks communicate statistics internally. It is set to false by default.

max-avoidable-exceptions: If you expect exceptions to be thrown during task execution, you can specify the number of exceptions from your code that are tolerated. Once that number is exceeded, the task fails and the failure is logged in the cache error log. The default value is 10.
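The tolerance described above can be sketched as an error counter with a threshold. The `run_task` helper is hypothetical, and failing the task only when the count exceeds the limit (rather than when it reaches it) is an assumption of this sketch.

```python
def run_task(items, process, max_avoidable_exceptions):
    """Process items, tolerating up to max_avoidable_exceptions failures."""
    results, errors = [], []
    for item in items:
        try:
            results.append(process(item))
        except Exception as exc:
            errors.append(exc)
            # Assumption: the task fails once the limit is exceeded.
            if len(errors) > max_avoidable_exceptions:
                return "failed", results, errors
    return "completed", results, errors

def parse(value):
    return int(value)  # raises ValueError on non-numeric input

status_ok, parsed, _ = run_task(["1", "x", "3"], parse, max_avoidable_exceptions=1)
status_bad, _, errs = run_task(["a", "b", "3"], parse, max_avoidable_exceptions=1)
print(status_ok, parsed, status_bad, len(errs))
```

With a limit of 1, a single bad item is skipped and the task completes, while a second bad item fails the task.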
