Chunk size affects parallel processing and random disk I/O during MapReduce jobs. A higher chunk size means less parallel processing because there are fewer map inputs, and therefore fewer mappers. A lower chunk size improves parallelism, but results in higher random disk I/O during shuffle because there are more map outputs. Set the
io.sort.mb parameter to a value between 120% and 150% of the chunk size.
Here are general guidelines for chunk size:
- For most purposes, set the chunk size to the default 256 MB and set the value of the
io.sort.mbparameter to the default 380 MB.
- On very small clusters or nodes with not much RAM, set the chunk size to 128 mb and set the value of the
io.sort.mbparameter to 190 MB.
- If application-level compression is in use, the
io.sort.mbparameter should be at least 380MB.
For more information about the chunk size and how to configure it, see Chunk Size.