The basic idea of parallel data processing is to split large jobs into smaller tasks that are processed simultaneously by multiple servers. The concept is intuitive, but there is a range of different models of parallel data processing. Furthermore, seemingly small differences in the ratio of queues to servers, or in the assumptions about the arrival and service processes, can lead to radically different scaling behavior.
The current dominance of the MapReduce architecture and its many implementations leads us to explore the basic behavior and scalability of these systems. Abstract models of parallel systems form a spectrum: from the Split-Merge model, which has very unfavorable scaling characteristics, through the popular and much-analyzed Fork-Join model, to Non-Idling Single-Queue models, which are more representative of the MapReduce Task Manager and which realize very good gains from load balancing. Where on this spectrum, from Split-Merge to Non-Idling Single-Queue, a given MapReduce system falls depends on the implementation of the MapReduce program.

In the ideal case, the use of k parallel servers would speed up the processing of any job by a factor of k, but this ideal is difficult to achieve both in theory and in practice. The key property that prevents parallel systems from exhibiting this optimal scaling behavior is an uneven division of work into individual tasks. In most models, a job is not complete until all of its tasks have completed. As a result, a disproportionately large task, a "straggler", can dominate the sojourn time of its job; likewise, a task stuck in a queue behind a straggler from another job will dominate the sojourn time of its own job. If all tasks had identical service times, a parallel system would behave like a conventional queue. To strike a balance between tractability and triviality, analytical models usually assume that task service times are independent random variables drawn from a common distribution.

We therefore propose a research agenda, both empirical and theoretical, that explores how well the different models of parallel data processing, and the division of jobs into tasks, match real MapReduce systems. We intend to use the findings of these studies to design new schemes for the MapReduce Task Manager and Resource Manager.
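The straggler effect can be made concrete with a small simulation, under the common (and here purely illustrative) assumption that task service times are i.i.d. Exp(1) draws. A job split into k parallel tasks then finishes at the maximum of k such draws, whose mean is the harmonic number H_k = 1 + 1/2 + ... + 1/k, which grows like ln k, so the job's sojourn time increases with k even though each individual task's mean service time stays fixed at 1.

```python
import random

def mean_job_time(k, n_jobs=100_000, seed=0):
    """Estimate the mean sojourn time of a job split into k parallel tasks.

    Illustrative model only: each task's service time is an independent
    Exp(1) draw, and the job finishes when its slowest task (the
    "straggler") finishes, i.e. at the maximum of the k draws.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_jobs):
        total += max(rng.expovariate(1.0) for _ in range(k))
    return total / n_jobs

# Compare the simulated mean against the theoretical value H_k.
for k in (1, 4, 16, 64):
    h_k = sum(1.0 / i for i in range(1, k + 1))
    print(f"k={k:3d}  simulated={mean_job_time(k):.2f}  H_k={h_k:.2f}")
```

With identical (deterministic) service times the maximum equals the common value and the straggler penalty vanishes, matching the observation above that such a system behaves like a conventional queue.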