Authors: Shah Pratik Prakash, Pattabiraman V.
Abstract: Data of every kind, whether structured, unstructured, or semi-structured, is generated in large quantities around the globe across many domains. These datasets are stored on multiple nodes in a cluster. The MapReduce framework has emerged as the most efficient and easy-to-use technique for parallel processing of distributed data. This paper proposes a new methodology for the MapReduce workflow that processes raw data so that less processing time is needed to generate the required result. The methodology stores the intermediate data generated between the map and reduce phases and reuses it as input to MapReduce. It focuses on improving the data reusability, scalability, and efficiency of the MapReduce framework for large-scale data analysis. The experimental work uses MongoDB 2.4.2 to demonstrate how intermediate data can be stored and reused as part of MapReduce to improve the processing of large datasets.
International Journal of Computers and Communications, E-ISSN: 2074-1294, Volume 16, 2022, Art. #4