Efficient multi-way Theta-join processing using MapReduce
Multi-way Theta-join queries are powerful in describing complex relations and are therefore widely used in practice. However, existing solutions from traditional distributed and parallel databases for multi-way Theta-join queries cannot be easily extended to fit a shared-nothing distributed computing paradigm, which has been shown to support OLAP applications over immense data volumes. In this work, we study the problem of efficiently processing multi-way Theta-join queries using MapReduce from a cost-effective perspective. Although some prior work has used the (key, value) pair-based programming model to support join operations, efficient processing of multi-way Theta-join queries has never been fully explored. The main challenge is, given a number of processing units (each of which can run Map or Reduce tasks), to map a multi-way Theta-join query to a number of MapReduce jobs and execute them in a well-scheduled sequence such that the total processing time span is minimized. Our solution has two main parts: 1) cost metrics for both a single MapReduce job and a number of MapReduce jobs executed in a certain order; 2) the efficient execution of a chain-typed Theta-join with only one MapReduce job. Compared with the query evaluation strategy proposed in earlier work and the widely adopted Pig Latin and Hive SQL solutions, our method achieves a significant improvement in join processing efficiency. © 2012 VLDB Endowment.
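To make the idea of evaluating a Theta-join within one MapReduce job more concrete, the following is a minimal in-memory sketch in Python. It is not the algorithm of this paper: it illustrates only a two-way Theta-join using a simple grid partitioning of the cross product, and it does not model the cost metrics or the job scheduling described in the abstract. All names (`map_phase`, `shuffle`, `reduce_phase`, `theta_join`, `n_row_bands`, `n_col_bands`) are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from itertools import product


def map_phase(r_rows, s_rows, n_row_bands, n_col_bands):
    """Emit (reducer_cell, tagged_tuple) pairs.

    Assumption: reducers form an n_row_bands x n_col_bands grid over the
    cross product R x S. Each R tuple is sent to every cell of one row
    band, each S tuple to every cell of one column band.
    """
    for i, r in enumerate(r_rows):
        band = i % n_row_bands
        for col in range(n_col_bands):
            yield (band, col), ("R", r)
    for j, s in enumerate(s_rows):
        band = j % n_col_bands
        for row in range(n_row_bands):
            yield (row, band), ("S", s)


def shuffle(pairs):
    """Group emitted values by reducer cell, as the shuffle step would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, theta):
    """Run a local nested-loop join in each cell, keeping pairs that satisfy theta."""
    for values in groups.values():
        r_side = [t for tag, t in values if tag == "R"]
        s_side = [t for tag, t in values if tag == "S"]
        for r, s in product(r_side, s_side):
            if theta(r, s):
                yield r, s


def theta_join(r_rows, s_rows, theta, n_row_bands=2, n_col_bands=2):
    """Simulate one map-shuffle-reduce round and return the sorted join result."""
    pairs = map_phase(r_rows, s_rows, n_row_bands, n_col_bands)
    return sorted(reduce_phase(shuffle(pairs), theta))
```

For example, `theta_join([1, 5, 9], [4, 6], lambda r, s: r < s)` evaluates the inequality join `R.a < S.b`. Because every R tuple belongs to exactly one row band and every S tuple to exactly one column band, each candidate pair meets in exactly one cell, so no result is produced twice. The grid shape trades replication of input tuples against the number of reducers that can work in parallel.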
Proceedings of the VLDB Endowment
Zhang, X., Chen, L., & Wang, M. (2012). Efficient multi-way Theta-join processing using MapReduce. Proceedings of the VLDB Endowment, 5(11), 1184-1195. https://doi.org/10.14778/2350229.2350238