Efficient join query processing on the cloud


The join query is one of the most expressive and expensive data analytic tools in traditional database systems. With the exponential growth of data collections, NoSQL data storage has risen as the prevailing solution for big data. However, without the strong support of heavy indexing, the join operator becomes even more crucial and challenging for querying or mining massive data. As reported by Facebook [1] and Google [2], the underlying data volume is in the hundreds of terabytes or even petabytes. In such scenarios, solutions from traditional distributed or parallel databases are infeasible because of unsatisfactory scalability and poor fault tolerance. There have been intensive studies on different types of join operations over distributed data, for example, similarity join, set join, and fuzzy join, all of which focus on efficient join query evaluation by exploiting the massive parallelism of the MapReduce computing framework on the cloud platform. In this chapter, we explore the efficient processing of multiway generalized join queries, namely, the “complex join,” which are widely employed in various practical data analytic scenarios, such as querying Resource Description Framework (RDF) data and feature selection from biochemical data. The substantial challenge of complex join lies in, given a number of processing units, mapping a complex join query to a number of parallel tasks and scheduling their execution so that the total processing time span is minimized. We focus on the evaluation of complex join queries on the cloud platform and elaborate with case studies on efficient SPARQL Protocol and RDF Query Language (SPARQL) query processing and multiway theta-join evaluation.

Publication Title

Cloud Computing and Digital Media: Fundamentals, Techniques, and Applications