Importance of data distribution on hive-based systems for query performance: An experimental study

Hilmi Egemen Ciritoglu, John Murphy, Christina Thorpe

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

SQL-on-Hadoop systems have been gaining popularity in recent years. One popular example is Apache Hive, the pioneer of SQL-on-Hadoop systems. Hive sits at the top of the big data stack as an application layer. Besides the application layer, the Hadoop ecosystem comprises three main layers: storage, the resource manager, and the processing engine. Demand from industry has led to the development of new, efficient components for each layer. As the ecosystem has evolved over time, Hive has adopted different execution engines as well. Understanding the strengths of each component is essential to exploiting the full performance of the Hadoop ecosystem. Recent works in the literature therefore study the importance of each layer separately. To the best of our knowledge, the present work is the first to focus on the combined performance of the storage layer and the execution engine. In this work, we compare Hive's query performance using three execution engines (MR, Tez, and Spark) on skewed and well-balanced data distributions across the full TPC-H benchmark. Our results show the importance of data distribution in the storage layer for the overall job performance of SQL-on-Hadoop systems, and show empirically that an even distribution improves performance by up to 48% compared to a skewed distribution. Moreover, the present study provides insightful findings by identifying particular SQL query cases that a given processing engine handles exceptionally well.

Original language: English
Title of host publication: Proceedings - 2020 IEEE International Conference on Big Data and Smart Computing, BigComp 2020
Editors: Wookey Lee, Luonan Chen, Yang-Sae Moon, Julien Bourgeois, Mehdi Bennis, Yu-Feng Li, Young-Guk Ha, Hyuk-Yoon Kwon, Alfredo Cuzzocrea
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 370-376
Number of pages: 7
ISBN (Electronic): 9781728160344
Publication status: Published - Feb 2020
Event: 2020 IEEE International Conference on Big Data and Smart Computing, BigComp 2020 - Busan, Korea, Republic of
Duration: 19 Feb 2020 - 22 Feb 2020

Publication series

Name: Proceedings - 2020 IEEE International Conference on Big Data and Smart Computing, BigComp 2020

Conference

Conference: 2020 IEEE International Conference on Big Data and Smart Computing, BigComp 2020
Country/Territory: Korea, Republic of
City: Busan
Period: 19/02/20 - 22/02/20

Keywords

  • Data distribution
  • Hadoop
  • HDFS
  • Software Performance
  • SQL-on-Hadoop
