TY - GEN
T1 - Towards a Better Replica Management for Hadoop Distributed File System
AU - Ciritoglu, Hilmi Egemen
AU - Saber, Takfarinas
AU - Buda, Teodora Sandra
AU - Murphy, John
AU - Thorpe, Christina
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2018/9/7
Y1 - 2018/9/7
AB - The Hadoop Distributed File System (HDFS) is the storage of choice when it comes to large-scale distributed systems. In addition to being efficient and scalable, HDFS provides high throughput and reliability through the replication of data. Recent work exploits this replication feature by dynamically varying the replication factor of in-demand data as a means of increasing data locality and achieving a performance improvement. However, to the best of our knowledge, no study has been performed on the consequences of varying the replication factor. In particular, our work is the first to show that although HDFS deals well with increasing the replication factor, it experiences problems with decreasing it. This leads to unbalanced data, hot spots, and performance degradation. In order to address this problem, we propose a new workload-aware balanced replica deletion algorithm. We also show that our algorithm successfully maintains the data balance and achieves up to 48% improvement in execution time when compared to HDFS, while only creating an overhead of 1.69% on average.
KW - Hadoop Distributed File System
KW - Replication Factor
KW - Software Performance
UR - https://www.scopus.com/pages/publications/85057754965
U2 - 10.1109/BigDataCongress.2018.00021
DO - 10.1109/BigDataCongress.2018.00021
M3 - Conference contribution
AN - SCOPUS:85057754965
T3 - Proceedings - 2018 IEEE International Congress on Big Data, BigData Congress 2018 - Part of the 2018 IEEE World Congress on Services
SP - 104
EP - 111
BT - Proceedings - 2018 IEEE International Congress on Big Data, BigData Congress 2018 - Part of the 2018 IEEE World Congress on Services
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 7th IEEE International Congress on Big Data, BigData Congress 2018 Part of the 2018 IEEE World Congress on Services
Y2 - 2 July 2018 through 7 July 2018
ER -