Data harmonization for heterogeneous datasets: A systematic literature review

  • Ganesh Kumar
  • Shuib Basri
  • Abdullahi Abubakar Imam
  • Sunder Ali Khowaja
  • Luiz Fernando Capretz
  • Abdullateef Oluwagbemiga Balogun

Research output: Contribution to journal › Review article › peer-review

Abstract

As data volumes grow, so does their variety. Analyzing such heterogeneous data is one of the most challenging tasks in information management and data analytics. The heterogeneity and decentralization of data sources affect data visualization and prediction, and therefore the analytical results. Data harmonization (DH) is the field concerned with unifying the representation of such disparate data. Over the years, multiple solutions have been developed to reduce heterogeneity and the disparity in formats of big-data types. In this study, a systematic review of the literature was conducted to assess state-of-the-art DH techniques. The study aimed to understand the issues caused by heterogeneity, the need for DH, and the techniques that deal with substantial heterogeneous textual datasets. The search produced 1355 articles, of which only 70 were found to be relevant after applying inclusion and exclusion criteria. The results show that the heterogeneity of structured, semi-structured, and unstructured (SSU) data can be managed by using DH and its core techniques, such as text preprocessing, natural language processing (NLP), machine learning (ML), and deep learning (DL). These techniques are applied in many real-world applications centered on the information-retrieval domain. Several assessment criteria were used to measure the performance of these techniques, such as precision, recall, F1 score, accuracy, and time. Each research question, the common techniques, and the performance measures are discussed in detail. Lastly, we discuss the existing work, contributions, and managerial and academic implications, and present the conclusion, limitations, and future research directions.
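For readers unfamiliar with the assessment criteria named in the abstract, the following is a minimal sketch of how precision, recall, F1 score, and accuracy are commonly computed for a binary task such as deciding whether a retrieved record is relevant. The `Metrics` container, the `evaluate` function, and the sample labels are illustrative assumptions and are not taken from the reviewed studies.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    precision: float
    recall: float
    f1: float
    accuracy: float


def evaluate(predicted: list[bool], actual: list[bool]) -> Metrics:
    """Compute four common scores from parallel lists of binary labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one label is required")

    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)            # true positives
    fp = sum(p and not a for p, a in pairs)        # false positives
    fn = sum(not p and a for p, a in pairs)        # false negatives
    tn = sum(not p and not a for p, a in pairs)    # true negatives

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return Metrics(precision, recall, f1, accuracy)


# Hypothetical labels: True means "relevant".
predicted = [True, True, True, False, False, False, False, False]
actual = [True, True, False, True, True, False, False, False]
scores = evaluate(predicted, actual)
print(scores)
```

With these sample labels there are 2 true positives, 1 false positive, 2 false negatives, and 3 true negatives, so precision is 2/3, recall is 1/2, F1 is 4/7, and accuracy is 5/8. Processing time, the remaining criterion, would be measured separately around the technique under test.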

Original language: English
Article number: 8275
Journal: Applied Sciences (Switzerland)
Volume: 11
Issue number: 17
Publication status: Published - Sep 2021
Externally published: Yes

Keywords

  • Data harmonization
  • Heterogeneous data
  • Text preprocessing
