Cross domain assessment of document to html conversion tools to quantify text and structural loss during document analysis

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

During forensic text analysis, automation is key when working with large quantities of documents. Because documents come in a wide variety of file types, tailored tools must be developed for each document type to correctly identify and extract text elements for analysis without loss. These text extraction tools often omit sections of text that they cannot read, leaving drastic inconsistencies in the forensic text analysis process. As a solution, a single output format, HTML, was chosen as a unified analysis format. Document-to-HTML/CSS extraction tools, each using different techniques to convert common document formats to rich HTML/CSS counterparts, were tested. This approach can reduce the number of analysis tools needed during forensic text analysis by relying on a single document format. Two tests were designed, a 10-point document overview test and a 48-point detailed document analysis test, to assess and quantify the level of loss, the rate of error, and the overall quality of the output HTML structures. The study concluded that tools that combine several approaches and have an understanding of the document structure yield the best results with the least loss.
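The idea of quantifying text loss after conversion to HTML can be illustrated with a minimal sketch. This is not the paper's 10-point or 48-point test, whose criteria are not given in the abstract. It is a simple, assumed word-retention measure, and the names `TextExtractor`, `html_to_words`, and `text_loss` are hypothetical. It compares a known reference text against the visible text recovered from a tool's HTML output.

```python
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_words(html):
    """Return the lowercased, whitespace-separated words visible in the HTML."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks).lower().split()


def text_loss(reference_text, html):
    """Fraction of reference words (counted with multiplicity) missing from the HTML.

    Assumption: loss is measured on a bag of words, so word order and
    structural fidelity are ignored entirely.
    """
    ref = Counter(reference_text.lower().split())
    out = Counter(html_to_words(html))
    total = sum(ref.values())
    if total == 0:
        return 0.0
    kept = sum((ref & out).values())  # multiset intersection
    return 1.0 - kept / total
```

A score of 0.0 means every reference word was found in the converted output, and 1.0 means none were. A measure like this only covers text loss. Structural loss, such as dropped headings, tables, or styling, would need a separate check against the HTML element tree.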

Original language: English
Title of host publication: Proceedings - 2013 European Intelligence and Security Informatics Conference, EISIC 2013
Pages: 100-105
Number of pages: 6
DOIs
Publication status: Published - 2013
Event: 2013 4th European Intelligence and Security Informatics Conference, EISIC 2013 - Uppsala, Sweden
Duration: 12 Aug 2013 to 14 Aug 2013

Publication series

Name: Proceedings - 2013 European Intelligence and Security Informatics Conference, EISIC 2013

Conference

Conference: 2013 4th European Intelligence and Security Informatics Conference, EISIC 2013
Country/Territory: Sweden
City: Uppsala
Period: 12/08/13 to 14/08/13
