The amount of data available on theWeb has grown significantly in the past years, increasing thus the need for efficient techniques able to retrieve information from data in order to discover valuable and relevant knowledge. In the last decade, the intersection of the Information Extraction and Semantic Web areas is providing new opportunities for improving ontology-based information extraction tools. However, one of the critical aspects in the development and evaluation of this type of system is the limited availability of existing annotated documents, especially in domains such as the historical one. In this paper we present the current state of affairs about our work in building a large and real-world RDF dataset with the purpose to support the development of Ontology-Based extraction tools. The presented dataset is the result of the efforts made within the ARKIVO project and it counts about 300 thousand triples, which are the outcome of the manually annotation process executed by domain experts. ARKIVO dataset is freely available and it can be used as a benchmark for the evaluation of systems that automatically annotate and extract entities from documents.
ARKIVO Dataset: A Benchmark for Ontology-based Extraction Tools / Pandolfo, L.; Pulina, L.. - 2021-:(2021), pp. 341-345. (Intervento presentato al convegno 17th International Conference on Web Information Systems and Technologies, WEBIST 2021 nel 2021).
ARKIVO Dataset: A Benchmark for Ontology-based Extraction Tools
Pandolfo L.;Pulina L.
2021-01-01
Abstract
The amount of data available on theWeb has grown significantly in the past years, increasing thus the need for efficient techniques able to retrieve information from data in order to discover valuable and relevant knowledge. In the last decade, the intersection of the Information Extraction and Semantic Web areas is providing new opportunities for improving ontology-based information extraction tools. However, one of the critical aspects in the development and evaluation of this type of system is the limited availability of existing annotated documents, especially in domains such as the historical one. In this paper we present the current state of affairs about our work in building a large and real-world RDF dataset with the purpose to support the development of Ontology-Based extraction tools. The presented dataset is the result of the efforts made within the ARKIVO project and it counts about 300 thousand triples, which are the outcome of the manually annotation process executed by domain experts. ARKIVO dataset is freely available and it can be used as a benchmark for the evaluation of systems that automatically annotate and extract entities from documents.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.