IOTA: Interlinking of heterogeneous multilingual open fiscal DaTA


Bibliographic Details
Published in Expert Systems with Applications, Vol. 147, p. 113135
Main Authors Musyaffa, Fathoni A.; Vidal, Maria-Esther; Orlandi, Fabrizio; Lehmann, Jens; Jabeen, Hajira
Format Journal Article
Language English
Published New York Elsevier Ltd 01.06.2020
Elsevier BV
Summary:
• IOTA is designed as a scalable framework to interlink translated fiscal concepts.
• Nineteen similarity measures are evaluated within IOTA.
• Token Sort yields the highest F1 score but is not robust to threshold changes.
• TF-IDF achieves a good and robust F1 score but is computationally expensive.
• Results remain highly and positively correlated even when the translation pairs are changed.

Open budget data are among the most frequently published datasets of the open data ecosystem, intended to improve public administrations and government transparency. Unfortunately, the prospects of analysis across different open budget datasets remain limited due to schematic and linguistic differences. Budget and spending datasets are published together with descriptive classifications, and public administrations typically publish these classifications and concepts in their regional languages. The classifications can be exploited to perform a more in-depth analysis, such as comparing similar items across different, cross-lingual datasets. To enable such analysis, however, a mapping across the multilingual classifications of the datasets is required. In this paper, we present the framework for Interlinking of Heterogeneous Multilingual Open Fiscal DaTA (IOTA). IOTA uses machine translation followed by string similarity measures to map concepts across different datasets. To the best of our knowledge, IOTA is the first framework to offer a scalable implementation of string similarity using distributed computing. The results demonstrate the applicability of the proposed multilingual matching and the scalability of the proposed framework, and provide an in-depth comparison of string similarity measures.
ISSN:0957-4174
1873-6793
DOI:10.1016/j.eswa.2019.113135
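
The matching step described in the summary (string similarity applied to already-translated concept labels, with a threshold deciding which pairs count as links) can be illustrated with a minimal sketch. This is not the authors' implementation: it is a single-machine approximation of a token-sort style measure that uses Python's standard-library difflib.SequenceMatcher as a stand-in for the edit-distance ratio. The function names, the example labels, and the 0.8 threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def token_sort_key(label: str) -> str:
    """Lowercase the label, split on whitespace, sort the tokens, rejoin."""
    return " ".join(sorted(label.lower().split()))


def token_sort_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two labels, ignoring word order.

    Assumption: SequenceMatcher.ratio() stands in for whichever string
    ratio a real token-sort implementation would use.
    """
    return SequenceMatcher(None, token_sort_key(a), token_sort_key(b)).ratio()


def match_concepts(source, target, threshold=0.8):
    """Pair each source label with its best-scoring target label.

    A pair is kept only if its score reaches the (hypothetical) threshold.
    """
    if not target:
        return []
    matches = []
    for s in source:
        best = max(target, key=lambda t: token_sort_similarity(s, t))
        score = token_sort_similarity(s, best)
        if score >= threshold:
            matches.append((s, best, score))
    return matches


if __name__ == "__main__":
    # Hypothetical labels, assumed to be already machine-translated to English.
    source_labels = ["Education and Science", "Public Health Services"]
    target_labels = ["science and education", "health services public", "road maintenance"]
    for s, t, score in match_concepts(source_labels, target_labels):
        print(f"{s!r} -> {t!r} ({score:.2f})")
```

Because the tokens are sorted before comparison, labels that differ only in word order score 1.0. Varying the threshold changes which pairs survive, which is the sensitivity the highlights mention for Token Sort.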