IOTA: Interlinking of heterogeneous multilingual open fiscal DaTA

Bibliographic Details
Published in	Expert systems with applications Vol. 147; p. 113135
Main Authors	Musyaffa, Fathoni A., Vidal, Maria-Esther, Orlandi, Fabrizio, Lehmann, Jens, Jabeen, Hajira
Format	Journal Article
Language	English
Published	New York Elsevier Ltd 01.06.2020 Elsevier BV
Subjects	Budget and spending data; Budgets; Cluster computing; Computer networks; Data interlinking; Datasets; Distributed processing; Machine translation; Mapping; Multilingualism; Open data; Similarity; String similarity measure; Strings; Translated string matching framework
Summary:	• IOTA is designed as a scalable framework to interlink translated fiscal concepts.
• Nineteen similarity measures are evaluated within IOTA.
• Token Sort yields the highest F1 score but is not robust to threshold changes.
• TF-IDF has a good and robust F1 score, but it is computationally expensive.
• Results remain highly and positively correlated even when the translation pairs are changed.

Open budget data are among the most frequently published datasets of the open data ecosystem, intended to improve public administration and government transparency. Unfortunately, the prospects for analysis across different open budget datasets remain limited due to schematic and linguistic differences. Budget and spending datasets are published together with descriptive classifications, and public administrations typically publish these classifications and concepts in their regional languages. The classifications can be exploited to perform more in-depth analysis, such as comparing similar items across different, cross-lingual datasets. To enable such analysis, however, a mapping across the multilingual classifications of the datasets is required. In this paper, we present IOTA, a framework for Interlinking of Heterogeneous Multilingual Open Fiscal DaTA. IOTA uses machine translation followed by string similarity measures to map concepts across different datasets. To the best of our knowledge, IOTA is the first framework to offer a scalable implementation of string similarity using distributed computing. The results demonstrate the applicability of the proposed multilingual matching and the scalability of the proposed framework, and provide an in-depth comparison of string similarity measures.
ISSN:	0957-4174 1873-6793
DOI:	10.1016/j.eswa.2019.113135
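
The summary describes mapping concepts by translating labels and then applying a string similarity measure with a threshold. The following is a minimal sketch of that idea, not the IOTA implementation: it assumes the labels have already been machine-translated into a common language, approximates a token-sort measure with Python's standard `difflib`, and runs on a single machine rather than a cluster. The function names, sample labels, and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def token_sort_similarity(a: str, b: str) -> float:
    """Lowercase, sort the tokens alphabetically, then compare the rejoined strings."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def link_concepts(source, target, threshold=0.8):
    """For each source label, keep its best-scoring target label if the score meets the threshold."""
    links = []
    for s in source:
        best = max(target, key=lambda t: token_sort_similarity(s, t))
        score = token_sort_similarity(s, best)
        if score >= threshold:
            links.append((s, best, score))
    return links


# Hypothetical labels, assumed to be already translated into English.
source_labels = ["Expenditure on public health", "Road maintenance"]
target_labels = ["Public health expenditure", "Maintenance of roads", "Primary education"]

for src, tgt, score in link_concepts(source_labels, target_labels):
    print(f"{src!r} -> {tgt!r} (similarity {score:.2f})")
```

Because the tokens are sorted before comparison, word-order differences introduced by translation do not lower the score. The set of accepted links still depends directly on the chosen threshold, which is the sensitivity the highlights report for Token Sort.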