Jaccard-Like Fuzzy Distances for Computational Linguistics

Bibliographic Details
Published in: 2017 19th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), pp. 196-202
Main Author Franzoi, Laura
Format Conference Proceeding
Language English
Published IEEE 01.09.2017
Summary: Back in 1967 the Croat linguist Ž. Muljačić introduced a fuzzy generalization of crisp Hamming distances between binary strings of length n; he wanted to show that Dalmatic, nowadays extinct, is a bridge between the Western group of Romance languages and the Eastern group, basically Romanian. Each language is described by means of n features F_i which can be present or absent, and so is encoded by a string x_1 ... x_n, where x_i is the truth degree of the proposition "feature F_i is present in the language". However, presence/absence can be ill-defined: consequently, each x_i is rather a truth degree in [0, 1] in a multi-valued logic. It is crisp only when x_i = 0 (false, absent) or x_i = 1 (true, present), and strictly fuzzy otherwise. More recently Longobardi et al. [1], [2] have covered the case when a feature F_i is undefined because it is logically inconsistent with the truth degrees assigned to features F_1, ..., F_{i-1}, or when a feature is irrelevant because it is crisply absent or "almost" absent in both languages. The latter fact requires a Jaccard variant of the original distance. We modify the fuzzy Hamming distance, as in Muljačić's case [3], [4], going to its Jaccard variant, and do the same with fuzzy Hamming distinguishabilities, which are a subtle but meaningful variation of the fuzzy Hamming distance [3]. Using the technical tool of Steinhaus transforms, which serves to obtain the Jaccard-like variant of a given distance, we end up obtaining four metric distances: fuzzy distance and distinguishability without irrelevance, and their corresponding Jaccard variants with both fuzziness and irrelevance. Accordingly, we cluster Muljačić's original data in four ways and comment on the differences; all this paves the way towards gauging jointly ill-defined, irrelevant but also conditionally undefined features, as in [1], [2].
The tools developed here for the first time will be used on up-to-date linguistic data within the activities of the Human Language Technologies Research Center, Bucharest University.
DOI:10.1109/SYNASC.2017.00040
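
The summary names the Steinhaus transform as the tool that turns a given distance into its Jaccard-like variant. As a minimal sketch only: for a metric d and a fixed reference point a, the transform is commonly written as 2 d(x, y) / (d(x, a) + d(y, a) + d(x, y)). The record does not give the paper's formulas, so the code below makes two assumptions: the plain taxicab (L1) distance stands in for the paper's fuzzy Hamming distance, and the all-absent string is used as the reference point. The names `l1_distance` and `steinhaus` are illustrative and do not come from the paper.

```python
from typing import Callable, Sequence

Vector = Sequence[float]
Distance = Callable[[Vector, Vector], float]


def l1_distance(x: Vector, y: Vector) -> float:
    """Taxicab distance; on crisp 0/1 strings it equals the Hamming distance.

    Assumption: used here as a stand-in, not the paper's fuzzy Hamming distance.
    """
    return sum(abs(a - b) for a, b in zip(x, y))


def steinhaus(d: Distance, a: Vector) -> Distance:
    """Return the Steinhaus transform of the distance d with respect to point a."""
    def transformed(x: Vector, y: Vector) -> float:
        denom = d(x, a) + d(y, a) + d(x, y)
        # By convention the transformed distance is 0 when x = y = a.
        return 0.0 if denom == 0 else 2.0 * d(x, y) / denom
    return transformed


# Reference point: every feature crisply absent (an assumption of this sketch).
origin = (0.0, 0.0, 0.0, 0.0)
jaccard_like = steinhaus(l1_distance, origin)

# Two crisp "languages": they share feature 1, differ on features 2 and 3,
# and both lack feature 4, which therefore does not affect the result.
x = (1.0, 1.0, 0.0, 0.0)
y = (1.0, 0.0, 1.0, 0.0)
print(jaccard_like(x, y))  # 2/3, the ordinary Jaccard distance

# The same function accepts truth degrees in [0, 1].
u = (0.8, 0.5, 0.0, 0.1)
v = (0.6, 0.0, 0.5, 0.0)
print(jaccard_like(u, v))
```

On crisp strings this reduces to the usual Jaccard distance, which is why features absent in both languages drop out of the comparison. That is the irrelevance effect the summary describes.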