Approximating the Schema of a Set of Documents by Means of Resemblance

Bibliographic Details
Published in	Journal on Data Semantics, Vol. 7, no. 2, pp. 87-105
Main Authors	Abelló, Alberto; de Palol, Xavier; Hacid, Mohand-Saïd
Format	Journal Article Publication
Language	English
Published	Berlin/Heidelberg: Springer Berlin Heidelberg, 01.06.2018; Springer Nature B.V; Springer
Subjects	Algorithms; Artificial Intelligence; Automatic data collection systems; Classificació automàtica; Computer Science; Data mining; Database Management; Design; Document; Information Storage and Retrieval; Information Systems Applications (incl. Internet); Informàtica; IT in Business; Mineria de dades; Original Article; Sistemes d'informació; XML; Àrees temàtiques de la UPC

More Information
Summary:	The WWW contains a huge number of documents. Some of them share the same subject but are generated by different people or even by different organizations. A semi-structured model allows sharing documents that do not have exactly the same structure. However, it does not facilitate the understanding of such heterogeneous documents. In this paper, we offer a characterization of, and an algorithm to obtain, a representative (in terms of a resemblance function) of a set of heterogeneous semi-structured documents. We approximate the representative so that the resemblance function is maximized. Then, the algorithm is generalized to deal with repetitions and different classes of documents. Although an exact representative could always be found using an unlimited number of optional elements, it would cause an overfitting problem. The size of an exact representative for a set of heterogeneous documents may even make it useless. Our experiments show that, for users, it is easier and faster to deal with smaller representatives, even compensating for the loss in the approximation.
ISSN:	1861-2032 1861-2040
DOI:	10.1007/s13740-018-0088-0