Approximating the Schema of a Set of Documents by Means of Resemblance

The WWW contains a huge number of documents. Some of them share the same subject but are generated by different people or even by different organizations. A semi-structured model allows documents that do not have exactly the same structure to be shared. However, it does not facilitate the understanding...

Bibliographic Details
Published in	Journal on data semantics Vol. 7; no. 2; pp. 87 - 105
Main Authors	Abelló, Alberto; de Palol, Xavier; Hacid, Mohand-Saïd
Format	Journal Article Publication
Language	English
Published	Berlin/Heidelberg: Springer Berlin Heidelberg; Springer Nature B.V; Springer, 01.06.2018
Subjects	Algorithms; Artificial Intelligence; Automatic data collection systems; Classificació automàtica; Computer Science; Data mining; Database Management; Design; Document; Information Storage and Retrieval; Information Systems Applications (incl. Internet); Informàtica; IT in Business; Mineria de dades; Original Article; Sistemes d'informació; XML; Àrees temàtiques de la UPC
ISSN	1861-2032 1861-2040
DOI	10.1007/s13740-018-0088-0

Abstract	The WWW contains a huge number of documents. Some of them share the same subject but are generated by different people or even by different organizations. A semi-structured model allows documents that do not have exactly the same structure to be shared. However, it does not facilitate the understanding of such heterogeneous documents. In this paper, we offer a characterization and an algorithm to obtain a representative (in terms of a resemblance function) of a set of heterogeneous semi-structured documents. We approximate the representative so that the resemblance function is maximized. Then, the algorithm is generalized to deal with repetitions and different classes of documents. Although an exact representative could always be found using an unlimited number of optional elements, it would cause an overfitting problem. The size of an exact representative for a set of heterogeneous documents may even make it useless. Our experiments show that, for users, it is easier and faster to deal with smaller representatives, and that this even compensates for the loss in the approximation.
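The idea of approximating a representative by maximizing a resemblance function can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: this record does not define the paper's resemblance function or its algorithm, so the sketch models each document as a set of element paths, uses Jaccard similarity as a stand-in resemblance measure, and applies a simple frequency-ordered greedy search. The names `jaccard`, `mean_resemblance` and `approximate_representative` are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Stand-in resemblance (assumption): Jaccard similarity of two path sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_resemblance(candidate, docs):
    """Average resemblance between a candidate representative and all documents."""
    return sum(jaccard(candidate, d) for d in docs) / len(docs)


def approximate_representative(docs):
    """Greedy heuristic (not the authors' algorithm): add element paths from
    most to least frequent and keep the prefix with the best mean resemblance."""
    if not docs:
        raise ValueError("at least one document is required")
    counts = Counter(path for d in docs for path in d)
    # Most frequent paths first; ties broken alphabetically for determinism.
    ordered = sorted(counts, key=lambda p: (-counts[p], p))
    best = frozenset()
    best_score = mean_resemblance(best, docs)
    current = set()
    for path in ordered:
        current.add(path)
        score = mean_resemblance(current, docs)
        if score > best_score:
            best, best_score = frozenset(current), score
    return best, best_score


# Hypothetical documents, each reduced to the set of element paths it contains.
docs = [
    frozenset({"book/title", "book/author", "book/year"}),
    frozenset({"book/title", "book/author", "book/isbn"}),
    frozenset({"book/title", "book/author"}),
]
rep, score = approximate_representative(docs)
print(sorted(rep), round(score, 3))
```

On this toy input the rarely used paths `book/year` and `book/isbn` are left out because adding them lowers the average resemblance, which loosely mirrors the abstract's point that a larger, exact representative is not necessarily the more useful one.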
Author_xml	1. Abelló, Alberto (ORCID: 0000-0002-3223-2186; email: aabello@essi.upc.edu), Dept. de Llenguatges i Sistemes Informàtics, U. Politècnica de Catalunya; 2. de Palol, Xavier, Age Fotostock; 3. Hacid, Mohand-Saïd, LIRIS-UFR d'Informatique, U. Claude Bernard Lyon 1
BackLink	https://hal.science/hal-01971563$$DView record in HAL
CitedBy_id	crossref_primary_10_2174_0126662558273437231204061106 crossref_primary_10_1007_s10844_018_0536_1
ContentType	Journal Article Publication
Contributor	Universitat Politècnica de Catalunya. Departament d'Enginyeria de Serveis i Sistemes d'Informació Universitat Politècnica de Catalunya. inSSIDE - integrated Software, Service, Information and Data Engineering
Contributor_xml	– sequence: 1 fullname: Universitat Politècnica de Catalunya. Departament d'Enginyeria de Serveis i Sistemes d'Informació – sequence: 2 fullname: Universitat Politècnica de Catalunya. inSSIDE - integrated Software, Service, Information and Data Engineering
Copyright	Springer-Verlag GmbH Germany, part of Springer Nature 2018 Copyright Springer Science & Business Media 2018 info:eu-repo/semantics/openAccess Distributed under a Creative Commons Attribution 4.0 International License
Copyright_xml	– notice: Springer-Verlag GmbH Germany, part of Springer Nature 2018 – notice: Copyright Springer Science & Business Media 2018 – notice: info:eu-repo/semantics/openAccess – notice: Distributed under a Creative Commons Attribution 4.0 International License
DatabaseName	CrossRef Recercat Hyper Article en Ligne (HAL)
DatabaseTitle	CrossRef
Discipline	Engineering Computer Science
EISSN	1861-2040
EndPage	105
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	true
IsScholarly	true
Issue	2
Keywords	Design; Document; XML
Language	English
License	Distributed under a Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0
ORCID	0000-0002-3223-2186
OpenAccessLink	https://recercat.cat/handle/2072/330669
PageCount	19
PublicationCentury	2000
PublicationDate	2018-06-01
PublicationDateYYYYMMDD	2018-06-01
PublicationDate_xml	– month: 06 year: 2018 text: 2018-06-01 day: 01
PublicationDecade	2010
PublicationPlace	Berlin/Heidelberg
PublicationPlace_xml	– name: Berlin/Heidelberg – name: Heidelberg
PublicationSubtitle	Concepts and Ideas for Building Knowledgeable Systems
PublicationTitle	Journal on data semantics
PublicationTitleAbbrev	J Data Semant
PublicationYear	2018
Publisher	Springer Berlin Heidelberg; Springer Nature B.V; Springer
Publisher_xml	– name: Springer Berlin Heidelberg – name: Springer Nature B.V – name: Springer
StartPage	87
URI	https://link.springer.com/article/10.1007/s13740-018-0088-0 https://www.proquest.com/docview/2052868119 https://recercat.cat/handle/2072/330669 https://hal.science/hal-01971563
Volume	7