What’s New? What’s Certain? – Scoring Search Results in the Presence of Overlapping Data Sources

Bibliographic Details
Published in	Data Integration in the Life Sciences, pp. 231-246
Main Authors	Hussels, Philipp; Trißl, Silke; Leser, Ulf
Format	Book Chapter
Language	English
Published	Berlin, Heidelberg: Springer Berlin Heidelberg, 2007
Series	Lecture Notes in Computer Science
ISBN	3540732543 9783540732549
ISSN	0302-9743 1611-3349
DOI	10.1007/978-3-540-73255-6_19

Abstract	Data integration projects in the life sciences often gather data on a particular subject from multiple sources. Some of these sources overlap to a certain degree. Therefore, integrated search results may be supported by one, a few, or all data sources. To reflect these differences, results should be ranked according to the number of data sources that support them. What such a ranking should look like is not clear per se: either results supported by only a few sources are ranked high because this information is potentially new, or such results are ranked low because the strength of evidence supporting them is limited. We present two scoring schemes to rank search results in the integrated protein annotation database Columba. We define a surprisingness score, preferring results supported by few sources, and a confidence score, preferring frequently encountered information. Unlike many other scoring schemes, our proposal is purely data-driven and does not require users to specify preferences among sources. Both scores take the concrete overlaps of data sources into account and do not presume statistical independence. We show how our schemes have been implemented efficiently using SQL.
Author	Trißl, Silke Leser, Ulf Hussels, Philipp
Author_xml	– sequence: 1 givenname: Philipp surname: Hussels fullname: Hussels, Philipp email: hussels@informatik.hu-berlin.de organization: Humboldt-Universität zu Berlin, Institute of Computer Sciences, D-10099 Berlin, Germany – sequence: 2 givenname: Silke surname: Trißl fullname: Trißl, Silke email: trissl@informatik.hu-berlin.de organization: Humboldt-Universität zu Berlin, Institute of Computer Sciences, D-10099 Berlin, Germany – sequence: 3 givenname: Ulf surname: Leser fullname: Leser, Ulf email: leser@informatik.hu-berlin.de organization: Humboldt-Universität zu Berlin, Institute of Computer Sciences, D-10099 Berlin, Germany
ContentType	Book Chapter
Copyright	Springer Berlin Heidelberg 2007
Copyright_xml	– notice: Springer Berlin Heidelberg 2007
Discipline	Biology Computer Science
EISBN	3540732551 9783540732556
EISSN	1611-3349
Editor	Cohen-Boulakia, Sarah Tannen, Val
Editor_xml	– sequence: 1 givenname: Sarah surname: Cohen-Boulakia fullname: Cohen-Boulakia, Sarah email: sarahcb@seas.upenn.edu – sequence: 2 givenname: Val surname: Tannen fullname: Tannen, Val email: val@cis.upenn.edu
EndPage	246
IsPeerReviewed	true
IsScholarly	true
Language	English
PageCount	16
PublicationCentury	2000
PublicationDate	2007
PublicationDateYYYYMMDD	2007-01-01
PublicationDate_xml	– year: 2007 text: 2007
PublicationDecade	2000
PublicationPlace	Berlin, Heidelberg
PublicationPlace_xml	– name: Berlin, Heidelberg
PublicationSeriesSubtitle	Lecture Notes in Bioinformatics
PublicationSeriesTitle	Lecture Notes in Computer Science
PublicationSeriesTitleAlternate	LNCS
PublicationSubtitle	4th International Workshop, DILS 2007, Philadelphia, PA, USA, June 27-29, 2007. Proceedings
PublicationTitle	Data Integration in the Life Sciences
PublicationYear	2007
Publisher	Springer Berlin Heidelberg
Publisher_xml	– name: Springer Berlin Heidelberg
SourceID	springer
SourceType	Publisher
StartPage	231
Title	What’s New? What’s Certain? – Scoring Search Results in the Presence of Overlapping Data Sources
URI	http://link.springer.com/10.1007/978-3-540-73255-6_19
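
The abstract describes ranking integrated results by how many data sources support them, in one of two opposite directions. The following is a minimal sketch of that basic idea only, assuming a hypothetical table `support(result, source)` and made-up rows; it is not the scoring defined in the chapter, which additionally takes the concrete overlaps between sources into account. The table name, column names, function name `rank`, and sample data are all illustrative assumptions.

```python
import sqlite3

# Hypothetical toy data: (result, supporting source). Not taken from Columba.
SUPPORT = [
    ("protein_A", "src1"), ("protein_A", "src2"), ("protein_A", "src3"),
    ("protein_B", "src1"),
    ("protein_C", "src2"), ("protein_C", "src3"),
]


def rank(order):
    """Rank results by their number of distinct supporting sources.

    order="DESC" puts well-supported results first (a naive confidence-like
    ranking); order="ASC" puts rarely supported results first (a naive
    surprisingness-like ranking). This is a plain count, an assumption made
    for illustration, not the overlap-aware scores from the chapter.
    """
    if order not in ("ASC", "DESC"):
        raise ValueError("order must be 'ASC' or 'DESC'")
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE support (result TEXT, source TEXT)")
    con.executemany("INSERT INTO support VALUES (?, ?)", SUPPORT)
    rows = con.execute(
        "SELECT result, COUNT(DISTINCT source) AS n_sources "
        "FROM support GROUP BY result "
        f"ORDER BY n_sources {order}, result"  # order is validated above
    ).fetchall()
    con.close()
    return rows


confidence_like = rank("DESC")
surprise_like = rank("ASC")
```

In this sketch the same grouped count drives both rankings and only the sort direction changes, which mirrors the tension the abstract describes between rewarding novelty and rewarding strength of evidence.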