What’s New? What’s Certain? – Scoring Search Results in the Presence of Overlapping Data Sources
Published in | Data Integration in the Life Sciences pp. 231 - 246 |
---|---|
Main Authors | Hussels, Philipp; Trißl, Silke; Leser, Ulf |
Format | Book Chapter |
Language | English |
Published | Berlin, Heidelberg: Springer Berlin Heidelberg, 2007 |
Series | Lecture Notes in Computer Science |
Online Access | Get full text |
ISBN | 3540732543 9783540732549 |
ISSN | 0302-9743 1611-3349 |
DOI | 10.1007/978-3-540-73255-6_19 |
Abstract | Data integration projects in the life sciences often gather data on a particular subject from multiple sources. Some of these sources overlap to a certain degree. Therefore, integrated search results may be supported by one, a few, or all data sources. To reflect these differences, results should be ranked according to the number of data sources that support them. What such a ranking should look like is not clear per se: either results supported by only a few sources are ranked high because this information is potentially new, or such results are ranked low because the strength of evidence supporting them is limited.
We present two scoring schemes to rank search results in the integrated protein annotation database Columba. We define a surprisingness score, which prefers results supported by few sources, and a confidence score, which prefers frequently encountered information. Unlike many other scoring schemes, our proposal is purely data-driven and does not require users to specify preferences among sources. Both scores take the concrete overlaps of data sources into account and do not presume statistical independence. We show how our schemes have been implemented efficiently using SQL. |
---|---|
Author | Trißl, Silke; Leser, Ulf; Hussels, Philipp |
Author_xml | – sequence: 1 givenname: Philipp surname: Hussels fullname: Hussels, Philipp email: hussels@informatik.hu-berlin.de organization: Humboldt-Universität zu Berlin, Institute of Computer Sciences, D-10099 Berlin, Germany – sequence: 2 givenname: Silke surname: Trißl fullname: Trißl, Silke email: trissl@informatik.hu-berlin.de organization: Humboldt-Universität zu Berlin, Institute of Computer Sciences, D-10099 Berlin, Germany – sequence: 3 givenname: Ulf surname: Leser fullname: Leser, Ulf email: leser@informatik.hu-berlin.de organization: Humboldt-Universität zu Berlin, Institute of Computer Sciences, D-10099 Berlin, Germany |
ContentType | Book Chapter |
Copyright | Springer Berlin Heidelberg 2007 |
Copyright_xml | – notice: Springer Berlin Heidelberg 2007 |
Discipline | Biology Computer Science |
EISBN | 3540732551 9783540732556 |
EISSN | 1611-3349 |
Editor | Cohen-Boulakia, Sarah; Tannen, Val |
Editor_xml | – sequence: 1 givenname: Sarah surname: Cohen-Boulakia fullname: Cohen-Boulakia, Sarah email: sarahcb@seas.upenn.edu – sequence: 2 givenname: Val surname: Tannen fullname: Tannen, Val email: val@cis.upenn.edu |
EndPage | 246 |
ISBN | 3540732543 9783540732549 |
ISSN | 0302-9743 |
IsPeerReviewed | true |
IsScholarly | true |
Language | English |
PageCount | 16 |
PublicationCentury | 2000 |
PublicationDate | 2007 |
PublicationDateYYYYMMDD | 2007-01-01 |
PublicationDate_xml | – year: 2007 text: 2007 |
PublicationDecade | 2000 |
PublicationPlace | Berlin, Heidelberg |
PublicationPlace_xml | – name: Berlin, Heidelberg |
PublicationSeriesSubtitle | Lecture Notes in Bioinformatics |
PublicationSeriesTitle | Lecture Notes in Computer Science |
PublicationSeriesTitleAlternate | LNCS |
PublicationSubtitle | 4th International Workshop, DILS 2007, Philadelphia, PA, USA, June 27-29, 2007. Proceedings |
PublicationTitle | Data Integration in the Life Sciences |
PublicationYear | 2007 |
Publisher | Springer Berlin Heidelberg |
Publisher_xml | – name: Springer Berlin Heidelberg |
SourceID | springer |
SourceType | Publisher |
StartPage | 231 |
Title | What’s New? What’s Certain? – Scoring Search Results in the Presence of Overlapping Data Sources |
URI | http://link.springer.com/10.1007/978-3-540-73255-6_19 |