What’s New? What’s Certain? – Scoring Search Results in the Presence of Overlapping Data Sources

Bibliographic Details
Published in Data Integration in the Life Sciences, pp. 231-246
Main Authors Hussels, Philipp; Trißl, Silke; Leser, Ulf
Format Book Chapter
Language English
Published Berlin, Heidelberg: Springer Berlin Heidelberg, 2007
Series Lecture Notes in Computer Science
ISBN 3540732543
9783540732549
ISSN 0302-9743
1611-3349
DOI 10.1007/978-3-540-73255-6_19

Abstract Data integration projects in the life sciences often gather data on a particular subject from multiple sources. Some of these sources overlap to a certain degree. Therefore, integrated search results may be supported by one, a few, or all data sources. To reflect these differences, results should be ranked according to the number of data sources that support them. What such a ranking should look like is not clear per se. Either results supported by only a few sources are ranked high because this information is potentially new, or such results are ranked low because the strength of evidence supporting them is limited. We present two scoring schemes to rank search results in the integrated protein annotation database Columba. We define a surprisingness score, preferring results supported by few sources, and a confidence score, preferring frequently encountered information. Unlike many other scoring schemes, our proposal is purely data-driven and does not require users to specify preferences among sources. Both scores take the concrete overlaps of data sources into account and do not presume statistical independence. We show how our schemes have been implemented efficiently using SQL.
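The two ranking directions described in the abstract can be illustrated with a minimal sketch. It is not the chapter's scoring scheme: it only counts supporting sources per result and ignores the overlap between sources that the actual scores take into account. The table name `hits`, the function `rank_by_support`, and all result and source names are hypothetical placeholders, and an in-memory SQLite database stands in for whatever SQL system Columba uses.

```python
import sqlite3

# Hypothetical toy data: which source supports which result.
# Result and source names are placeholders, not taken from Columba.
ROWS = [
    ("r1", "source_a"), ("r1", "source_b"), ("r1", "source_c"),
    ("r2", "source_a"),
    ("r3", "source_b"), ("r3", "source_c"),
]


def rank_by_support(rows, prefer_rare):
    """Return (result_id, support) pairs ordered by supporting-source count.

    prefer_rare=True puts results with few supporting sources first
    (potentially new information); prefer_rare=False puts results with
    many supporting sources first (stronger evidence).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hits (result_id TEXT, source TEXT)")
    conn.executemany("INSERT INTO hits VALUES (?, ?)", rows)
    direction = "ASC" if prefer_rare else "DESC"
    query = (
        "SELECT result_id, COUNT(DISTINCT source) AS support "
        "FROM hits GROUP BY result_id "
        f"ORDER BY support {direction}, result_id"
    )
    ranked = conn.execute(query).fetchall()
    conn.close()
    return ranked


surprising_first = rank_by_support(ROWS, prefer_rare=True)
confident_first = rank_by_support(ROWS, prefer_rare=False)
print(surprising_first)
print(confident_first)
```

Plain counting treats every source as equally informative. A score that accounts for overlap, as the abstract describes, would presumably weight a result differently depending on which sources support it, which this sketch does not attempt.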
Author Hussels, Philipp (hussels@informatik.hu-berlin.de)
Trißl, Silke (trissl@informatik.hu-berlin.de)
Leser, Ulf (leser@informatik.hu-berlin.de)
Affiliation Humboldt-Universität zu Berlin, Institute of Computer Sciences, D-10099 Berlin, Germany
Copyright Springer Berlin Heidelberg 2007
Discipline Biology
Computer Science
EISBN 3540732551
9783540732556
EISSN 1611-3349
Editor Cohen-Boulakia, Sarah (sarahcb@seas.upenn.edu)
Tannen, Val (val@cis.upenn.edu)
Peer Reviewed Yes
Page Count 16
Series Subtitle Lecture Notes in Bioinformatics
Proceedings 4th International Workshop, DILS 2007, Philadelphia, PA, USA, June 27-29, 2007