Language Independent Tokenization vs. Stemming in Automated Detection of Health Websites’ HONcode Conformity: An Evaluation

Bibliographic Details
Published in Procedia computer science Vol. 64; pp. 224-231
Main Authors Boyer, Célia, Dolamic, Ljiljana, Falquet, Gilles
Format Journal Article
Language English
Published Elsevier B.V., 2015
Abstract The authors evaluated supervised automatic classification algorithms for determining whether health-related web pages comply with individual HONcode criteria of conduct (www.hon.ch/Conduct.html). The study represented healthcare web page documents as character n-gram vectors of varying length rather than as the traditional word vectors. The training/testing collection comprised web page fragments that HONcode experts had cited as the basis for individual HONcode compliance during the manual certification process (described below). Using a Naive Bayes classifier and DF (document frequency) dimensionality reduction, the authors compared the classification performance of n-gram tokenization with that of document words and Porter-stemmed document words. The aim was to determine whether the automated, language-independent approach might safely replace single-word-based classification. Using 5-grams as document features, the authors also compared the baseline DF reduction function with Chi-square and Z-score dimensionality reduction. The Z-score approach improved precision statistically significantly for some HONcode compliance components, whereas Chi-square was unreliable, performing very well for some criteria and poorly for others. Overall, the study results indicate that n-gram tokenization provides a potentially viable alternative to document word stemming.
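As an illustration of the pipeline the abstract describes, the following is a minimal Python sketch of character n-gram tokenization, DF-based feature selection, and a multinomial Naive Bayes classifier. It is not the authors' implementation: the function and class names, the whitespace normalisation, the Laplace smoothing, and the `top_k` cut-off are all assumptions made for this example, and the Chi-square and Z-score reductions are not shown.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=5):
    """Return overlapping character n-grams of the lowercased text.

    Whitespace runs are collapsed to one space (an assumption of this sketch).
    """
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def select_by_df(docs_tokens, top_k):
    """Keep the top_k features with the highest document frequency (DF)."""
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))  # count each feature once per document
    # Sort by DF descending, then alphabetically so ties break deterministically.
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return {feature for feature, _ in ranked[:top_k]}


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs_tokens, labels, vocab):
        self.vocab = vocab
        self.n_docs = len(labels)
        self.class_docs = Counter(labels)
        self.counts = defaultdict(Counter)
        for tokens, label in zip(docs_tokens, labels):
            self.counts[label].update(t for t in tokens if t in vocab)
        self.total = {c: sum(self.counts[c].values()) for c in self.class_docs}
        return self

    def predict(self, tokens):
        best, best_score = None, -math.inf
        for c in sorted(self.class_docs):
            # Log prior plus smoothed log likelihood of each retained feature.
            score = math.log(self.class_docs[c] / self.n_docs)
            denom = self.total[c] + len(self.vocab)
            for t in tokens:
                if t in self.vocab:
                    score += math.log((self.counts[c][t] + 1) / denom)
            if score > best_score:
                best, best_score = c, score
        return best
```

In use, each training fragment would be passed through `char_ngrams`, the vocabulary restricted with `select_by_df`, and one classifier fitted per HONcode criterion. Swapping `char_ngrams` for a word or stemmed-word tokenizer would give the kind of comparison the abstract reports.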
Author Falquet, Gilles
Dolamic, Ljiljana
Boyer, Célia
Author_xml – sequence: 1
  givenname: Célia
  surname: Boyer
  fullname: Boyer, Célia
  email: celia.boyer@healthonnet.org
  organization: Health on the Net Foundation, Geneva, Switzerland
– sequence: 2
  givenname: Ljiljana
  surname: Dolamic
  fullname: Dolamic, Ljiljana
  organization: Health on the Net Foundation, Geneva, Switzerland
– sequence: 3
  givenname: Gilles
  surname: Falquet
  fullname: Falquet, Gilles
  organization: University of Geneva, Geneva, Switzerland
CitedBy_id crossref_primary_10_1108_OIR_01_2017_0028
crossref_primary_10_2196_52995
Cites_doi 10.1145/1390334.1390518
10.14569/IJACSA.2014.050309
10.4066/AMJ.2014.1900
10.1007/978-3-540-73599-1_24
10.1007/s100320050012
10.1016/j.chb.2010.03.013
10.1016/S0306-4573(97)00027-7
10.1007/978-3-540-30222-3_27
10.3115/1572433.1572439
10.1017/CBO9780511809071
10.1016/j.socscimed.2007.01.012
10.1023/B:INRT.0000009441.78971.be
10.1145/505282.505283
10.2196/jmir.3831
ContentType Journal Article
Copyright 2015 The Authors
Copyright_xml – notice: 2015 The Authors
DBID 6I.
AAFTH
AAYXX
CITATION
DOI 10.1016/j.procs.2015.08.484
DatabaseName ScienceDirect Open Access Titles
Elsevier:ScienceDirect:Open Access
CrossRef
DatabaseTitle CrossRef
DatabaseTitleList
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISSN 1877-0509
EndPage 231
ExternalDocumentID 10_1016_j_procs_2015_08_484
S1877050915026198
ID FETCH-LOGICAL-c348t-f5c3f7a277894fd8d35c8495986033518abfabdc91eb46c67bb039ae266e71043
IEDL.DBID IXB
ISSN 1877-0509
IngestDate Fri Aug 23 00:54:33 EDT 2024
Wed May 17 00:08:13 EDT 2023
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Keywords HONcode
N-gram
Machine learning
Language English
License http://creativecommons.org/licenses/by-nc-nd/4.0
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-c348t-f5c3f7a277894fd8d35c8495986033518abfabdc91eb46c67bb039ae266e71043
OpenAccessLink https://www.sciencedirect.com/science/article/pii/S1877050915026198
PageCount 8
ParticipantIDs crossref_primary_10_1016_j_procs_2015_08_484
elsevier_sciencedirect_doi_10_1016_j_procs_2015_08_484
PublicationCentury 2000
PublicationDate 2015
2015-00-00
PublicationDateYYYYMMDD 2015-01-01
PublicationDate_xml – year: 2015
  text: 2015
PublicationDecade 2010
PublicationTitle Procedia computer science
PublicationYear 2015
Publisher Elsevier B.V
Publisher_xml – name: Elsevier B.V
References Fahy E., Hardikar R., Fox A., Mackay S. Quality of patient health information on the Internet: reviewing a complex and evolving landscape. Australasian Med J. 2014; 7(1):24-28. PMID: 24567763.
Mc Namee P., Mayfield J., Nicholas C.K. Don’t have a stemmer?: be un+concern+ed. In Sung-Hyon Myaeng, Douglas W. Oard, Fabrizio Sebastiani, Tat-Seng Chua, and Mun-Kew Leong, editors, SIGIR, ACM, 2008; 813-814.
Eysenbach G., Powell J., Kuss O., Sa ER. Empirical Studies Assessing the Quality of Health Information for Consumers on the World Wide Web – A Systematic Review. Journal of the American Medical Association (JAMA) 2002; 287(20):2691-2700.
Manning C.D., Raghavan P., Schutze H. Introduction to information retrieval. Cambridge University Press, 2008.
Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys 2002; 34:1-47. doi:10.1145/505282.505283.
Mc Namee, Mayfield. Character n-gram tokenization for European language text retrieval. Information Retrieval 2004; 7(1-2):73-97. doi:10.1023/B:INRT.0000009441.78971.be.
Sillence, Briggs, Harris, Fishwick. How do patients evaluate and make use of online health information? Social Science and Medicine 2007; 64(9):1853-1862. doi:10.1016/j.socscimed.2007.01.012.
Tomlinson, S. (2004). Lexical and algorithmic stemming compared for 9 European languages with Hummingbird SearchServer™ at CLEF 2003. In Comparative evaluation of multilingual information access systems. LNCS #3237 (pp. 286-300). Berlin: Springer-Verlag.
Zubaryeva O., Savoy J. Investigation in statistical language-independent approaches for opinion detection in English, Chinese and Japanese. In Proceedings of the Third International Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies (CLIAWS3 ‘09). Association for Computational Linguistics, Stroudsburg, PA, USA, 2009; 38-45.
Gaudinat, Grabar, Boyer. Automatic Retrieval of Web Pages with Standards of Ethics and Trustworthiness Within a Medical Portal: What a Page Name Tells Us. Artificial Intelligence in Medicine [Internet]. Springer, 2007; 185-189. doi:10.1007/978-3-540-73599-1_24.
Boyer, Dolamic. Automated Detection of HONcode Website Conformity Compared to Manual Detection: An Evaluation. J Med Internet Res 2015; 17(6):e135. doi:10.2196/jmir.3831.
Savoy. Statistical inference in retrieval effectiveness evaluation. Information Processing & Management 1997; 33(4):495-512. doi:10.1016/S0306-4573(97)00027-7.
Williams K., Calvo R.A. A framework for document categorization. 7th Australasian Document Computing Symposium. December 2002. Sydney, Australia. 13-19.
Boyer, Dolamic. Feasibility of automated detection of HONcode conformity for health related websites. IJACSA Mar 2014; 5(3):69-74. doi:10.14569/IJACSA.2014.050309.
Gaudinat A., Grabar N., Boyer C. Machine learning approach for automatic quality criteria detection of health web pages. In Klaus A. Kuhn, James R. Warren, and Tze-Yun Leong, editors, MedInfo, Studies in Health Technology and Informatics, IOS Press, 2007; 129:705-709.
Baeza-Yates R.A., Ribeiro-Neto B. Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA. 1999.
Cavnar W.B., Trenkle J.M. N-Gram-Based Text Categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, 1994, 161-175.
European citizens’ digital health literacy, 2014-12-04. URL: http://ec.europa.eu/public_opinion/flash/fl_404_en.pdf.
Beldad, de Jong, Steehouder. How shall I trust the faceless and the intangible? A literature review on the antecedents of online trust. Computers in Human Behavior 2010; 26(5):857-869. doi:10.1016/j.chb.2010.03.013.
Naji N., Savoy J., Dolamic L. Recherche d’information dans un corpus bruité (OCR). In Gabriella Pasi and Patrice Bellot, editors, CORIA, 6 Editions Universitaires d’Avignon, 2011; 271-28.
Boyer C., Baujard V., Scherrer J. HONcode: a standard to improve the quality of medical/health information on the internet and HON's 5th survey on the use of internet for medical and health purposes. In 6th Internet World Congress for Biomedical Sciences (INABIS 2000), 1999.
Junker, Hoch. An experimental evaluation of OCR text representations for learning document classifiers. International Journal on Document Analysis and Recognition 1998; 1(2):116-122. doi:10.1007/s100320050012.
SSID ssj0000388917
Score 2.0404854
SourceID crossref
elsevier
SourceType Aggregation Database
Publisher
StartPage 224
URI https://dx.doi.org/10.1016/j.procs.2015.08.484
Volume 64