Language Independent Tokenization vs. Stemming in Automated Detection of Health Websites’ HONcode Conformity: An Evaluation

Bibliographic Details
Published in	Procedia computer science Vol. 64; pp. 224 - 231
Main Authors	Boyer, Célia, Dolamic, Ljiljana, Falquet, Gilles
Format	Journal Article
Language	English
Published	Elsevier B.V. 2015
Subjects	HONcode Machine learning N-gram
Abstract	The authors evaluated supervised automatic classification algorithms for determining whether health-related web pages comply with individual HONcode criteria of conduct (www.hon.ch/Conduct.html). The study represented healthcare web page documents with character n-gram vectors of varying length rather than the traditional word vectors. The training/testing collection comprised web page fragments that HONcode experts had cited as the basis for individual HONcode compliance during the manual certification process. Using a Naive Bayes classifier and document frequency (DF) dimensionality reduction, the authors compared the classification performance of n-gram tokenization with that of document words and Porter-stemmed document words. The aim was to determine whether the automated, language-independent approach might safely replace single word-based classification. Using 5-grams as document features, the authors also compared the baseline DF reduction function with Chi-square and Z-score dimensionality reduction. The Z-score approach produced a statistically significant improvement in precision for some HONcode compliance components, whereas Chi-square performance was unreliable, performing very well for some criteria and poorly for others. Overall, the results indicate that n-gram tokenization provides a potentially viable alternative to document word stemming.
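As an illustration of the pipeline the abstract describes, the following is a minimal Python sketch of character 5-gram tokenization, DF-based feature reduction, and a multinomial Naive Bayes classifier. It is not the authors' implementation: the function names (`char_ngrams`, `select_by_df`), the `NaiveBayes` class, the add-one smoothing, the `top_k` cutoff, and the toy training fragments are all assumptions made for this example, and the Chi-square and Z-score reductions are not shown.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=5):
    """Return overlapping character n-grams of lowercased, whitespace-normalised text."""
    normalised = " ".join(text.lower().split())
    return [normalised[i:i + n] for i in range(len(normalised) - n + 1)]


def select_by_df(docs_tokens, top_k):
    """Keep the top_k features with the highest document frequency (DF)."""
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))  # count each feature once per document
    # Sort by descending DF, then alphabetically so that ties are deterministic.
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return {feature for feature, _ in ranked[:top_k]}


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing (illustrative only)."""

    def fit(self, docs_tokens, labels, vocab):
        self.vocab = vocab
        self.class_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        for tokens, label in zip(docs_tokens, labels):
            self.feat_counts[label].update(t for t in tokens if t in vocab)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            counts = self.feat_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.class_counts[label] / total_docs)
            for t in tokens:
                if t in self.vocab:  # features outside the reduced vocabulary are ignored
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical fragments standing in for expert-cited web page excerpts.
train = [
    ("Last modified on 12 March 2014", "date"),
    ("Page last updated 3 June 2013", "date"),
    ("This site is funded by the university hospital", "funding"),
    ("Funding is provided by a private foundation", "funding"),
]
tokens = [char_ngrams(text) for text, _ in train]
labels = [label for _, label in train]
vocab = select_by_df(tokens, top_k=50)
model = NaiveBayes().fit(tokens, labels, vocab)
print(model.predict(char_ngrams("Last updated in May 2015")))
```

Because the features are raw character sequences, the sketch needs neither a word tokenizer nor a stemmer, which is the property that makes this kind of representation language independent.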
Author	Falquet, Gilles Dolamic, Ljiljana Boyer, Célia
Author_xml	1. Boyer, Célia (celia.boyer@healthonnet.org), Health on the Net Foundation, Geneva, Switzerland; 2. Dolamic, Ljiljana, Health on the Net Foundation, Geneva, Switzerland; 3. Falquet, Gilles, University of Geneva, Geneva, Switzerland
CitedBy_id	crossref_primary_10_1108_OIR_01_2017_0028 crossref_primary_10_2196_52995
Cites_doi	10.1145/1390334.1390518 10.14569/IJACSA.2014.050309 10.4066/AMJ.2014.1900 10.1007/978-3-540-73599-1_24 10.1007/s100320050012 10.1016/j.chb.2010.03.013 10.1016/S0306-4573(97)00027-7 10.1007/978-3-540-30222-3_27 10.3115/1572433.1572439 10.1017/CBO9780511809071 10.1016/j.socscimed.2007.01.012 10.1023/B:INRT.0000009441.78971.be 10.1145/505282.505283 10.2196/jmir.3831
ContentType	Journal Article
Copyright	2015 The Authors
Copyright_xml	– notice: 2015 The Authors
DBID	6I. AAFTH AAYXX CITATION
DOI	10.1016/j.procs.2015.08.484
DatabaseName	ScienceDirect Open Access Titles Elsevier:ScienceDirect:Open Access CrossRef
DatabaseTitle	CrossRef
DatabaseTitleList
DeliveryMethod	fulltext_linktorsrc
Discipline	Computer Science
EISSN	1877-0509
EndPage	231
ExternalDocumentID	10_1016_j_procs_2015_08_484 S1877050915026198
ISSN	1877-0509
IngestDate	Fri Aug 23 00:54:33 EDT 2024 Wed May 17 00:08:13 EDT 2023
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	true
IsScholarly	true
Keywords	HONcode N-gram Machine learning
Language	English
License	http://creativecommons.org/licenses/by-nc-nd/4.0
LinkModel	DirectLink
OpenAccessLink	https://www.sciencedirect.com/science/article/pii/S1877050915026198
PageCount	8
ParticipantIDs	crossref_primary_10_1016_j_procs_2015_08_484 elsevier_sciencedirect_doi_10_1016_j_procs_2015_08_484
PublicationCentury	2000
PublicationDate	2015 2015-00-00
PublicationDateYYYYMMDD	2015-01-01
PublicationDate_xml	– year: 2015 text: 2015
PublicationDecade	2010
PublicationTitle	Procedia computer science
PublicationYear	2015
Publisher	Elsevier B.V.
Publisher_xml	– name: Elsevier B.V
SSID	ssj0000388917
Score	2.0404854
SourceID	crossref elsevier
SourceType	Aggregation Database Publisher
StartPage	224
SubjectTerms	HONcode Machine learning N-gram
Title	Language Independent Tokenization vs. Stemming in Automated Detection of Health Websites’ HONcode Conformity: An Evaluation
URI	https://dx.doi.org/10.1016/j.procs.2015.08.484
Volume	64
hasFullText	1
inHoldings	1
isFullTextHit
isPrint