Language Independent Tokenization vs. Stemming in Automated Detection of Health Websites’ HONcode Conformity: An Evaluation
Authors evaluated supervised automatic classification algorithms for determination of health related web-page compliance with individual HONcode criteria of conduct (www.hon.ch/Conduct.html). The current study used varying length character n-gram vectors to represent healthcare web page documents –...
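The character n-gram representation named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the lowercasing and whitespace normalization, and the `min_df` threshold are all assumptions made for illustration. It extracts overlapping character n-grams (5-grams by default, the length the abstract singles out) and keeps only those that reach a minimum document frequency (DF).

```python
from collections import Counter


def char_ngrams(text, n=5):
    """Return overlapping character n-grams of a text.

    Lowercasing and whitespace collapsing are assumed preprocessing
    steps, not taken from the paper.
    """
    normalized = " ".join(text.lower().split())
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


def document_frequency(docs, n=5):
    """Count, for each n-gram, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        # set() so that each n-gram is counted once per document
        df.update(set(char_ngrams(doc, n)))
    return df


def select_by_df(docs, n=5, min_df=2):
    """Keep n-grams found in at least min_df documents (hypothetical threshold)."""
    df = document_frequency(docs, n)
    return {gram for gram, count in df.items() if count >= min_df}
```

Because the features are raw character windows, no language-specific stemmer or word tokenizer is needed, which is the sense in which the approach is language independent. The retained n-grams would then serve as features for a classifier such as Naive Bayes.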
Published in | Procedia computer science Vol. 64; pp. 224 - 231 |
---|---|
Main Authors | Boyer, Célia; Dolamic, Ljiljana; Falquet, Gilles |
Format | Journal Article |
Language | English |
Published | Elsevier B.V, 2015 |
Subjects | HONcode; Machine learning; N-gram |
Abstract | The authors evaluated supervised automatic classification algorithms for determining whether health-related web pages comply with individual HONcode criteria of conduct (www.hon.ch/Conduct.html). The study represented healthcare web page documents as character n-gram vectors of varying length rather than as the traditional word vectors. The training/testing collection comprised web page fragments that HONcode experts had cited as the basis for individual HONcode compliance during the manual certification process (described below). The authors compared the automated classification performance of n-gram tokenization with that of document words and Porter-stemmed document words, using a Naive Bayes classifier and DF (document frequency) dimensionality reduction. The study attempted to determine whether the automated, language-independent approach might safely replace single word-based classification. Using 5-grams as document features, the authors also compared the baseline DF reduction function with Chi-square and Z-score dimensionality reductions. While the Z-score approach produced a statistically significant improvement in precision for some HONcode compliance components, Chi-square performance was unreliable, performing very well for some criteria and poorly for others. Overall, the study results indicate that n-gram tokenization provides a potentially viable alternative to document word stemming. |
---|---|
Author | Falquet, Gilles; Dolamic, Ljiljana; Boyer, Célia |
Author_xml | – sequence: 1 givenname: Célia surname: Boyer fullname: Boyer, Célia email: celia.boyer@healthonnet.org organization: Health on the Net Foundation, Geneva, Switzerland – sequence: 2 givenname: Ljiljana surname: Dolamic fullname: Dolamic, Ljiljana organization: Health on the Net Foundation, Geneva, Switzerland – sequence: 3 givenname: Gilles surname: Falquet fullname: Falquet, Gilles organization: University of Geneva, Geneva, Switzerland |
CitedBy_id | crossref_primary_10_1108_OIR_01_2017_0028 crossref_primary_10_2196_52995 |
Cites_doi | 10.1145/1390334.1390518 10.14569/IJACSA.2014.050309 10.4066/AMJ.2014.1900 10.1007/978-3-540-73599-1_24 10.1007/s100320050012 10.1016/j.chb.2010.03.013 10.1016/S0306-4573(97)00027-7 10.1007/978-3-540-30222-3_27 10.3115/1572433.1572439 10.1017/CBO9780511809071 10.1016/j.socscimed.2007.01.012 10.1023/B:INRT.0000009441.78971.be 10.1145/505282.505283 10.2196/jmir.3831 |
ContentType | Journal Article |
Copyright | 2015 The Authors |
Copyright_xml | – notice: 2015 The Authors |
DBID | 6I. AAFTH AAYXX CITATION |
DOI | 10.1016/j.procs.2015.08.484 |
DatabaseName | ScienceDirect Open Access Titles Elsevier:ScienceDirect:Open Access CrossRef |
DatabaseTitle | CrossRef |
DatabaseTitleList | |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Computer Science |
EISSN | 1877-0509 |
EndPage | 231 |
ExternalDocumentID | 10_1016_j_procs_2015_08_484 S1877050915026198 |
ISSN | 1877-0509 |
IngestDate | Fri Aug 23 00:54:33 EDT 2024 Wed May 17 00:08:13 EDT 2023 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Keywords | HONcode N-gram Machine learning |
Language | English |
License | http://creativecommons.org/licenses/by-nc-nd/4.0 |
LinkModel | DirectLink |
OpenAccessLink | https://www.sciencedirect.com/science/article/pii/S1877050915026198 |
PageCount | 8 |
ParticipantIDs | crossref_primary_10_1016_j_procs_2015_08_484 elsevier_sciencedirect_doi_10_1016_j_procs_2015_08_484 |
PublicationCentury | 2000 |
PublicationDate | 2015 2015-00-00 |
PublicationDateYYYYMMDD | 2015-01-01 |
PublicationDate_xml | – year: 2015 text: 2015 |
PublicationDecade | 2010 |
PublicationTitle | Procedia computer science |
PublicationYear | 2015 |
Publisher | Elsevier B.V |
Publisher_xml | – name: Elsevier B.V |
SourceID | crossref elsevier |
SourceType | Aggregation Database Publisher |
StartPage | 224 |
SubjectTerms | HONcode Machine learning N-gram |
Title | Language Independent Tokenization vs. Stemming in Automated Detection of Health Websites’ HONcode Conformity: An Evaluation |
URI | https://dx.doi.org/10.1016/j.procs.2015.08.484 |
Volume | 64 |