Automated subject classification of textual Web pages, based on a controlled vocabulary: Challenges and recommendations
Published in | The new review of hypermedia and multimedia Vol. 12; no. 1; pp. 11 - 27 |
---|---|
Main Author | Golub, Koraljka |
Format | Journal Article |
Language | English |
Published | Taylor & Francis Group, 01.06.2006 |
Subjects | |
ISSN | 1361-4568, 1740-7842 |
DOI | 10.1080/13614560600774313 |
Abstract | The primary objective of this study was to identify and address problems of applying a controlled vocabulary in automated subject classification of textual Web pages, in the area of engineering. Web pages have special characteristics such as structural information, but are at the same time rather heterogeneous. The classification approach used comprises string-to-string matching between words in a term list extracted from the Ei (Engineering Information) thesaurus and classification scheme, and words in the text to be classified. Based on a sample of 70 Web pages, a number of problems with the term list are identified. Reasons for those problems are discussed and improvements proposed. Methods for implementing the improvements are also specified, suggesting further research. |
---|---|
Author | Golub, Koraljka |
Author_xml | sequence: 1; givenname: Koraljka; surname: Golub; fullname: Golub, Koraljka; email: koraljka.golub@it.lth.se; organization: KnowLib, Department of Information Technology, Lund University |
BackLink | https://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-37067 (Swedish Publication Index record); https://lup.lub.lu.se/record/608935 (Swedish Publication Index record); oai:portal.research.lu.se:publications/d110ca09-7881-4082-a44b-65ad246273a1 (Swedish Publication Index record) |
CitedBy_id | crossref_primary_10_1002_asi_23600 crossref_primary_10_1177_0165551511417785 crossref_primary_10_1016_j_knosys_2014_08_002 crossref_primary_10_1002_asi_21147 crossref_primary_10_1108_JD_07_2014_0103 crossref_primary_10_1002_asi_20790 crossref_primary_10_1108_LHT_11_2015_0109 crossref_primary_10_1108_LHT_04_2017_0066 crossref_primary_10_1177_0165551513514932 crossref_primary_10_1007_s11192_016_1836_2 crossref_primary_10_1108_LHT_03_2013_0030 |
Cites_doi | 10.1145/331499.331504 10.7551/mitpress/3828.001.0001 10.1109/WISE.2002.1181655 10.1108/EUM0000000007030 10.1007/11551362_33 10.1016/S0169-7552(98)00035-X 10.1016/0306-4573(88)90021-0 10.1002/1532-2890(2001)9999:9999<::AID-ASI1083>3.0.CO;2-1 |
ContentType | Journal Article |
Copyright | Copyright Taylor & Francis Group, LLC 2006 |
CorporateAuthor | Lund University; Faculty of Engineering, LTH; Departments at LTH; Department of Electrical and Information Technology |
DatabaseName | CrossRef; Computer and Information Systems Abstracts; Technology Research Database; ProQuest Computer Science Collection; Advanced Technologies Database with Aerospace; Computer and Information Systems Abstracts Academic; Computer and Information Systems Abstracts Professional; SwePub; SWEPUB Linnéuniversitetet full text; SwePub Articles; SWEPUB Freely available online; SWEPUB Linnéuniversitetet; SwePub Articles full text; SWEPUB Lunds universitet full text; SWEPUB Lunds universitet |
Discipline | Computer Science |
EISSN | 1740-7842 |
EndPage | 27 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 1 |
OpenAccessLink | https://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-37067 |
PageCount | 17 |
PublicationCentury | 2000 |
PublicationDate | 2006-06-01 |
PublicationDecade | 2000 |
PublicationTitle | The new review of hypermedia and multimedia |
PublicationYear | 2006 |
Publisher | Taylor & Francis Group |
StartPage | 11 |
SubjectTerms | Automated subject classification; Controlled vocabulary; Electrical Engineering, Electronic Engineering, Information Engineering; Engineering and Technology; Engineering Information thesaurus and classification scheme; Library and Information Science |
URI | https://www.tandfonline.com/doi/abs/10.1080/13614560600774313 https://www.proquest.com/docview/29721954 https://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-37067 https://lup.lub.lu.se/record/608935 oai:portal.research.lu.se:publications/d110ca09-7881-4082-a44b-65ad246273a1 |
Volume | 12 |
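The abstract describes the classification approach as string-to-string matching between words in a term list and words in the text to be classified. A minimal sketch of that general idea follows. It is not the author's implementation: the term list, the class labels, and the function names `tokenize` and `classify` are all hypothetical and chosen only for illustration, and the scoring (a plain count of exact matches per class) is an assumption.

```python
import re
from collections import Counter

# Hypothetical term list mapping terms to class labels.
# These entries are illustrative only and are NOT taken from the Ei thesaurus.
TERM_LIST = {
    "heat transfer": "Thermodynamics",
    "boiler": "Thermodynamics",
    "circuit": "Electrical engineering",
    "transistor": "Electrical engineering",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def classify(text, term_list=TERM_LIST):
    """Count exact word-sequence matches of each term and sum them per class."""
    words = tokenize(text)
    scores = Counter()
    for term, label in term_list.items():
        term_words = tokenize(term)
        n = len(term_words)
        # Slide a window of the term's length over the page text.
        for i in range(len(words) - n + 1):
            if words[i:i + n] == term_words:
                scores[label] += 1
    return scores


if __name__ == "__main__":
    page = "The boiler improves heat transfer. Heat transfer matters in one circuit."
    print(classify(page).most_common())
```

Because the sketch compares strings exactly, an inflected form such as "circuits" does not match the term "circuit". Any stemming or weighting of matches would be a further design choice that this sketch does not make.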