Automated subject classification of textual Web pages, based on a controlled vocabulary: Challenges and recommendations

Bibliographic Details
Published in The new review of hypermedia and multimedia, Vol. 12, no. 1, pp. 11-27
Main Author Golub, Koraljka
Format Journal Article
Language English
Published Taylor & Francis Group, 1 June 2006
ISSN 1361-4568
1740-7842
DOI 10.1080/13614560600774313

Abstract The primary objective of this study was to identify and address problems of applying a controlled vocabulary in automated subject classification of textual Web pages, in the area of engineering. Web pages have special characteristics such as structural information, but are at the same time rather heterogeneous. The classification approach used comprises string-to-string matching between words in a term list extracted from the Ei (Engineering Information) thesaurus and classification scheme, and words in the text to be classified. Based on a sample of 70 Web pages, a number of problems with the term list are identified. Reasons for those problems are discussed and improvements proposed. Methods for implementing the improvements are also specified, suggesting further research.
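The string-to-string matching described in the abstract can be illustrated with a minimal sketch. This is not the study's implementation: the term list entries, the class codes (`CLASS_A`, `CLASS_B`), the tokenization rule, and the scoring by raw match count are all assumptions made for illustration, and the function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical term list mapping terms to class codes.
# These are illustrative placeholders, not actual Ei thesaurus entries.
TERM_LIST = {
    "heat transfer": "CLASS_A",
    "bridges": "CLASS_B",
    "concrete": "CLASS_B",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify(text, term_list=TERM_LIST):
    """Count, per class, the exact whole-word occurrences of each term.

    Scoring by raw match count is an assumption of this sketch.
    """
    tokens = tokenize(text)
    scores = Counter()
    for term, class_code in term_list.items():
        term_tokens = tokenize(term)
        n = len(term_tokens)
        # Slide a window of n tokens over the text and compare exactly.
        hits = sum(
            1
            for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == term_tokens
        )
        if hits:
            scores[class_code] += hits
    return scores
```

As written, the comparison is exact, so a page containing only "bridge" would not match the term "bridges"; any normalization of word forms would have to be added separately.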
Author Affiliation KnowLib, Department of Information Technology, Lund University
Author Email koraljka.golub@it.lth.se
ContentType Journal Article
Copyright Copyright Taylor & Francis Group, LLC 2006
CorporateAuthor Lund University
Faculty of Engineering, LTH
Departments at LTH
Department of Electrical and Information Technology
DatabaseName CrossRef
Computer and Information Systems Abstracts
Technology Research Database
ProQuest Computer Science Collection
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts – Academic
Computer and Information Systems Abstracts Professional
SwePub
SWEPUB Linnéuniversitetet full text
SwePub Articles
SWEPUB Freely available online
SWEPUB Linnéuniversitetet
SwePub Articles full text
SWEPUB Lunds universitet full text
SWEPUB Lunds universitet
Discipline Computer Science
EISSN 1740-7842
EndPage 27
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 1
OpenAccessLink https://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-37067
PageCount 17
PublicationDate 2006-06-01
PublicationTitle The new review of hypermedia and multimedia
PublicationYear 2006
Publisher Taylor & Francis Group
StartPage 11
SubjectTerms Automated subject classification
Controlled vocabulary
Electrical Engineering, Electronic Engineering, Information Engineering
Engineering and Technology
Engineering Information thesaurus and classification scheme
Library and Information Science
URI https://www.tandfonline.com/doi/abs/10.1080/13614560600774313
https://www.proquest.com/docview/29721954
https://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-37067
https://lup.lub.lu.se/record/608935
Volume 12