Automatically Categorizing Software Technologies

Bibliographic Details
Published in IEEE Transactions on Software Engineering, Vol. 46, no. 1, pp. 20-32
Main Authors Nassif, Mathieu; Treude, Christoph; Robillard, Martin P.
Format Journal Article
Language English
Published New York, IEEE, 01.01.2020
IEEE Computer Society
Abstract Informal language and the absence of a standard taxonomy for software technologies make it difficult to reliably analyze technology trends on discussion forums and other on-line venues. We propose an automated approach called Witt for the categorization of software technologies (an expanded version of the hypernym discovery problem). Witt takes as input a phrase describing a software technology or concept and returns a general category that describes it (e.g., integrated development environment), along with attributes that further qualify it (commercial, php, etc.). By extension, the approach enables the dynamic creation of lists of all technologies of a given type (e.g., web application frameworks). Our approach relies on Stack Overflow and Wikipedia, and involves numerous original domain adaptations and a new solution to the problem of normalizing automatically-detected hypernyms. We compared Witt with six independent taxonomy tools and found that, when applied to software terms, Witt demonstrated better coverage than all evaluated alternative solutions, without a corresponding degradation in false positive rate.
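The input/output contract described in the abstract (a phrase goes in; a general category plus qualifying attributes comes out; lists of technologies can be derived per category) can be illustrated with a toy sketch. This is not Witt's implementation: the hand-written descriptions, the fixed category list, the naive "is a ..." pattern, and the names `Categorization`, `categorize`, and `list_by_category` are all assumptions made purely for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Categorization:
    term: str                      # the phrase that was looked up
    category: str                  # general category (the hypernym)
    attributes: List[str] = field(default_factory=list)  # qualifiers


# Hypothetical one-sentence descriptions, written by hand for this sketch.
# A real system would draw on sources such as Stack Overflow and Wikipedia.
DESCRIPTIONS = {
    "phpstorm": "PhpStorm is a commercial php integrated development environment",
    "django": "Django is an open-source python web application framework",
    "laravel": "Laravel is an open-source php web application framework",
}

# Hypothetical fixed set of normalized categories.
KNOWN_CATEGORIES = [
    "integrated development environment",
    "web application framework",
]


def categorize(term: str) -> Optional[Categorization]:
    """Return a category and attributes for a term, or None if unknown."""
    description = DESCRIPTIONS.get(term.lower())
    if description is None:
        return None
    # Naive "X is a/an Y" pattern: everything after the copula is the
    # candidate hypernym phrase.
    match = re.search(r"\bis an? (.+)$", description)
    if match is None:
        return None
    phrase = match.group(1)
    for category in KNOWN_CATEGORIES:
        if phrase.endswith(category):
            # Words preceding the category are treated as attributes.
            qualifiers = phrase[: -len(category)].split()
            return Categorization(term, category, qualifiers)
    return None


def list_by_category(category: str) -> List[str]:
    """List every known term whose category matches the given one."""
    matches = []
    for term in DESCRIPTIONS:
        result = categorize(term)
        if result is not None and result.category == category:
            matches.append(term)
    return sorted(matches)
```

Under these assumptions, `categorize("PhpStorm")` yields the category "integrated development environment" with the attributes "commercial" and "php", and `list_by_category("web application framework")` collects the two framework entries, mirroring the examples given in the abstract.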
Author Treude, Christoph
Nassif, Mathieu
Robillard, Martin P.
Author_xml – sequence: 1
  givenname: Mathieu
  orcidid: 0000-0003-0211-7256
  surname: Nassif
  fullname: Nassif, Mathieu
  email: mnassif@cs.mcgill.ca
  organization: School of Computer Science, McGill University, Montréal, QC, Canada
– sequence: 2
  givenname: Christoph
  orcidid: 0000-0002-6919-2149
  surname: Treude
  fullname: Treude, Christoph
  email: christoph.treude@adelaide.edu.au
  organization: School of Computer Science, University of Adelaide, Adelaide, SA, Australia
– sequence: 3
  givenname: Martin P.
  surname: Robillard
  fullname: Robillard, Martin P.
  email: martin@cs.mcgill.ca
  organization: School of Computer Science, McGill University, Montréal, QC, Canada
CODEN IESEDJ
ContentType Journal Article
Copyright Copyright IEEE Computer Society 2020
Copyright_xml – notice: Copyright IEEE Computer Society 2020
DOI 10.1109/TSE.2018.2836450
Discipline Computer Science
EISSN 1939-3520
EndPage 32
ExternalDocumentID 10_1109_TSE_2018_2836450
8359344
Genre orig-research
GrantInformation_xml – fundername: Natural Sciences and Engineering Research Council of Canada
ISSN 0098-5589
IsPeerReviewed true
IsScholarly true
Issue 1
Language English
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
https://doi.org/10.15223/policy-029
https://doi.org/10.15223/policy-037
ORCID 0000-0003-0211-7256
0000-0002-6919-2149
PageCount 13
PublicationCentury 2000
PublicationDate 2020-Jan.-1
2020-1-1
20200101
PublicationDateYYYYMMDD 2020-01-01
PublicationDate_xml – month: 01
  year: 2020
  text: 2020-Jan.-1
  day: 01
PublicationDecade 2020
PublicationPlace New York
PublicationPlace_xml – name: New York
PublicationTitle IEEE transactions on software engineering
PublicationTitleAbbrev TSE
PublicationYear 2020
Publisher IEEE
IEEE Computer Society
Publisher_xml – name: IEEE
– name: IEEE Computer Society
StartPage 20
SubjectTerms Applications programs
Electronic publishing
Encyclopedias
information retrieval
Internet
natural language processing
Normalizing
Software
tagging
Taxonomy
Technology assessment
Wikipedia
Title Automatically Categorizing Software Technologies
URI https://ieeexplore.ieee.org/document/8359344
https://www.proquest.com/docview/2339335581
Volume 46