Automatically Categorizing Software Technologies
Published in | IEEE Transactions on Software Engineering, Vol. 46, no. 1, pp. 20-32 |
---|---|
Main Authors | Nassif, Mathieu; Treude, Christoph; Robillard, Martin P. |
Format | Journal Article |
Language | English |
Published | New York: IEEE; IEEE Computer Society, 01.01.2020 |
Abstract | Informal language and the absence of a standard taxonomy for software technologies make it difficult to reliably analyze technology trends on discussion forums and other on-line venues. We propose an automated approach called Witt for the categorization of software technologies (an expanded version of the hypernym discovery problem). Witt takes as input a phrase describing a software technology or concept and returns a general category that describes it (e.g., integrated development environment), along with attributes that further qualify it (commercial, php, etc.). By extension, the approach enables the dynamic creation of lists of all technologies of a given type (e.g., web application frameworks). Our approach relies on Stack Overflow and Wikipedia, and involves numerous original domain adaptations and a new solution to the problem of normalizing automatically-detected hypernyms. We compared Witt with six independent taxonomy tools and found that, when applied to software terms, Witt demonstrated better coverage than all evaluated alternative solutions, without a corresponding degradation in false positive rate. |
---|---|
Author | Nassif, Mathieu; Treude, Christoph; Robillard, Martin P. |
Author_xml | 1. Nassif, Mathieu (ORCID: 0000-0003-0211-7256; email: mnassif@cs.mcgill.ca), School of Computer Science, McGill University, Montréal, QC, Canada. 2. Treude, Christoph (ORCID: 0000-0002-6919-2149; email: christoph.treude@adelaide.edu.au), School of Computer Science, University of Adelaide, Adelaide, SA, Australia. 3. Robillard, Martin P. (email: martin@cs.mcgill.ca), School of Computer Science, McGill University, Montréal, QC, Canada |
CODEN | IESEDJ |
ContentType | Journal Article |
Copyright | Copyright IEEE Computer Society 2020 |
DOI | 10.1109/TSE.2018.2836450 |
Discipline | Computer Science |
EISSN | 1939-3520 |
EndPage | 32 |
Genre | orig-research |
GrantInformation_xml | Funder: Natural Sciences and Engineering Research Council of Canada |
ISSN | 0098-5589 |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 1 |
Language | English |
License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html https://doi.org/10.15223/policy-029 https://doi.org/10.15223/policy-037 |
ORCID | 0000-0003-0211-7256 0000-0002-6919-2149 |
PageCount | 13 |
PublicationDate | 2020-01-01 |
PublicationPlace | New York |
PublicationTitle | IEEE Transactions on Software Engineering |
PublicationTitleAbbrev | TSE |
PublicationYear | 2020 |
Publisher | IEEE; IEEE Computer Society |
StartPage | 20 |
SubjectTerms | Applications programs; Electronic publishing; Encyclopedias; information retrieval; Internet; natural language processing; Normalizing; Software; tagging; Taxonomy; Technology assessment; Wikipedia |
Title | Automatically Categorizing Software Technologies |
URI | https://ieeexplore.ieee.org/document/8359344 https://www.proquest.com/docview/2339335581 |
Volume | 46 |