Keyword Extraction from Company Websites for the Development of Regional Knowledge Maps

Bibliographic Details
Published in: Knowledge Discovery, Knowledge Engineering and Knowledge Management, Vol. 454, pp. 96-111
Main Authors: Wartena, Christian; Garcia-Alsina, Montserrat
Format: Book Chapter
Language: English
Published: Springer Berlin Heidelberg, Germany, 2015
Series: Communications in Computer and Information Science
ISBN: 3662465485, 9783662465486
ISSN: 1865-0929, 1865-0937
DOI: 10.1007/978-3-662-46549-3_7

Summary: Regional Innovation Systems describe the relations between actors, structures and infrastructures in a region in order to stimulate innovation and regional development. For these systems, the collection and organization of information is crucial. In the present paper we investigate the possibilities of extracting information from company websites. In particular, we consider faceted classification of companies by keyword extraction using a specialized thesaurus. First, we identify a number of challenges that arise when extracting information about companies from their websites. Then we describe a small-scale experiment in which keywords related to economic sectors and commodities are extracted from the websites of over 200 companies. The experiment shows that the approach is at least feasible for the commodities facet. For the sectors facet, the simple keyword extraction methods used do not perform well. We find that good coverage of the words in the text by the thesaurus is crucial, and hence that the results can be improved by adding more alternative labels to the thesaurus terms. Furthermore, we find that weighting terms according to their relations to other terms on the website, rather than by inverse document frequency as in classical tf.idf weighting, gives better results.
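
The classical tf.idf baseline mentioned in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: the tiny thesaurus, the example site texts, and the function names are all hypothetical, and only single-word labels are matched. It shows how alternative labels map words in the text to thesaurus terms, and how those terms are then ranked per website by term frequency times the log of inverse document frequency.

```python
import math
import re
from collections import Counter

# Hypothetical mini-thesaurus: preferred term -> all labels (preferred + alternative).
THESAURUS = {
    "bicycles": ["bicycle", "bicycles", "bike", "bikes"],
    "furniture": ["furniture", "chairs", "tables"],
}

# Hypothetical website texts, one string per company.
EXAMPLE_SITES = {
    "a": "We build bikes. Our bikes and bicycle parts ship worldwide. Office chairs too.",
    "b": "Tables and chairs for every office.",
    "c": "Insurance services.",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def term_counts(text, thesaurus):
    """Count occurrences of each thesaurus term via any of its labels."""
    label_to_term = {
        label: term for term, labels in thesaurus.items() for label in labels
    }
    return Counter(
        label_to_term[tok] for tok in tokenize(text) if tok in label_to_term
    )

def tfidf_keywords(sites, thesaurus):
    """Rank thesaurus terms per site by tf * log(N / df), best first."""
    counts = {name: term_counts(text, thesaurus) for name, text in sites.items()}
    n_sites = len(sites)
    # Document frequency: number of sites on which each term occurs.
    df = Counter(term for c in counts.values() for term in c)
    ranked = {}
    for name, c in counts.items():
        scores = {t: tf * math.log(n_sites / df[t]) for t, tf in c.items()}
        ranked[name] = sorted(scores, key=scores.get, reverse=True)
    return ranked

ranked = tfidf_keywords(EXAMPLE_SITES, THESAURUS)
```

In this sketch, adding more alternative labels to a thesaurus entry directly increases how many words of a site are covered, which is the effect the summary identifies as crucial. The relation-based weighting that the summary reports as better than tf.idf is not shown here, since the record does not describe how it is computed.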