Keyword Extraction from Company Websites for the Development of Regional Knowledge Maps

Bibliographic Details
Published in	Knowledge Discovery, Knowledge Engineering and Knowledge Management Vol. 454; pp. 96 - 111
Main Authors	Wartena, Christian; Garcia-Alsina, Montserrat
Format	Book Chapter
Language	English
Published	Germany: Springer Berlin Heidelberg, 2015
Series	Communications in Computer and Information Science
Subjects	Artificial intelligence; Data mining; Economic Sector; Information retrieval; Inverse Document Frequency; National Innovation System; Regional Innovation System; Weighting Scheme
ISBN	3662465485 9783662465486
ISSN	1865-0929 1865-0937
DOI	10.1007/978-3-662-46549-3_7

Summary:	Regional Innovation Systems describe the relations between actors, structures and infrastructures in a region in order to stimulate innovation and regional development. For these systems the collection and organization of information is crucial. In the present paper we investigate the possibilities for extracting information from company websites. In particular, we consider faceted classification of companies by keyword extraction using a specialized thesaurus. First we identify a number of challenges that arise when extracting information about companies from their websites. Then we describe a small-scale experiment in which keywords related to economic sectors and commodities are extracted from the websites of over 200 companies. The experiment shows that the approach is at least feasible for the commodities facet. For the sectors facet, the simple keyword extraction methods used do not perform well. We find that good coverage of the words in the text by the thesaurus is crucial, and hence that the results can be improved by adding more alternative labels to the thesaurus terms. Furthermore, we find that weighting terms according to their relations to other terms on the website gives better results than the classical tf.idf weighting, which relies on inverse document frequency.
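
Illustration: the summary mentions thesaurus-based keyword extraction and the classical tf.idf weighting used as a baseline. The sketch below shows roughly what such a baseline could look like. It is not the authors' implementation: the thesaurus entries, the website texts and the function names are all hypothetical, and the relation-based weighting that the summary reports as performing better is not shown because the summary does not specify it.

```python
import math
import re
from collections import Counter

# Hypothetical mini-thesaurus: preferred term -> labels (preferred and alternative).
THESAURUS = {
    "furniture": ["furniture", "chairs", "tables"],
    "software": ["software", "apps"],
}

# Reverse index so that every label maps to its preferred term.
LABEL_TO_TERM = {
    label: term for term, labels in THESAURUS.items() for label in labels
}


def concept_counts(text):
    """Count thesaurus terms in a text, matching single-word labels only."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(LABEL_TO_TERM[t] for t in tokens if t in LABEL_TO_TERM)


def tfidf_keywords(sites):
    """Rank thesaurus terms per site by tf * log(N / df), highest first."""
    counts = {name: concept_counts(text) for name, text in sites.items()}
    n_sites = len(sites)
    # Document frequency: number of sites on which a term occurs at all.
    doc_freq = Counter(term for c in counts.values() for term in c)
    return {
        name: sorted(
            ((term, tf * math.log(n_sites / doc_freq[term])) for term, tf in c.items()),
            key=lambda pair: (-pair[1], pair[0]),
        )
        for name, c in counts.items()
    }


# Made-up website texts, for illustration only.
sites = {
    "acme": "We build chairs and tables. Our furniture ships with free software.",
    "byteworks": "Custom software and mobile apps.",
    "greenfarm": "Organic vegetables from our farm.",
}
ranked = tfidf_keywords(sites)
```

In this toy setup the alternative labels "chairs" and "tables" both count toward the term "furniture", which reflects the summary's point that adding alternative labels increases the coverage of the text by the thesaurus. A site whose vocabulary the thesaurus does not cover at all, such as "greenfarm" here, receives no keywords.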