Keyword Extraction from Company Websites for the Development of Regional Knowledge Maps
Published in | Knowledge Discovery, Knowledge Engineering and Knowledge Management Vol. 454; pp. 96 - 111 |
---|---|
Main Authors | , |
Format | Book Chapter |
Language | English |
Published | Germany: Springer Berlin / Heidelberg, 2015 |
Series | Communications in Computer and Information Science |
ISBN | 3662465485 9783662465486 |
ISSN | 1865-0929 1865-0937 |
DOI | 10.1007/978-3-662-46549-3_7 |
Summary: | Regional Innovation Systems describe the relations between actors, structures and infrastructures in a region in order to stimulate innovation and regional development. For these systems, the collection and organization of information is crucial. In the present paper we investigate the possibilities for extracting information from company websites. In particular, we consider faceted classification of companies by keyword extraction using a specialized thesaurus. First, we identify a number of challenges that arise when extracting information about companies from their websites. Then we describe a small-scale experiment in which keywords related to economic sectors and commodities are extracted from the websites of over 200 companies. The experiment shows that the approach is at least feasible for the commodities facet. For the sectors facet, the simple keyword extraction methods used do not perform well. We find that good coverage of the words in the text by the thesaurus is crucial, and hence that the results can be improved by adding more alternative labels to the thesaurus terms. Furthermore, we find that weighting terms according to their relations to other terms on the website, rather than by inverse document frequency, gives better results than the classical tf.idf weighting. |
---|---|
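
The summary refers to thesaurus-based keyword extraction with classical tf.idf weighting as a baseline. As an illustration only, the following minimal Python sketch shows what such a baseline could look like. It is not the authors' implementation: the mini-thesaurus, the function names, the single-word label matching and the sample texts are all hypothetical, and the relation-based weighting that the chapter reports as performing better is not reproduced here.

```python
import math
import re
from collections import Counter

# Hypothetical mini-thesaurus (an assumption for illustration):
# preferred term -> alternative labels, all lowercase single words.
THESAURUS = {
    "bicycles": ["bicycle", "bike", "bikes"],
    "furniture": ["chair", "chairs", "table", "tables"],
    "software": ["app", "apps"],
}


def build_label_index(thesaurus):
    """Map every label (preferred or alternative) to its preferred term."""
    index = {}
    for preferred, alt_labels in thesaurus.items():
        index[preferred] = preferred
        for label in alt_labels:
            index[label] = preferred
    return index


def term_counts(text, label_index):
    """Count thesaurus terms in a text by matching single-word labels."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(label_index[t] for t in tokens if t in label_index)


def tfidf_keywords(site_texts, thesaurus):
    """Return, per site, the matched thesaurus terms ranked by tf.idf weight."""
    label_index = build_label_index(thesaurus)
    counts = {site: term_counts(text, label_index)
              for site, text in site_texts.items()}
    n_sites = len(counts)
    # Document frequency: number of sites on which each term occurs.
    doc_freq = Counter(term for c in counts.values() for term in c)
    ranked = {}
    for site, c in counts.items():
        weights = {t: tf * math.log(n_sites / doc_freq[t])
                   for t, tf in c.items()}
        # Sort by descending weight, then alphabetically for stable output.
        ranked[site] = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked


if __name__ == "__main__":
    # Hypothetical website texts for two companies.
    sites = {
        "a": "We sell bikes and bike parts. Our app helps.",
        "b": "Chairs and tables. Our app too.",
    }
    for site, keywords in tfidf_keywords(sites, THESAURUS).items():
        print(site, keywords)
```

In this sketch, adding more alternative labels to `THESAURUS` directly increases how many words on a site are matched, which is the coverage effect the summary describes. A term that occurs on every site, such as "software" in the sample texts, receives a tf.idf weight of zero.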