Evaluating distance-based clustering for user (browse and click) sessions in a domain-specific collection

Bibliographic Details
Published in	International journal on digital libraries Vol. 14, no. 3-4, pp. 167-179
Main Authors	Steinhauer, Jeremy; Delcambre, Lois M. L.; Lykke, Marianne; Ådland, Marit Kristine
Format	Journal Article
Language	English
Published	Berlin/Heidelberg: Springer Berlin Heidelberg; Springer Nature B.V., 01.08.2014
Subjects	Algorithms; Classification; Cluster analysis; Clustering; Collection; Computer Science; Database Management; Digital libraries; Digital systems; Information retrieval; Information Systems and Communication Service; Machine learning; Tasks; Users; Evaluation; Mechanical Turk; User study; Distance measure
Summary:	We seek to improve information retrieval in a domain-specific collection by clustering user sessions from a click log and then classifying later user sessions in real time. As a preliminary step, we explore the main assumption of this approach: whether user sessions on such a site are related to the question that the users are answering. Since a large class of machine learning algorithms uses a distance measure at its core, we evaluate the suitability of common machine learning distance measures for distinguishing sessions of users searching for the answer to the same or different questions. We found that two distance measures work very well for our task and three others do not. As a further step, we then investigate how effective the distance measures are when used in clustering. For our dataset, we conducted a user study in which multiple users answered the same set of questions. This data, grouped by question, was used as our gold standard for evaluating the clusters produced by the clustering algorithms. We found that the observed difference between the two classes of distance measures affected the quality of the clusterings, as expected. We also found that one of the two distance measures that worked well to differentiate sessions worked significantly better than the other when clustering. Finally, we discuss why some distance measures performed better than others in the two parts of our work.
ISSN:	1432-5012 1432-1300
DOI:	10.1007/s00799-014-0117-z
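
Illustrative note: the summary describes checking whether a distance measure separates sessions that answer the same question from sessions that answer different questions. The record does not name the five measures that were evaluated, so the sketch below assumes Jaccard distance over the set of pages visited, purely for illustration. The function names, page identifiers, and question labels are all hypothetical and are not taken from the article.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b|; two empty sessions count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mean_distances(sessions):
    """Compare every pair of sessions.

    sessions: list of (question_id, set_of_page_ids) tuples.
    Returns (mean same-question distance, mean different-question distance).
    Assumes at least one pair of each kind exists.
    """
    same, diff = [], []
    for (q1, s1), (q2, s2) in combinations(sessions, 2):
        bucket = same if q1 == q2 else diff
        bucket.append(jaccard_distance(s1, s2))
    return sum(same) / len(same), sum(diff) / len(diff)


# Hypothetical toy sessions, grouped by the question each user was answering.
sessions = [
    ("q1", {"p1", "p2", "p3"}),
    ("q1", {"p1", "p3", "p4"}),
    ("q2", {"p1", "p5", "p6"}),
    ("q2", {"p5", "p6", "p7"}),
]

same_mean, diff_mean = mean_distances(sessions)
print(f"same-question mean distance:      {same_mean:.2f}")
print(f"different-question mean distance: {diff_mean:.2f}")
```

Under this assumption, a measure would be judged suitable when the same-question mean is clearly lower than the different-question mean. A real evaluation would need far more sessions and a significance test rather than a comparison of two averages.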