Content-Based Methods for Predicting Web-Site Demographic Attributes

Demographic information plays an important role in gaining valuable insights about a web-site's user-base and is used extensively to target online advertisements and promotions. This paper investigates machine-learning approaches for predicting the demographic attributes of web-sites using info...

Full description

Saved in:
Bibliographic Details
Published in2010 IEEE International Conference on Data Mining pp. 863 - 868
Main Authors Kabbur, Santosh, Eui-Hong Han, Karypis, George
Format Conference Proceeding
LanguageEnglish
Published IEEE 01.12.2010
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:Demographic information plays an important role in gaining valuable insights about a web-site's user-base and is used extensively to target online advertisements and promotions. This paper investigates machine-learning approaches for predicting the demographic attributes of web-sites using information derived from their content and their hyper linked structure and not relying on any information directly or indirectly obtained from the web-site's users. Such methods are important because users are becoming increasingly more concerned about sharing their personal and behavioral information on the Internet. Regression-based approaches are developed and studied for predicting demographic attributes that utilize different content-derived features, different ways of building the prediction models, and different ways of aggregating web-page level predictions that take into account the web's hyper linked structure. In addition, a matrix-approximation based approach is developed for coupling the predictions of individual regression models into a model designed to predict the probability mass function of the attribute. Extensive experiments show that these methods are able to achieve an RMSE of 8-10% and provide insights on how to best train and apply such models.
ISBN:9781424491315
1424491312
ISSN:1550-4786
2374-8486
DOI:10.1109/ICDM.2010.97