Counterfactual contextual bandit for recommendation under delayed feedback

Bibliographic Details
Published in: Neural Computing & Applications, Vol. 36, No. 23, pp. 14599–14613
Main Authors: Cai, Ruichu; Lu, Ruming; Chen, Wei; Hao, Zhifeng
Format: Journal Article
Language: English
Published: London: Springer London (Springer Nature B.V.), 01.08.2024

Summary: Recommendation systems have far-reaching significance and great practical value, as they spare users the trouble of choosing from a huge amount of information. Existing recommendation systems usually suffer from selection bias because samples with delayed feedback are ignored. To alleviate this problem, we model recommendation as a batch contextual bandit problem and propose a counterfactual reward estimation approach. First, we formalize the counterfactual question as "would the user be interested in the recommended item if the delay ended before the collection time point?". This counterfactual reward is estimated in a survival analysis framework by fully exploiting the causal generation process of user feedback on batch data. Second, based on the estimated counterfactual rewards, the batch contextual bandit policy is updated for online recommendation in the next episode. Third, new batch data are generated during online recommendation for further counterfactual reward estimation. These three steps are iterated until the optimal policy is learned. We also prove a sub-linear regret bound for the learned bandit policy. In experiments on synthetic and Criteo datasets, our method achieved a 4% improvement in average reward over the baseline methods, demonstrating the efficacy of our approach.
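The loop described in the summary — impute counterfactual rewards for censored (not-yet-converted) samples, then update a batch contextual bandit — can be sketched roughly as follows. This is a minimal illustration under simplifying assumptions, not the paper's actual estimator: the survival model here is a plain exponential delay distribution (in the style of Chapelle's delayed-feedback model), and the policy is a disjoint LinUCB updated once per episode; `impute_reward`, `BatchLinUCB`, and all parameter names are hypothetical.

```python
import numpy as np

def impute_reward(observed, elapsed, p_convert, delay_rate):
    """Counterfactual reward: would feedback have arrived if the delay
    ended before the collection time point?  Converted samples keep
    reward 1; censored samples get the posterior conversion probability
    under an exponential delay (survival) model -- an illustrative
    assumption, not necessarily the paper's exact estimator."""
    if observed:
        return 1.0
    surv = np.exp(-delay_rate * elapsed)  # P(delay > elapsed), survival fn
    # Posterior P(will convert | no feedback yet after `elapsed`)
    return p_convert * surv / (1.0 - p_convert + p_convert * surv)

class BatchLinUCB:
    """Minimal disjoint LinUCB whose parameters are refreshed once per
    batch/episode, as in the three-step loop from the abstract."""
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # per-arm Gram matrix
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # per-arm reward sums

    def select(self, x):
        """Pick the arm with the highest upper confidence bound for context x."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update_batch(self, logs):
        """Episode update from (context, arm, imputed_reward) triples."""
        for x, a, r in logs:
            self.A[a] += np.outer(x, x)
            self.b[a] += r * x
```

Iterating `select` → log batch → `impute_reward` → `update_batch` mirrors the three steps in the abstract; the regret analysis in the paper applies to the authors' estimator, not to this simplified sketch.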
ISSN: 0941-0643; 1433-3058
DOI: 10.1007/s00521-024-09800-0