Dueling Bandits with Delayed Feedback

Authors

  • Jasmin Brandt
  • Björn Haddenhorst
  • Viktor Bengs
  • Eyke Hüllermeier

DOI:

https://doi.org/10.11576/dataninja-1163

Keywords:

Multi-Armed Bandits, Dueling Bandits, Online Learning

Abstract

Dueling Bandits is a well-studied extension of the Multi-Armed Bandits problem, in which the learner must select two arms in each time step and receives a binary feedback as an outcome of the chosen duel. However, all of the existing best arm identification algorithms for the Dueling Bandits setting assume that the feedback can be observed immediately after selecting the two arms. If this is not the case, the algorithms simply do nothing and wait until the feedback of the recent duel can be observed, which is a waste of runtime. We propose an algorithm that can already start a new duel even if the previous one is not finished and thus is much more time efficient. Our arm selection strategy balances the expected information gain of the chosen duel and the expected delay until we observe the feedback. By theoretically grounded confidence bounds we can ensure that the arms we discard are not the best arms with high probability.

Downloads

Published

2024-10-11