The multi-armed bandit (MAB) problem is a mathematical formulation of the exploration-exploitation trade-off in reinforcement learning, in which a learner chooses an arm from a set of available arms over a sequence of trials in order to maximise its cumulative reward. In the classical MAB problem, the learner receives absolute bandit feedback, i.e., it observes the reward of the arm it selects. In many practical situations, however, other kinds of feedback are more readily available. In this thesis, we study two such kinds of feedback, namely relative feedback and corrupt feedback.
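To make the feedback distinction concrete, the classical stochastic setting can be formalised as follows (a standard formulation; the notation $K$, $a_t$, $r_t$, $\mu_i$, $R_T$ is illustrative and not taken from the abstract). At each round $t = 1, \dots, T$, the learner selects an arm $a_t \in \{1, \dots, K\}$ and, under absolute bandit feedback, observes only the reward $r_t$ of that arm, drawn from a distribution with unknown mean $\mu_{a_t}$. Maximising reward is then equivalent to minimising the expected cumulative regret
\[
R_T = T\,\mu^* - \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\right],
\qquad \mu^* = \max_{1 \le i \le K} \mu_i ,
\]
where $\mu^*$ is the mean of the best arm. Under relative feedback, by contrast, the learner typically observes only the outcome of a comparison between chosen arms (as in dueling bandits) rather than any absolute reward.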
Thesis supervisor: Philippe Preux, Professor, Université de Lille, Villeneuve d'Ascq
Co-supervisor: Tanguy Urvoy, Researcher, Orange Labs, Lannion
Reviewers: Aurélien Garivier, Professor, Institut de Mathématiques de Toulouse, Université Paul Sabatier, Toulouse
           Maarten de Rijke, Professor, University of Amsterdam
Examiners: Alexandra Carpentier, Researcher, Institut für Mathematik, Universität Potsdam
           Richard Combes, CentraleSupélec, Saclay
           Emilie Kaufmann, Researcher, CNRS, CRIStAL
           Gabor Lugosi, Universitat Pompeu Fabra, Barcelona