Thesis of Victor Gabillon

Budgeted Classification-based Policy Iteration

This dissertation is motivated by the study of a class of reinforcement learning (RL) algorithms called classification-based policy iteration (CBPI). Unlike standard RL methods, CBPI algorithms do not use an explicit representation of the value function. Instead, they use rollouts to estimate the action-value function of the current policy at a collection of states. From these rollout estimates, a training set is built and the greedy policy is learned as the output of a classifier. Thus, the policy generated at each iteration of the algorithm is no longer defined by an (approximate) value function, but by a classifier. In this thesis, we propose new algorithms that improve the performance of the existing CBPI methods, especially when they have a fixed budget of interaction with the environment. Our improvements address the following two shortcomings of the existing CBPI algorithms: 1) the rollouts used to estimate the action-value function must be truncated and their number is limited, so we face a bias-variance tradeoff in estimating the rollouts; and 2) the rollouts are allocated uniformly over the states in the rollout set and the available actions, whereas a smarter allocation strategy could yield a more accurate training set for the classifier. We propose CBPI algorithms that address these issues, respectively, by: 1) using a value function approximation to improve the accuracy (balancing the bias and variance) of the rollout estimates, and 2) adaptively sampling the rollouts over the state-action pairs. A minimal sketch of the generic CBPI loop is given below.
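The sketch below illustrates the generic CBPI loop described above, not the specific algorithms proposed in the thesis. The toy ChainMDP environment, the rollout budget, horizon, discount factor, and decision-tree classifier are all illustrative assumptions. The optional v_end bootstrap in rollout only gestures at the first contribution (adding a value-function estimate at the truncation point), and the uniform per-action rollout allocation is exactly the baseline that the second contribution replaces with adaptive sampling.

```python
"""Minimal CBPI sketch: rollouts estimate Q-values at sampled states,
and a classifier learns the greedy action at each of those states."""
import numpy as np
from sklearn.tree import DecisionTreeClassifier


class ChainMDP:
    """Toy chain MDP: states 0..n-1, actions 0 (left) / 1 (right),
    reward 1 for reaching (or staying at) the rightmost state."""
    def __init__(self, n=10):
        self.n = n

    def step(self, s, a):
        s2 = min(s + 1, self.n - 1) if a == 1 else max(s - 1, 0)
        return s2, (1.0 if s2 == self.n - 1 else 0.0)


def rollout(mdp, s, a, policy, horizon, gamma, v_end=None):
    """One truncated rollout estimating Q^pi(s, a).

    If v_end is given, its estimate at the truncation state is added back,
    which gestures at the bias-variance fix of the thesis's first
    contribution (here v_end is a hypothetical callable, not implemented)."""
    ret, disc = 0.0, 1.0
    for _ in range(horizon):
        s, r = mdp.step(s, a)
        ret += disc * r
        disc *= gamma
        a = policy(s)
    if v_end is not None:
        ret += disc * v_end(s)  # bootstrap the tail beyond the horizon
    return ret


def cbpi(mdp, n_iters=5, n_states=20, n_rollouts=5, horizon=15, gamma=0.95):
    policy = lambda s: np.random.randint(2)  # start from a random policy
    for _ in range(n_iters):
        rollout_set = np.random.randint(mdp.n, size=n_states)
        X, y = [], []
        for s in rollout_set:
            # Uniform allocation: the same rollout budget for every action.
            q = [np.mean([rollout(mdp, s, a, policy, horizon, gamma)
                          for _ in range(n_rollouts)])
                 for a in (0, 1)]
            X.append([s])
            y.append(int(np.argmax(q)))  # greedy action = classification label
        clf = DecisionTreeClassifier().fit(X, y)
        policy = lambda s, clf=clf: int(clf.predict([[s]])[0])
    return policy


greedy = cbpi(ChainMDP())
print([greedy(s) for s in range(10)])  # mostly action 1 (move right)
```

After a few iterations, the learned classifier should output the "move right" action over most of the chain, since that is the greedy choice under any reasonable Q-estimate in this toy problem.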

Jury

Thesis Advisor: Mr. Mohammad GHAVAMZADEH, Inria Lille & Adobe
Thesis Advisor: Mr. Philippe PREUX, Université Lille 3
Reviewer (Rapporteur): Mr. Peter AUER, University of Leoben
Reviewer (Rapporteur): Mr. Shie MANNOR, Technion
Member: Mr. Olivier CAPPÉ, Télécom ParisTech
Member: Mr. Csaba SZEPESVÁRI, University of Alberta

Thesis defended on 12/06/2014.