I will be going to the NIPS 2011 workshops next month in Granada, to present my PhD work!
In bandit problems with many arms for which feature representations are given, we seek to learn a mapping from arm feature vectors to mean reward values. The LinRel algorithm, for instance, looks for linear mappings, and has been extended to perform kernel ridge regression. Alternatively, Gaussian Processes can be used to model a prior belief on the smoothness of the mean-reward function f: playing similar arms is likely to yield similar rewards. The setting resembles that of noisy GP optimization, where f is the target function and the variability of the rewards is seen as noise.
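As a hedged illustration of the GP view of this setting, the sketch below computes the posterior mean and variance of f at every arm, given the arms played so far and their observed rewards. The RBF kernel, unit prior variance, and the noise level are all assumptions for the example, not details from the talk:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential covariance between rows of A and B (assumed kernel)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X_played, rewards, X_all, noise=0.1):
    """GP posterior mean/variance of the mean-reward function f at every arm,
    treating reward variability as Gaussian observation noise."""
    K = rbf_kernel(X_played, X_played) + noise**2 * np.eye(len(X_played))
    k_star = rbf_kernel(X_all, X_played)           # cross-covariances
    alpha = np.linalg.solve(K, np.asarray(rewards))
    mean = k_star @ alpha
    v = np.linalg.solve(K, k_star.T)
    var = 1.0 - np.sum(k_star * v.T, axis=1)       # RBF prior variance is 1
    return mean, var
```

After playing an arm, its posterior variance shrinks, while arms far away in feature space keep their prior uncertainty; this is the smoothness assumption at work.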
In this talk, I will first review the GP-UCB algorithm, which uses the principle of Optimism in the Face of Uncertainty to choose which arms to play (i.e. which samples to acquire, in the optimization setting). I will focus on the case where the number of arms is finite. We will then see how GPs can be used to model the smoothness of functions defined on tree paths, with covariances between paths based on the number of nodes they have in common, and how a tree search problem can thus be approached with a single bandit algorithm whose arms are tree paths. The resulting algorithm, GPTS, can be made tractable regardless of the tree's branching factor (which, together with the maximum depth, determines the number of arms in the bandit problem). Finally, I will discuss the advantages and shortcomings of GPTS compared with "many-bandits" approaches to tree search, such as UCT: GPTS deals better with large branching factors, but has a higher computational cost.
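To make the single-bandit view of tree search concrete, here is a minimal sketch combining the two ingredients above: a covariance between root-to-leaf paths based on their shared nodes (implemented here, as one plausible choice, as the normalised length of their common prefix), and a GP-UCB selection rule over all paths. The beta value, noise level, and normalisation are illustrative assumptions, not the exact choices made in GPTS:

```python
import itertools
import numpy as np

def all_paths(branching, depth):
    """Enumerate all root-to-leaf paths as tuples of child indices."""
    return list(itertools.product(range(branching), repeat=depth))

def path_kernel(p, q):
    """Covariance from the number of nodes two paths share: since paths
    diverge once, shared nodes form a common prefix (normalised here)."""
    common = 0
    for a, b in zip(p, q):
        if a != b:
            break
        common += 1
    return common / len(p)

def gp_ucb_step(K, played, rewards, beta=2.0, noise=0.1):
    """One GP-UCB round over the precomputed path-kernel matrix K:
    return the arm maximising posterior mean + sqrt(beta) * posterior std."""
    if not played:
        return 0                                    # no data yet: pick any arm
    Kpp = K[np.ix_(played, played)] + noise**2 * np.eye(len(played))
    Kxp = K[:, played]
    alpha = np.linalg.solve(Kpp, np.asarray(rewards))
    mean = Kxp @ alpha
    var = np.diag(K) - np.sum(Kxp * np.linalg.solve(Kpp, Kxp.T).T, axis=1)
    return int(np.argmax(mean + np.sqrt(beta * np.maximum(var, 0.0))))
```

Note that naively materialising K over every path is exponential in the depth; the point of the talk is precisely that GPTS avoids this and stays tractable for large branching factors.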