Introduction to Thompson Sampling

10 May

Introduction to Thompson Sampling

Reinforcement Learning is a branch of Machine Learning, also called Online Learning. It is used to decide what action to take at t+1 based on data up to time t. This concept is used in Artificial Intelligence applications such as walking. A popular example of reinforcement learning is a chess engine. Here, the agent decides upon a series of moves depending on the state of the board (the environment), and the reward can be defined as win or lose at the end of the game.

Thompson Sampling (Posterior Sampling or Probability Matching) is an algorithm for choosing the actions that address the exploration-exploitation dilemma in multi-armed bandit problem. Actions are performed several times and are called exploration. It uses training information that evaluates the actions taken rather than instructs by giving correct actions. This is what creates the need for active exploration, for an explicit trial-and-error search for good behaviour. Based on the results of those actions, rewards (1) or penalties (0) are given for that action to the machine. Further actions are performed in order to maximize the reward that may improve future performance. Suppose a robot has to pick several cans and put in a container. Each time it puts the can to the container, it will memorize the steps followed and train itself to perform the task with better speed and precision (reward). If the Robot is not able to put the can in the container, it will not memorize that procedure (hence speed and performance will not improve) and will be considered as a penalty.

Thompson Sampling has an advantage of the tendency to decrease the search as we get more and more information, which mimics the desirable trade-off in the problem, where we want as much information as possible in fewer searches. Hence, this Algorithm has a tendency to be more “search-oriented” when we have fewer data and less “search-oriented” when we have a lot of data.

Multi-Armed Bandit Problem
Multi-armed Bandit is synonymous to a slot machine with many arms. Each action selection is like a play of one of the slot machine’s levers, and the rewards are the payoffs for hitting the jackpot. Through repeated action selections you are to maximize your winnings by concentrating your actions on the best levers. Each machine provides a different reward from a probability distribution over mean reward specific to the machine. Without knowing these probabilities, the gambler has to maximize the sum of reward earned through a sequence of arms pull. If you maintain estimates of the action values, then at any time step there is at least one action whose estimated value is greatest. We call this a greedy action. The analogy to this problem can be advertisement displayed whenever the user visits a webpage. Arms are ads displayed to the users each time they connect to a web page. Each time a user connects to the page makes around. At each round, we choose one ad to display to the user. At each round n, ad i gives reward ri(n) ε {0, 1}: ri(n)=1 if the user clicked on the ad i, 0 if the user didn’t. The goal of the algorithm will be to maximize the reward. Another analogy is that of a doctor choosing between experimental treatments for a series of seriously ill patients. Each action selection is a treatment selection, and each reward is the survival or well-being of the patient.

Algorithm

Some Practical Applications

Netflix Item based recommender systems: Images related to movies/shows are shown to users in such a way that they are more likely to watch it.
Bidding and Stock Exchange: Predicting Stocks based on Current data of stock prizes.
Traffic Light Control: Predicting the delay in signal.
Automation in Industries: Bots and Machines for transporting and Delivering items without human intervention.

Course Curriculum

Introduction to Thompson Sampling