Reinforcement Learning Techniques and Challenges

Action and Reward

The simplest form of reinforcement learning involves an agent taking an action and observing a reward or penalty based on that action. This type of problem can be addressed with what are called multi-armed bandit models. For example, choosing a restaurant is an action, and the rating based on the experience is the reward. That rating (along with previous scores) affects how likely the agent will choose that restaurant in the future.

Markov Decision Process Models

Markov decision process models address more sophisticated reinforcement learning problems. Markov decision process models involve sequential decision making under uncertainty. Given a state of the environment, the agent selects an action, collects a reward or penalty, and then transitions to a new state. The process then continues until the “game” ends. Card games such as Blackjack/21, playing a video game with joystick actions, and robots navigating mazes are all good examples of Markov decision processes. They have well-defined states, actions, rewards, and state transitions. The goal is to learn an effective policy that identifies the best action to take given a particular state.

Challenges and Alternatives

Solving a Markov decision process depends on what the agent knows about the reward and state transitions. Given a state and action, suppose the agent knows the probability distribution that defines the future reward and state transition. In that case, stochastic, dynamic programming methods can identify the optimal action for any particular state. Stochastic, dynamic programming problems can be solved exactly given the state and action spaces, along with the reward and state transition models.

If the agent lacks a reward or state transition model, then the learning must be done through repeated simulated play. Given a state and action, the agent observes a reward and observes the future state. Algorithms such as Q-learning use this information to update the value of the state-action combination. Eventually, the agent will learn effective actions to take given the current state, but this learning may take a long time.