Reinforcement learning for dummies like me

Since we last talked, I went deeper into neural networks. Among other things, I worked through OpenAI's Spinning Up in Deep RL. If you want to learn deep RL, I highly recommend it. Personally, I had trouble with the gap between "here's the basic idea of reinforcement learning" and "here's the math behind RL." I'm going to try to bridge that gap here, both to make sure I understand it myself and to hopefully help someone else with it.

I try to use the words I learned in Spinning Up to describe these concepts so they can be more easily mapped to the real math and code.

Keep in mind - I don't have a deep background in RL at all, so this might all be wrong.

The real, real basics of reinforcement learning

At an intuitive level, it's pretty easy to understand what RL is. It's like Pavlov's dogs - you give the computer a treat, and it presses the lever more. You scold the computer, and it stops. (But because computers are dumb, you don't even have to give it a treat - just give it a cool positive reward value and it'll be satisfied.) In RL, the thing you're training is called an agent.

A lot of RL research is done in video games, because they're controlled environments that humans can easily observe to see how the RL model is doing. But RL models are also used in robotics and, famously, in DeepMind's Go-playing AlphaZero.

So how do you do this with a neural network? For the uninitiated: neural networks are a bunch of math equations that take some data as input and try to make a prediction. They learn by trying to make a prediction, then seeing how far off their prediction is from the correct answer, then tweaking all the numbers in the math to get a little closer to the right answer.
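
To make that concrete, here's a tiny sketch of that predict-measure-tweak loop (my own toy example, not from Spinning Up), fitting a single number to noisy data with NumPy:

```python
# Toy sketch of supervised learning: predict, measure the error, tweak the number.
import numpy as np

rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, size=100)
ys = 3.0 * xs + rng.normal(0, 0.1, size=100)   # the "right answers" we want to learn

w = 0.0      # the single number ("weight") the model gets to tweak
lr = 0.1     # how big each tweak is

for step in range(200):
    preds = w * xs                       # make a prediction
    error = preds - ys                   # see how far off it is
    grad = 2 * np.mean(error * xs)       # which way to nudge w to shrink the error
    w -= lr * grad                       # nudge it a little

print(f"learned w = {w:.2f} (the true value is 3.0)")
```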

In reinforcement learning you can't do this. The "right answer" you're trying to train the agent to learn is a series of actions to get the reward. For example, Pavlov's dogs need to look around, find the lever, walk over to it, and press it. Usually, the steps you're learning with RL are too complicated to specify by hand. If the behavior were simple enough to specify, then you could simply program a solution by hand and you wouldn't need reinforcement learning.

Since the right answer is hard to specify explicitly, reinforcement learning uses its experience finding the reward as an example to learn from. Often, it begins by exploring randomly, then it settles down on the best strategies it's found and tries them over and over again.
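
One classic version of "explore randomly, then settle on what's worked" is epsilon-greedy action selection. Here's a toy sketch of it on a three-lever slot machine; the reward probabilities and the decay schedule are made up for illustration:

```python
# Epsilon-greedy on a three-lever bandit: explore at random early on,
# then mostly exploit the lever that has paid off best so far.
import random

true_reward_prob = [0.2, 0.5, 0.8]   # lever 2 is secretly the best
estimates = [0.0, 0.0, 0.0]          # the agent's running guess for each lever
counts = [0, 0, 0]
epsilon = 1.0                        # start fully random...

for step in range(2000):
    if random.random() < epsilon:
        action = random.randrange(3)                  # explore: any lever
    else:
        action = estimates.index(max(estimates))      # exploit: best lever so far

    reward = 1.0 if random.random() < true_reward_prob[action] else 0.0
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]
    epsilon = max(0.05, epsilon * 0.995)              # ...then gradually settle down

print("estimated value of each lever:", [round(e, 2) for e in estimates])
```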

Step Up 2: The Streets

The second episode in this agent's attempt to learn to dance

Let's go a bit deeper into how the agent learns to hit the lever. RL agents usually train in a series of "episodes" or "trajectories." You can think of these like lives in a video game. The agent plays the game and then checks its score at the end before clicking "try again."

They also act in a series of steps. They look at the world (called an observation or a state), decide on an action, and then find out what the new state is and what reward they got. This is because the neural network that chooses the agent's actions needs to have a clear input and a clear output. The simplest way to do this is for it to take in the current state - say, the image on the screen of the video game - and put out an action - say, to turn left.
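
Here's what that observe-act-reward loop looks like in code, using the Gymnasium library's CartPole environment (this assumes `pip install gymnasium`; the "policy" here is just random actions, which a real agent would replace with a neural network):

```python
# The observe -> act -> reward loop, using Gymnasium's CartPole environment.
import gymnasium as gym

env = gym.make("CartPole-v1")

for episode in range(3):                      # three "lives"
    obs, info = env.reset()                   # look at the world
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()    # decide on an action (randomly, for now)
        obs, reward, terminated, truncated, info = env.step(action)  # new state + reward
        total_reward += reward
        done = terminated or truncated
    print(f"episode {episode}: score = {total_reward}")  # check the score, then try again
```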

Together, these simplifications help the agent learn more complex behaviors. This is because the reward might happen way after the actions required to get the reward. If the agent only knows how to press a lever when it's next to the lever, it'll never learn to walk over. So the agent should learn to walk toward the lever, even though that action doesn't directly create a reward. The way it does this is by keeping track of the whole history of steps that happened before a reward and doing more of anything on the path that got it to the reward.
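
A common way to do that bookkeeping is to give every step credit for all the reward that came after it, discounted a little for each step of delay. Here's a small sketch with made-up rewards, where the treat only arrives at the very last step:

```python
# Give every step credit for the (discounted) reward that came after it.
def discounted_returns(rewards, gamma=0.99):
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # reward now + discounted reward later
        returns[t] = running
    return returns

rewards = [0, 0, 0, 0, 1]    # the treat only arrives at the last step (the lever press)
print([round(r, 3) for r in discounted_returns(rewards)])
# [0.961, 0.97, 0.98, 0.99, 1.0] -- the earlier "walk toward the lever" steps still get credit
```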

Let's pair up

Hopefully your critic is honest with you about the value function

Unlike classic NLP or CV problems, reinforcement learning actually creates its own training data by exploring. As far as I can tell, this makes it a lot more unstable to train. If the programmer messes up the exploration, the agent may never learn the desired behavior. It also means that RL agents have to figure out which of the actions they took actually led to a reward and which ones were superfluous. Did the dog receive a treat because it pressed a lever, or just because it's a very good boy?

The solution used by the RL algorithms in Spinning Up involves learning a second neural network that tries to predict the expected reward in any given scenario. This reward-predicting network is called a "critic" and the one that chooses what to do is called an "actor." The critic can be trained in a traditional NN style - it looks at all the things that have ever happened in the agent's experience and the rewards the agent received, and uses this to predict future rewards. The critic can be used in two ways:

  1. "On-policy:" The critic tells the actor how well other actions it could have done would have been - so the actor can see, basically, how much better it's doing than the average outcome for this situation (which is called the "advantage"). Then, the actor's neural network can learn to adjust its weights to do more of the thing that gave the reward, or less of the thing that didn't. Both the actor and the critic learn at the same time, so as the critic gets better at understanding where the rewards are, the actor gets better at finding them.
  2. "Off-policy": The actor learns from replaying past actions against the critic, who will tell the actor whether we've learned something new that tells us if it was actually a good action. For example, maybe there's a shorter path to the lever than the one the agent took the first time it found the lever. The actor might replay a past simulated action along that path that wasn't successful, and the critic might say "oh actually that looks promising," so the actor can try it out later in the real world.

Can you explain on-policy vs off-policy one more time for me?

I found these confusing to keep track of because I didn't find their names intuitive, so I'm going to dive deeper into this.

There are a number of algorithms of each type, although they generally behave in the same way.

On-policy means that the actor learns a behavior, tests it out for a while, and then both the actor and the critic get updated based on that fresh experience. It's called "on-policy" because the agent only learns from the actions it actually takes in the "real" world, and the agent's way of choosing actions is called the "policy" in RL.

Off-policy means that, after it's learned for a while, the actor replays its past experience, but with new expected rewards based on what the critic has learned since. It tries these old actions against the critic, and the critic may have decided they're better or worse than it originally thought (because it now knows about new actions that can be taken after that initial action). It's called "off-policy" because the agent learns from actions that its current policy wouldn't necessarily pick on its own.
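
In terms of how the data gets handled, the split looks roughly like this (a sketch of my own; the function names are made up, and `collect_batch`/`update` stand in for whatever your algorithm actually does):

```python
# Rough sketch of how the two styles handle their experience data (names are made up).
from collections import deque
import random

def on_policy_update(collect_batch, update):
    """Collect a fresh batch with the *current* actor, learn from it once, discard it."""
    batch = collect_batch()
    update(batch)
    # the batch is thrown away; the next update needs fresh experience

replay_buffer = deque(maxlen=100_000)   # off-policy: remember (almost) everything

def off_policy_update(new_transitions, update, batch_size=64):
    """Stash new experience, then re-learn from a random sample of old experience."""
    replay_buffer.extend(new_transitions)
    if len(replay_buffer) >= batch_size:
        batch = random.sample(replay_buffer, batch_size)   # replay old actions...
        update(batch)   # ...and let today's critic re-judge how good they were
```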

What difference does this make, practically? An on-policy approach is not going to mistakenly guess that an action leads to a reward because the critic said it would - it only learns when it actually gets the reward in the "real" world. But that might take a long time. Also, since an on-policy approach only tries things that the agent has already learned to do, it might ignore new paths that the critic thinks are promising and that it ought to try out. This can lead to better outcomes for off-policy learning.

You don't have to take my word for it

Find this useful or interesting? Don't thank me - thank Joshua Achiam at OpenAI, who put Spinning Up together! And thank Brian Christian, whose book The Alignment Problem has an explanation of RL that was really helpful for getting context and informing my understanding here. A lot of what I say here is just a restatement of their work. (Also, please fact check my work by going straight to the source.)