Understanding OpenAI Five

In this blog (my first blog on Medium o_o), I will explain the challenges in making a Dota bot, appreciate how OpenAI addressed these challenges, and lay out the bot's fundamental flaws. I hope that by explaining how it works in layman's terms, we can watch its upcoming matches with the right mindset.

The OpenAI Five Problem

Beating humans at their own game is powerful as it creates a tangible spectacle and H Y P E. Like how AlphaGo pwnt Go, OpenAI would like to pwn Dota. Anyone would be very proud of an achievement of the form: we are the first ___ that beat humans in ___ .

The key is to evaluate what problem OpenAI actually managed to solve, rather than be misled into thinking it solved a more challenging one. From reading their blog, OpenAI is indeed very careful not to overstate their achievements. Yet, they have every reason to hope the public hypes those achievements out of proportion.

Learning Is Better Than Programming

Simply put, an AI agent is playing Dota well if it acts rationally for every game-state it encounters. As humans, we understand intuitively that taking last-hits is in general a good action, but taking last-hits while a fight is happening is generally a bad one. However, transferring this intuition to an AI agent has been notoriously challenging since the dawn of AI.


The forefathers of AI erroneously equated AI with programming. “We’ll just hard-code every single behaviour of the agent”, they thought. The result was mammoth efforts trying to program every possible interaction the agent might encounter. However, there were always some interactions that went unanticipated, and these hard-coded agents failed in spectacular fashion.
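To make this concrete, here is a toy sketch of what the hard-coding approach looks like; the game states and the rules are made up for illustration, and the punchline is that any situation the programmer never anticipated falls through the cracks.

```python
# A toy, hand-coded "bot": every behaviour is an explicit rule.
# The state fields and rules here are made up for illustration.

def hard_coded_policy(state):
    if state["enemy_nearby"] and state["hp"] < 200:
        return "retreat"
    if state["creep_low_hp"]:
        return "last_hit"
    if state["tower_in_range"] and not state["enemy_nearby"]:
        return "push_tower"
    # ...hundreds more rules would go here...
    return "idle"  # any state the programmer never anticipated falls through to this

print(hard_coded_policy({"enemy_nearby": False, "hp": 600,
                         "creep_low_hp": True, "tower_in_range": False}))
# -> "last_hit"
```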

Rather than explicitly programming the agent’s behaviours, Reinforcement Learning (RL), a sub-field of AI, opts for a different approach: let the AI interact with the game environment on its own and learn which actions are best. In recent years, RL has shown indisputable results, such as defeating the best Go player and besting a wide range of Atari games. The roots of RL stem from behaviourist psychology, which states that all behaviours can be encouraged or discouraged with the proper stimulus (reward / punishment). Indeed, you can teach pigeons how to play ping pong using RL.
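In code, the RL recipe boils down to a simple loop: the agent acts, the environment responds with a new state and a reward, and the agent updates itself from that experience. Below is a minimal sketch of that loop; the toy environment and agent are placeholders of my own, not anything resembling OpenAI's actual training setup.

```python
import random

# Minimal agent-environment interaction loop; the classes are trivial stand-ins.

class ToyEnv:
    """A stand-in environment: 10 steps per episode, random win/loss reward at the end."""
    def reset(self):
        self.t = 0
        return 0                                    # initial "state"
    def step(self, action):
        self.t += 1
        done = self.t >= 10
        reward = random.choice([1.0, -1.0]) if done else 0.0
        return self.t, reward, done                 # next state, reward, episode over?

class ToyAgent:
    def act(self, state):
        return random.randint(0, 3)                 # pick one of 4 dummy actions
    def learn(self, state, action, reward, next_state):
        pass                                        # a real agent would update its policy here

env, agent = ToyEnv(), ToyAgent()
for episode in range(3):
    state, done = env.reset(), False
    while not done:
        action = agent.act(state)
        next_state, reward, done = env.step(action)  # the environment provides the stimulus
        agent.learn(state, action, reward, next_state)
        state = next_state
```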

Challenges of RL

Applying RL to Dota, however, comes with some considerable challenges:

Long horizon — A key challenge in RL is that you often only obtain a reward signal after executing a long and complex sequence of actions. In Dota, you need to last hit, use the gold to buy items, pwn scrubs, and make a push before finally destroying the ancient, thereby obtaining a reward from winning. However, at the start the agent knows nothing about Dota, and acts at random. The chance of it randomly winning the game is practically zero. As a result, the agent never observes any positive reinforcement and learns nothing.

In Dota, the game horizon is long. The chance of winning by acting randomly is infinitesimally small
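A back-of-the-envelope calculation shows just how small. The numbers below are made up for illustration, but the point stands: a random policy essentially never stumbles into the winning reward.

```python
# Toy illustration of the long-horizon problem (numbers are illustrative, not Dota's).
# Suppose a game lasts ~20,000 decisions and a random action happens to be "good enough"
# 1 time in 2 at each decision. The chance of randomly stringing together a win:

p_good_action = 0.5
decisions_per_game = 20_000
p_random_win = p_good_action ** decisions_per_game
print(p_random_win)   # underflows to 0.0 in floating point -- effectively never
```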

Credit assignment — Even when the ancient is destroyed, which actions are actually responsible for it? Was it hitting the tower, or using a truckload of mangoes while at full mana? Judging which specific actions (out of a long sequence of executed actions) are responsible for your victory is the credit assignment problem. Without any prior knowledge, your best bet is a uniform assignment scheme: credit all actions that resulted in a victory, and discredit all actions that resulted in a defeat, hoping that the right actions are credited more often on average. This approach works on short games like Pong and 1v1 Dota, and is the optimal approach if you can afford the computation and the patience. Indeed, AlphaGo Zero was trained entirely in this fashion, with only a +1 and -1 reward signal for winning and losing the game. For Dota, though, there are simply too many actions to account for, and OpenAI decided it is best to coach our pigeon more directly.

Which of the actions the agent took actually contributed to winning the game? Without any game knowledge, the best one can do is to credit all the actions evenly. Here, the agent will erroneously associate both last-hitting and dying with winning the game.
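Here is what the uniform scheme looks like in code: every action taken in a game receives the same final score, +1 for a win and -1 for a loss, and only averaging over many games separates the genuinely useful actions from the incidental ones. The action names are, of course, made up.

```python
# Uniform credit assignment: every action in a trajectory shares the final outcome.
# Action names are illustrative.

trajectory = ["last_hit", "buy_item", "die_to_gank", "hit_tower", "destroy_ancient"]
game_won = True

credit = {}
for action in trajectory:
    credit[action] = credit.get(action, 0.0) + (1.0 if game_won else -1.0)

print(credit)
# Every action, including "die_to_gank", is credited +1.0 for this win.
# Only after averaging over many games does dying end up discredited on balance.
```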

The OpenAI Solution — Reward Shaping

One pragmatic way of addressing the challenges of long horizons and credit assignment is reward shaping, where one breaks down the eventual reward into small pieces that directly encourage the right behaviours at each step. The best way of explaining reward shaping is by watching this pigeon training video. The pigeon would never spontaneously spin around, but by rewarding each small step of a turn, the trainer slowly coaxes the pigeon into the correct behaviour. In OpenAI Five, rather than learning that a last-hit is indirectly responsible for the ultimate victory, a last-hit is directly rewarded with a score of 0.16, whereas dying is punished with a score of -1. The agent immediately learns that last-hitting is good and dying is bad, irrespective of the ultimate outcome of the game. Here is the full list of shaped reward values.

By associating short-term goals like last-hitting and dying with immediate reward and punishment, our pigeon can be coaxed into the right behaviour
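As a sketch, reward shaping simply means adding a hand-designed bonus or penalty on top of the sparse win/loss signal every time a notable event happens. The last-hit and death values below are the ones quoted above; treating the shaped reward as a plain per-event lookup is my simplification, not OpenAI's exact formulation.

```python
# Shaped reward as a per-event lookup, added on top of the sparse win/loss signal.
# Only the last-hit and death values come from the published list; the rest is a sketch.

SHAPED_REWARD = {
    "last_hit": 0.16,
    "death": -1.0,
}

def step_reward(events, won=None):
    r = sum(SHAPED_REWARD.get(e, 0.0) for e in events)  # immediate, dense feedback
    if won is not None:                                  # sparse end-of-game signal
        r += 1.0 if won else -1.0
    return r

print(step_reward(["last_hit", "last_hit"]))   # ~0.32 -- rewarded immediately
print(step_reward(["death"]))                  # -1.0  -- punished immediately
```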

The challenge of reward-shaping is that only certain behaviours can be effectively shaped. Killing and last-hitting have immediate benefits, so intuitive scores can be assigned to each when these events occur. However, compared to last-hitting, the value of ward and smoke usage is far more nebulous, something even the OpenAI researchers do not have a good answer for.

The OpenAI Muscle — Self-Play

Everyone has catch phrases. My adviser tends to say “sounds like a plan!” at the end of our meetings. OpenAI too has catch phrases, and one of them is this: “How can we frame a problem in such a way that, by simply throwing more and more computers at it, the solution gets better and better?” One of the answers they have settled on is self-play; you can watch an explanation by Ilya Sutskever here. The two take-aways from the talk are that self-play turns compute into data, and that self-play induces the right curriculum.

By playing against itself on the task of maximising short-term rewards, the pigeon learns how to last-hit and how not to die

Turning compute into data — With self-play, one can spawn thousands of copies of the game environment and dynamically generate training data by interacting with them. Self-play is bound purely by the amount of compute one can muster. And if compute is muscle, OpenAI is on steroids.
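The idea can be sketched in a few lines: spin up many copies of the game in parallel, have each one generate experience, and pool the results. The snippet below uses Python's multiprocessing purely as an illustration; OpenAI's actual rollout infrastructure spans a huge fleet of machines and looks nothing like this.

```python
import random
from multiprocessing import Pool

# Toy sketch of "compute -> data": each worker simulates games and returns experience;
# more workers means more training data per wall-clock second.

def play_one_game(seed):
    rng = random.Random(seed)
    # Pretend each game yields a list of (step, action, reward) tuples.
    return [(step, rng.randint(0, 3), rng.choice([0.0, 0.16, -1.0]))
            for step in range(100)]

if __name__ == "__main__":
    with Pool(processes=8) as pool:            # 8 copies of the "game" running at once
        batches = pool.map(play_one_game, range(8))
    experience = [transition for batch in batches for transition in batch]
    print(len(experience), "transitions generated this iteration")
```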

Inducing the right curriculum — A baby in a college-level class will learn nothing. Training an agent is often no different: it is easier to first let the agent accomplish a set of simple tasks, then gradually increase the complexity (a curriculum) until it finally masters the complex ones. Competitive self-play naturally induces a curriculum of increasing difficulty: in the beginning, the task of beating yourself is easy, as you are bad at the game. But as you get better, it gets harder and harder to beat yourself.
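Schematically, self-play just means the opponent is a snapshot of the learner itself, so the opposition gets stronger exactly as fast as the learner does. The sketch below, with a made-up "skill" number standing in for an actual policy, is only meant to convey that feedback loop; keeping a pool of past selves as occasional opponents is a common stabilising trick, not necessarily OpenAI Five's exact recipe.

```python
import copy, random

# Schematic self-play: the opponent is a copy (snapshot) of the learner itself,
# so the difficulty of "beating your opponent" rises exactly as fast as you improve.

class Policy:
    def __init__(self):
        self.skill = 0.0
    def improve(self):
        self.skill += random.uniform(0.0, 1.0)    # stand-in for a gradient update

def play_match(a, b):
    # True if a wins; noisy so weaker policies still win sometimes
    return a.skill + random.gauss(0, 1) > b.skill + random.gauss(0, 1)

learner = Policy()
opponents = [copy.deepcopy(learner)]              # pool of past selves

for iteration in range(1000):
    opponent = random.choice(opponents)           # sometimes an older self, sometimes recent
    learner_won = play_match(learner, opponent)
    learner.improve()                             # learn from the game (stand-in)
    if iteration % 100 == 0:
        opponents.append(copy.deepcopy(learner))  # periodically snapshot the current self

print(f"final skill: {learner.skill:.1f}, opponent pool size: {len(opponents)}")
```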

Since June 9th, at 180 years of Dota games played per day, OpenAI has played 10000 years of Dota using self-play, which is longer than the existence of human civilisation. Just let that sink in for a second.

The Deceit of Breadcrumbs

To recap, OpenAI carefully constructed a trail of breadcrumbs of short-term rewards that the pigeon, evolved through 10000 years of Dota playing, is an expert at obtaining: Pigeon sees creep, nom nom; Pigeon sees you, kills you, nom nom; Pigeon is at your base after killing you, sees your buildings, nom nom, and you lose the game. — TL;DR of the OpenAI bot

Since the OpenAI agents are trained to maximise short-term rewards, the concept of winning is effectively under a smoke, and won’t become visible until the AI is sufficiently close to it. This makes the agent oblivious to long-term strategic maneuvers such as forming a push around an important item timing.

The pigeon, occupied with maximising short-term rewards, does not see the victory far in the distance
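One way to see why: RL agents typically weigh a reward that arrives t steps in the future by a discount factor gamma raised to the power t. With illustrative numbers (neither the gamma nor the step count below is OpenAI's actual setting), a win tens of thousands of actions away is worth essentially nothing next to a last-hit right now.

```python
# Why a far-off win is "under smoke": future rewards are discounted by gamma**t.
# gamma and the step counts below are illustrative, not OpenAI's settings.

gamma = 0.999
last_hit_now = 0.16 * gamma**1            # a last-hit on the very next action
win_much_later = 1.0 * gamma**20_000      # the +1 for winning, ~20,000 actions away

print(f"last-hit now:      {last_hit_now:.6f}")     # ~0.16
print(f"win 20k steps out: {win_much_later:.12f}")  # ~2e-9 -- effectively invisible
```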

The Dilemma Of RL

鱼与熊掌不可兼得 / You can’t have your cake and eat it too

If you read all the way to here, thank you so much, and give me a high-five! yeah!
