Dota-2: Dota-2 with large scale reinforcement learning
Dota-2 is a multiplayer real-time strategy (RTS) game played on a square map, with the two teams based in diagonally opposite corners. Each team has 5 players, and each player controls a hero with specific skills. Each team also has a set of creeps, which are not controllable but attack opponents automatically. Players earn gold by killing the opponent’s creeps and can then upgrade skills and items.
Main challenges of Dota-2 for RL
- Long time horizon: the game runs at 30 frames per second for about 45 minutes; with one action every 4 frames, that is roughly 20,000 steps per episode.
- Partially observed state: players only see the nearby environment
- High-dimensional state space: the state vector has roughly 16,000 values
- High-dimensional action space: the number of valid actions ranges from 8,000 to 80,000 at each step
- 4 frames per action: the policy acts on every fourth frame
- the RL policy returns a discrete action
- certain game mechanics are hand-scripted rather than controlled by the RL policy
- some properties of the environment were randomized to ensure sufficiently diverse training games for robustness.
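The step count quoted in the list above can be checked with a few lines of arithmetic. This is only a sanity check under the figures given in these notes (30 frames per second, a 45-minute game, one action every 4 frames); the function and constant names are illustrative.

```python
FRAMES_PER_SECOND = 30   # game tick rate from the notes
GAME_MINUTES = 45        # typical game length from the notes
FRAMES_PER_ACTION = 4    # the policy acts on every fourth frame


def steps_per_episode(fps=FRAMES_PER_SECOND,
                      minutes=GAME_MINUTES,
                      frames_per_action=FRAMES_PER_ACTION):
    """Return the number of policy steps in one game."""
    total_frames = fps * minutes * 60
    return total_frames // frames_per_action


if __name__ == "__main__":
    print(steps_per_episode())  # 20250, i.e. roughly 20,000 steps
```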
State space and encoding:
- instead of using pixels on the screen, we organize the information into a set of data arrays
- all float values and booleans are normalized before being fed into the neural network; we keep a running mean and standard deviation of all data ever observed
- after normalization by mean and std, values are clipped to [-5, 5]
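The normalization steps above can be sketched for a single scalar feature as follows. This is a minimal illustration, not the original implementation: it assumes Welford's online algorithm for the running statistics, and the class name `RunningNormalizer` and the `eps` term are hypothetical choices made for this sketch.

```python
import math


class RunningNormalizer:
    """Tracks a running mean/std of one feature and normalizes new values."""

    def __init__(self, clip=5.0, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0      # running sum of squared deviations (Welford)
        self.clip = clip   # clip range [-clip, clip], 5 in the notes
        self.eps = eps     # assumption: avoids division by zero

    def update(self, x):
        """Fold one observed value into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        """Population standard deviation of everything seen so far."""
        if self.count < 2:
            return 1.0
        return math.sqrt(self.m2 / self.count)

    def normalize(self, x):
        """Standardize x, then clip it to [-clip, clip]."""
        z = (x - self.mean) / (self.std + self.eps)
        return max(-self.clip, min(self.clip, z))


if __name__ == "__main__":
    norm = RunningNormalizer()
    for value in [1.0, 2.0, 3.0, 4.0, 5.0]:
        norm.update(value)
    print(norm.normalize(3.0))     # value at the mean maps to 0.0
    print(norm.normalize(1000.0))  # extreme value is clipped to 5.0
```

Clipping after standardization keeps rare extreme values from dominating the network input.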
Action space:
- a primary action (30): noop, move, attack, activate spell, activate item, etc.
- a set of action parameters: delay (4), unit selection (189), offset (81)
- an action mask filters the valid actions at each step
- factored action space: 30x4x189x81=1,837,080
- Some actions are scripted. Following engineering practice, we start from a small set of actions for RL and increase it gradually.
- scripted actions: ability builds, item purchasing, item swap and courier control
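The factored action space and the action mask described above can be sketched as follows. The head sizes come from these notes; the dictionary keys and the helper names `joint_action_count` and `masked_sample` are hypothetical, and the sampling routine is one common way to apply a mask (softmax restricted to valid entries), not necessarily what the original system does.

```python
import math
import random

# Head sizes from the notes: primary action and its three parameters.
HEAD_SIZES = {"primary": 30, "delay": 4, "unit": 189, "offset": 81}


def joint_action_count(head_sizes):
    """Number of distinct joint actions if every head is chosen independently."""
    return math.prod(head_sizes.values())


def masked_sample(logits, mask, rng):
    """Sample an index from softmax(logits), restricted to entries where mask is True."""
    valid = [i for i, allowed in enumerate(mask) if allowed]
    if not valid:
        raise ValueError("action mask excludes every action")
    top = max(logits[i] for i in valid)  # subtract the max for numerical stability
    weights = [math.exp(logits[i] - top) for i in valid]
    return rng.choices(valid, weights=weights, k=1)[0]


if __name__ == "__main__":
    print(joint_action_count(HEAD_SIZES))  # 1837080
    print(sum(HEAD_SIZES.values()))        # 304 logits instead of 1.8M

    rng = random.Random(0)
    logits = [0.2, 1.5, -0.3, 0.9]
    mask = [False, True, False, True]      # only actions 1 and 3 are valid
    print(masked_sample(logits, mask, rng))
```

Factoring means the network only has to emit 30 + 4 + 189 + 81 = 304 logits per step while still covering all 1,837,080 combinations.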
Rewards:
- win the game
- we also give rewards for a set of actions that human players consider good.
- all rewards are made zero-sum by subtracting from each hero’s reward the average of the opponents’ rewards
- game time weighting: rewards are much larger in magnitude in the later game phase, because heroes are more powerful than in the early phase. Without correction, the policy would focus on learning the later phase and ignore the earlier stages. To avoid this, the reward is normalized according to the time step.
Here tau is the team spirit parameter, which sets how much each hero shares in its teammates’ rewards. Although tau=1 (only the team result matters) is the ultimate goal, we find that a lower tau reduces gradient variance in early training, which gives clearer rewards for learning mechanical and tactical ability.
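The reward processing described above can be sketched as three small functions. This is a hedged illustration only: the blend `(1 - tau) * r_i + tau * mean(r)` is an assumed form for the tau weighting, the exponential time decay and its default values (`decay=0.6`, `period_minutes=10.0`) are assumptions not stated in these notes, and all function names are hypothetical.

```python
def blend_team_spirit(rewards, tau):
    """Mix each hero's own reward with the team average (assumed form of tau)."""
    team_mean = sum(rewards) / len(rewards)
    return [(1.0 - tau) * r + tau * team_mean for r in rewards]


def make_zero_sum(team_rewards, opponent_rewards):
    """Subtract the opponents' average reward from each hero's reward."""
    opponent_mean = sum(opponent_rewards) / len(opponent_rewards)
    return [r - opponent_mean for r in team_rewards]


def time_weight(reward, game_minutes, decay=0.6, period_minutes=10.0):
    """Shrink rewards later in the game (assumed exponential decay schedule)."""
    return reward * decay ** (game_minutes / period_minutes)


if __name__ == "__main__":
    radiant = [1.0, 2.0, 3.0, 0.0, 4.0]
    dire = [0.5, 0.5, 1.0, 1.0, 2.0]

    radiant_zs = make_zero_sum(radiant, dire)
    dire_zs = make_zero_sum(dire, radiant)
    print(sum(radiant_zs) + sum(dire_zs))    # 0.0: the two teams cancel out

    print(blend_team_spirit(radiant, 0.0))   # tau=0: purely individual rewards
    print(blend_team_spirit(radiant, 1.0))   # tau=1: everyone gets the team mean
    print(time_weight(1.0, game_minutes=30.0))
```

With equal team sizes, subtracting the opposing team's mean from every hero makes the total reward across both teams sum to zero.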
Neural network architecture
- 158,502,815 parameters in total (policy and value function), about 0.16B
- observations are processed and pooled into a single vector summarizing the state
- single layer of LSTM
- the LSTM output is passed through linear projections into the action heads and the value function head.
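The data flow in the list above (pool observations into one vector, run a single LSTM layer, project into action heads and a value head) can be sketched in plain Python at toy scale. Everything here is an assumption made for illustration: max-pooling as the pooling operation, the tiny dimensions, the random initialization, and the names `TinyPolicy`, `lstm_step`, and `max_pool`. It shows the shape of the computation, not the real network.

```python
import math
import random


def linear(weights, bias, x):
    """Dense layer: weights @ x + bias, with weights as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_linear(n_out, n_in, rng):
    """Small random weights and zero bias (illustrative initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def max_pool(unit_vectors):
    """Pool a variable number of per-unit vectors into one summary vector."""
    return [max(column) for column in zip(*unit_vectors)]


def lstm_step(params, x, h, c):
    """One step of a standard LSTM cell with gate order input, forget, cell, output."""
    weights, bias = params
    n = len(h)
    gates = linear(weights, bias, x + h)
    i = [sigmoid(g) for g in gates[0:n]]
    f = [sigmoid(g) for g in gates[n:2 * n]]
    g_cell = [math.tanh(g) for g in gates[2 * n:3 * n]]
    o = [sigmoid(g) for g in gates[3 * n:4 * n]]
    c_new = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c, i, g_cell)]
    h_new = [oi * math.tanh(ci) for oi, ci in zip(o, c_new)]
    return h_new, c_new


class TinyPolicy:
    def __init__(self, obs_dim, hidden, head_sizes, seed=0):
        rng = random.Random(seed)
        self.hidden = hidden
        self.lstm = init_linear(4 * hidden, obs_dim + hidden, rng)
        self.heads = {name: init_linear(size, hidden, rng)
                      for name, size in head_sizes.items()}
        self.value = init_linear(1, hidden, rng)

    def initial_state(self):
        return [0.0] * self.hidden, [0.0] * self.hidden

    def step(self, unit_vectors, state):
        """Return per-head logits, a scalar value estimate, and the new LSTM state."""
        summary = max_pool(unit_vectors)
        h, c = lstm_step(self.lstm, summary, *state)
        logits = {name: linear(w, b, h) for name, (w, b) in self.heads.items()}
        value = linear(*self.value, h)[0]
        return logits, value, (h, c)


if __name__ == "__main__":
    sizes = {"primary": 30, "delay": 4, "unit": 189, "offset": 81}
    policy = TinyPolicy(obs_dim=6, hidden=8, head_sizes=sizes)
    units = [[0.1] * 6, [0.5, -0.2, 0.0, 1.0, 0.3, -1.0]]
    logits, value, state = policy.step(units, policy.initial_state())
    print({name: len(out) for name, out in logits.items()})
```

Because the policy and the value function share the pooled vector and the LSTM, only the final linear projections differ between the heads.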
Training system
- rollout workers (51,200 CPUs for running games)
- 512 GPUs for network inference (forward pass)
- 512 GPUs for training
- 10 months training time