Procedural Content Generation With Deep Q-Learning

Matthew MacFarquhar
Sep 26, 2024

Introduction

I have been reading this book on Procedural Content Generation and its integration with ML. As I go through the book, I will try my best to implement the concepts we go over in small, easily digestible examples. In this chapter, we will use Reinforcement Learning, specifically Deep Q-Learning, to teach an agent to create Monster stats that win against our target opponent about 50% of the time, giving us a balanced opponent.

The notebooks used in this article can be found in the repository below.

Deep Q-Learning

Q-learning is a reinforcement learning algorithm that learns the value (Q-value) of taking a specific action in a given state by updating a table based on rewards received from the environment. In Q-learning, this table represents the policy that tells the agent what action to take for each state. Deep Q-learning (DQN) replaces this table with a neural network that approximates the Q-values, enabling the algorithm to handle more complex environments with continuous or high-dimensional state spaces.
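
To make the table-based version concrete, below is a minimal sketch of the tabular Q-learning update (the toy dimensions and variable names are illustrative, not from this project). DQN swaps the table lookup for a forward pass through a neural network and the in-place update for a gradient step toward the same target.

import numpy as np

# Toy tabular Q-learning update (illustrative only)
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99  # learning rate and discount factor

def q_update(state, action, reward, next_state, done):
    # Bellman target: immediate reward plus discounted best future value
    target = reward if done else reward + gamma * np.max(Q[next_state])
    # Nudge the stored Q-value toward the target
    Q[state, action] += alpha * (target - Q[state, action])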

The Gym Environment

We will be using the Gymnasium Python library, which gives us a lot of functionality for building agents that learn to play the available gym envs. For our use case, we will not be training an agent to play a game; instead, we will be training an agent to generate a balanced Monster opponent. We can still use Gymnasium's API to create our own environment for this task, however.

Init

Our MonsterWorldGenEnv will inherit from gym.Env. We provide a metadata dict which tells us the available render modes and the fps. We also create the target enemy we will try to balance against and a map of actions to stat changes (where index 0 is health, 1 is damage, 2 is speed and 3 is armor).

import gymnasium as gym
import numpy as np
from gymnasium import spaces


class MonsterWorldGenEnv(gym.Env):
    metadata = {"render_modes": [], "render_fps": 4}

    def __init__(self):
        self.opponent_stats = {"health": 50, "speed": 25, "damage": 25, "armor": 25}
        self.observation_space = spaces.Dict(
            {
                "stats": spaces.Box(1, 50, shape=(4,), dtype=int),
            }
        )

        self.action_space = spaces.Discrete(8)

        # Each action increments or decrements one of the four stats
        # (index 0: health, 1: damage, 2: speed, 3: armor)
        self._action_to_direction = {
            0: np.array([1, 0, 0, 0]),
            1: np.array([-1, 0, 0, 0]),
            2: np.array([0, 1, 0, 0]),
            3: np.array([0, -1, 0, 0]),
            4: np.array([0, 0, 1, 0]),
            5: np.array([0, 0, -1, 0]),
            6: np.array([0, 0, 0, 1]),
            7: np.array([0, 0, 0, -1]),
        }

Helpers

Our _get_obs helper will return the current stats of our monster (we will use it in our next two functions).

    def _get_obs(self):
        return {"stats": self._stats}

Reset

Reset gets called at the beginning of each training episode and re-initializes the environment with a fresh set of random stats.

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)

        self._stats = self.np_random.integers(1, 50, size=4, dtype=int)

        observation = self._get_obs()

        return observation, {}

Step

To evaluate our score, we will need a Monster class with the ability to battle another monster. We will pit the two monsters against each other 100 times to measure our generated monster's win rate.

import random


class Monster:
    def __init__(self, health, damage, speed, armor):
        self.health = health
        self.damage = damage
        self.speed = speed
        self.armor = armor

    def take_turn(self, opponent):
        # Calculate if opponent dodges
        dodge_chance = opponent.speed / 100
        if random.random() < dodge_chance:
            return

        # Calculate damage dealt (always at least 1)
        damage_dealt = max(1, self.damage - opponent.armor)
        opponent.health -= damage_dealt

Every step, we update our monster's stats based on the chosen action, then play the target monster 100 times to get our win rate. The reward is highest when the win rate is close to 50%, and we subtract a small penalty each step so the agent is pushed to find a solution quickly.

    def step(self, action):
        direction = self._action_to_direction[action]
        self._stats = np.clip(self._stats + direction, 1, 50)

        observation = self._get_obs()

        win_rate = self.play_opponent()

        # 1.0 when win_rate is exactly 0.5, falling linearly to 0.0 at 0% or 100%
        reward = (0.5 - abs(win_rate - 0.5)) / 0.5
        terminated = 1.0 - reward <= 0.01
        if terminated:
            reward = 10
        else:
            reward -= 1.0

        return observation, reward, terminated, False, {}

    def play_opponent(self):
        wins = 0
        for i in range(100):
            monster1 = Monster(self._stats[0], self._stats[1], self._stats[2], self._stats[3])
            monster2 = Monster(self.opponent_stats['health'], self.opponent_stats['damage'], self.opponent_stats['speed'], self.opponent_stats['armor'])
            while monster1.health > 0 and monster2.health > 0:
                monster1.take_turn(monster2)
                if monster2.health <= 0:
                    wins += 1
                    break
                monster2.take_turn(monster1)
                if monster1.health <= 0:
                    break
        return wins / 100
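
Before wiring up any learning, it can help to sanity-check the environment by hand. The snippet below is a rough sketch (assuming the class above is in scope) that resets the env, applies a few random actions, and prints the rewards. For example, a 55% win rate gives (0.5 - 0.05) / 0.5 = 0.9, minus the 1.0 step penalty, for a reward of -0.1.

env = MonsterWorldGenEnv()
obs, _ = env.reset(seed=42)
print("initial stats:", obs["stats"])

for _ in range(5):
    action = env.action_space.sample()  # random stat tweak
    obs, reward, terminated, truncated, _ = env.step(action)
    print(action, obs["stats"], reward, terminated)
    if terminated:
        break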

Summary

  • The gymnasium Python package allows us to easily create custom environments, optionally with a visual aspect
  • Our gyms have observation spaces and action spaces for the agent to use
  • We can reward or punish the agent each step for the action it takes

The Agent Training Process

Our Agent will have functions to act, to remember the results of its actions, and to replay (and train on) these action -> consequence memories. We will put the Agent in our gym and run it for a number of episodes, allowing it to interact with the world and observe the results of its actions.

The Agent Setup

Our Agent's brain will be a neural network; this is where we will learn our policy. The observations will be fed into the network as input, and Q-values for each of the actions that could be performed will come out as outputs.

In previous articles I have mostly used TensorFlow, however TensorFlow's RL modules are extremely lacking, so for this article we will use PyTorch.

import torch
import torch.nn as nn


class DQN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, output_dim)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)

        return x

Our model is quite simple and has an input layer for the observation space followed by a couple fully connected layers and an output layer. The output layer corresponds to our action space of size 8 which maps to incrementing or decrementing our 4 attributes.
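
As a quick sanity check of the wiring (a throwaway sketch, not part of the project code), we can push a dummy observation of four stats through the network and confirm we get eight Q-values back:

model = DQN(input_dim=4, output_dim=8)
dummy_stats = torch.tensor([25.0, 25.0, 25.0, 25.0])
print(model(dummy_stats).shape)  # torch.Size([8])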

We will now build our agent around this neural network brain.


import numpy as np
import random
from collections import deque

import torch
import torch.nn as nn
import torch.optim as optim


class DQNAgent:
    def __init__(self, state_dim, action_dim, lr, gamma, epsilon, epsilon_decay, buffer_size):
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.lr = lr
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.memory = deque(maxlen=buffer_size)
        self.model = DQN(state_dim, action_dim)
        self.optimizer = optim.Adam(self.model.parameters(), lr=lr)

    def act(self, state):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit
        if np.random.rand() <= self.epsilon:
            return np.random.choice(self.action_dim)
        q_values = self.model(torch.tensor(state, dtype=torch.float32))
        return torch.argmax(q_values).item()

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def replay(self, batch_size):
        if len(self.memory) < batch_size:
            return
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            state = torch.tensor(state, dtype=torch.float32)
            next_state = torch.tensor(next_state, dtype=torch.float32)

            q_values = self.model(state)
            next_q_values = self.model(next_state)

            # Bellman target: reward plus discounted best future Q-value
            target = reward if done else reward + self.gamma * torch.max(next_q_values).item()

            # Copy the current predictions (detached so the target is treated
            # as a constant) and overwrite the taken action's Q-value
            target_f = self.model(state).detach()
            target_f[action] = target

            # Compute the loss and update the network
            loss = nn.MSELoss()(q_values, target_f)

            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
        if self.epsilon > 0.01:
            self.epsilon *= self.epsilon_decay

We initialize the agent with some familiar and obvious parameters: the state (observation) dimension, the action dimension, the learning rate, a memory buffer (to store our action -> consequence items), our model, and an optimizer.

However, there are a few parameters which are unique to the RL process which we will go over.

Gamma is the discount factor that determines the importance of future rewards, with lower values prioritizing immediate rewards and higher values emphasizing long-term gains.

Epsilon is the probability of making a random choice instead of listening to the neural network. Training an RL agent is about balancing exploration vs exploitation. At the start of learning, our agent knows nothing, so we want it to explore as much as possible (epsilon starts at 1.0). Later on, we expect our agent to have learned a good policy, so we want to stick with that policy and explore less often. This is why we have an epsilon_decay that slowly decreases how often we explore over time.
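
To put rough numbers on these two knobs (an illustrative calculation, not project code): under gamma = 0.99, a reward 50 steps in the future keeps about 61% of its value, and with a decay factor of 0.9995 it takes roughly 1,400 decay steps for epsilon to fall below 0.5.

gamma, epsilon, epsilon_decay = 0.99, 1.0, 0.9995

print(gamma ** 50)  # ~0.605: a reward 50 steps away keeps ~60% of its value

steps = 0
while epsilon > 0.5:
    epsilon *= epsilon_decay
    steps += 1
print(steps)  # ~1386 decay steps to drop exploration below 50%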

When we act, we either choose an action at random (if exploring) or ask our neural net to predict the action we should take.

The replay function samples a minibatch from the agent's memory and updates the Q-values using the model's predictions for the current state and next state, adjusting them based on rewards and future expected rewards (discounted by gamma). It computes the loss between the predicted Q-values and the target for the stat-change action that was taken, back-propagates that loss, and adjusts the model's parameters, while also decaying the exploration rate (epsilon).

Learning

Now our agent is ready to learn!

First we need to load up our gym environment.


import gymnasium as gym

from monster_gen_world import MonsterWorldGenEnv
from gymnasium.envs.registration import register

register(
    id="gym_examples/MonsterWorldGenEnv-v0",
    entry_point="monster_gen_world:MonsterWorldGenEnv",
    max_episode_steps=200,
)

We will run for 100 episodes. Each episode continues until we find an appropriate set of stats for our monster or we hit the max step count (200).


# Initialize environment and agent with Experience Replay Buffer
env = gym.make("gym_examples/MonsterWorldGenEnv-v0")
state_dim = 4
action_dim = env.action_space.n
agent = DQNAgent(state_dim, action_dim, lr=0.0001, gamma=0.99, epsilon=1.0, epsilon_decay=0.9995, buffer_size=10000)

# Train the DQN agent with Experience Replay Buffer
batch_size = 128
num_episodes = 100
for episode in range(num_episodes):
    state, _ = env.reset()
    done = False
    total_reward = 0
    step = 0
    while not done:
        action = agent.act(state["stats"])
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        agent.remember(state["stats"], action, reward, next_state["stats"], done)
        state = next_state
        total_reward += reward
        step += 1
    agent.replay(batch_size)
    print(f"Episode: {episode + 1}, Total Reward: {total_reward} Total Steps: {step} Agent Epsilon: {agent.epsilon}")

Now we can use the agent to synthesize stats for monsters which are the same strength as our target monster.


state, _ = env.reset()
done = False
total_reward = 0
reward = 0.0
agent.epsilon = 0.0  # pure exploitation: always follow the learned policy
while not done:
    action = agent.act(state["stats"])
    next_state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    state = next_state
    total_reward += reward
print(state)
print(reward)

The agent is able to find a viable set of stats much faster than at the beginning of the training process.

Summary

  • We use a neural network to learn a policy for our agent
  • We train our agent by playing episodes, extracting rewards for actions and then updating our network to pick choices which would maximize our future rewards

Conclusion

In this article, we learned how to build our own gym environment and teach an agent to learn within it, storing its brain in a neural network instead of a policy table. Normally, RL tasks focus on teaching an agent to play a game, but in this case we taught one to generate content for a game.
