Procedural Content Generation With Deep Q-Learning

Matthew MacFarquhar
Sep 26, 2024

Introduction

I have been reading this book on Procedural Content Generation and its integration with ML. As I go through the book, I will try my best to implement the concepts we go over in small, easily digestible examples. In this chapter, we will use Reinforcement Learning, specifically Deep Q-Learning, to teach an agent to create Monster stats that win against our target opponent about 50% of the time, giving us a balanced opponent.

The notebooks used in this article can be found in the repository below.

Deep Q-Learning

Q-learning is a reinforcement learning algorithm that learns the value (Q-value) of taking a specific action in a given state by updating a table based on rewards received from the environment. In Q-learning, this table represents the policy that tells the agent what action to take for each state. Deep Q-learning (DQN) replaces this table with a neural network that approximates the Q-values, enabling the algorithm to handle more complex environments with continuous or high-dimensional state spaces.
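
To make the table-based version concrete, below is a minimal sketch of the tabular Q-learning update (the toy dimensions and variable names are illustrative, not from this project). DQN swaps the table lookup for a forward pass through a neural network and the in-place update for a gradient step toward the same target.

import numpy as np

# Toy tabular Q-learning update (illustrative only)
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99  # learning rate and discount factor

def q_update(state, action, reward, next_state, done):
    # Bellman target: immediate reward plus discounted best future value
    target = reward if done else reward + gamma * np.max(Q[next_state])
    # Nudge the stored Q-value toward the target
    Q[state, action] += alpha * (target - Q[state, action])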

The Gym Environment

We will be using the Gymnasium Python library, which gives us a lot of functionality for building agents that learn to play the available gym envs. For our use case, we will not be training an agent to play a game; instead, we will be training an agent to generate a balanced Monster opponent. We can still use Gymnasium's API to create our own environment for this task, however.

Init

Our MonsterWorldGenEnv will inherit from gym.Env. We provide a metadata dict which tells us the available render modes and the fps. We also create the target enemy we will try to balance against and a map of actions to stat changes (where index 0 is health, 1 is damage, 2 is speed and 3 is armor).

import gymnasium as gym
import numpy as np
from gymnasium import spaces


class MonsterWorldGenEnv(gym.Env):
    metadata = {"render_modes": [], "render_fps": 4}

    def __init__(self):
        self.opponent_stats = {"health": 50, "speed": 25, "damage": 25, "armor": 25}
        self.observation_space = spaces.Dict(
            {
                "stats": spaces.Box(1, 50, shape=(4,), dtype=int),
            }
        )

        self.action_space = spaces.Discrete(8)

        # Each action increments or decrements one of the four stats
        # (index 0: health, 1: damage, 2: speed, 3: armor)
        self._action_to_direction = {
            0: np.array([1, 0, 0, 0]),
            1: np.array([-1, 0, 0, 0]),
            2: np.array([0, 1, 0, 0]),
            3: np.array([0, -1, 0, 0]),
            4: np.array([0, 0, 1, 0]),
            5: np.array([0, 0, -1, 0]),
            6: np.array([0, 0, 0, 1]),
            7: np.array([0, 0, 0, -1]),
        }

Helpers

Our _get_obs helper will return the current stats of our monster (we will use it in our next two functions).

    def _get_obs(self):
        return {"stats": self._stats}

Reset

Reset gets called at the beginning of each training episode and re-initializes the environment with a fresh set of random stats.

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)

        self._stats = self.np_random.integers(1, 50, size=4, dtype=int)

        observation = self._get_obs()

        return observation, {}

Step

To evaluate our score, we will need a Monster class with the ability to battle another monster. We will pit the two monsters against each other 100 times to measure our generated monster's win rate.

import random


class Monster:
    def __init__(self, health, damage, speed, armor):
        self.health = health
        self.damage = damage
        self.speed = speed
        self.armor = armor

    def take_turn(self, opponent):
        # Calculate if opponent dodges
        dodge_chance = opponent.speed / 100
        if random.random() < dodge_chance:
            return

        # Calculate damage dealt (always at least 1)
        damage_dealt = max(1, self.damage - opponent.armor)
        opponent.health -= damage_dealt

Every step, we update our monster's stats based on the chosen action, then play the target monster 100 times to get our win rate. The reward is highest when the win rate is close to 50%, and we subtract a small penalty each step so the agent is pushed to find a solution quickly.

    def step(self, action):
        direction = self._action_to_direction[action]
        self._stats = np.clip(self._stats + direction, 1, 50)

        observation = self._get_obs()

        win_rate = self.play_opponent()

        # 1.0 when win_rate is exactly 0.5, falling linearly to 0.0 at 0% or 100%
        reward = (0.5 - abs(win_rate - 0.5)) / 0.5
        terminated = 1.0 - reward <= 0.01
        if terminated:
            reward = 10
        else:
            reward -= 1.0

        return observation, reward, terminated, False, {}

    def play_opponent(self):
        wins = 0
        for i in range(100):
            monster1 = Monster(self._stats[0], self._stats[1], self._stats[2], self._stats[3])
            monster2 = Monster(self.opponent_stats['health'], self.opponent_stats['damage'], self.opponent_stats['speed'], self.opponent_stats['armor'])
            while monster1.health > 0 and monster2.health > 0:
                monster1.take_turn(monster2)
                if monster2.health <= 0:
                    wins += 1
                    break
                monster2.take_turn(monster1)
                if monster1.health <= 0:
                    break
        return wins / 100
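
Before wiring up any learning, it can help to sanity-check the environment by hand. The snippet below is a rough sketch (assuming the class above is in scope) that resets the env, applies a few random actions, and prints the rewards. For example, a 55% win rate gives (0.5 - 0.05) / 0.5 = 0.9, minus the 1.0 step penalty, for a reward of -0.1.

env = MonsterWorldGenEnv()
obs, _ = env.reset(seed=42)
print("initial stats:", obs["stats"])

for _ in range(5):
    action = env.action_space.sample()  # random stat tweak
    obs, reward, terminated, truncated, _ = env.step(action)
    print(action, obs["stats"], reward, terminated)
    if terminated:
        break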

Summary

  • The gymnasium Python package allows us to easily create custom environments, optionally with a visual aspect
  • Our gyms have observation spaces and action spaces for the agent to use
  • We can reward or punish the agent each step for the action it takes

The Agent Training Process

Our Agent will have functions to act, to remember the results of its actions, and to replay (and train on) these action -> consequence memories. We will put the Agent in our gym and run it for a number of episodes, allowing it to interact with the world and observe the results of its actions.

The Agent Setup

Our Agent's brain will be a neural network; this is where we will learn our policy. The observations will be fed into the network as input, and Q-values for each of the actions that could be performed will come out as outputs.

In previous articles I have mostly used TensorFlow, however TensorFlow's RL modules are extremely lacking, so for this article we will use PyTorch.

import torch
import torch.nn as nn


class DQN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, output_dim)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)

        return x

Our model is quite simple and has an input layer for the observation space followed by a couple fully connected layers and an output layer. The output layer corresponds to our action space of size 8 which maps to incrementing or decrementing our 4 attributes.
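
As a quick sanity check of the wiring (a throwaway sketch, not part of the project code), we can push a dummy observation of four stats through the network and confirm we get eight Q-values back:

model = DQN(input_dim=4, output_dim=8)
dummy_stats = torch.tensor([25.0, 25.0, 25.0, 25.0])
print(model(dummy_stats).shape)  # torch.Size([8])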

We will now build our agent around this neural network brain.


import numpy as np
import random
from collections import deque

import torch
import torch.nn as nn
import torch.optim as optim


class DQNAgent:
    def __init__(self, state_dim, action_dim, lr, gamma, epsilon, epsilon_decay, buffer_size):
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.lr = lr
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.memory = deque(maxlen=buffer_size)
        self.model = DQN(state_dim, action_dim)
        self.optimizer = optim.Adam(self.model.parameters(), lr=lr)

    def act(self, state):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit
        if np.random.rand() <= self.epsilon:
            return np.random.choice(self.action_dim)
        q_values = self.model(torch.tensor(state, dtype=torch.float32))
        return torch.argmax(q_values).item()

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def replay(self, batch_size):
        if len(self.memory) < batch_size:
            return
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            state = torch.tensor(state, dtype=torch.float32)
            next_state = torch.tensor(next_state, dtype=torch.float32)

            q_values = self.model(state)
            next_q_values = self.model(next_state)

            # Bellman target: reward plus discounted best future Q-value
            target = reward if done else reward + self.gamma * torch.max(next_q_values).item()

            # Copy the current predictions (detached so the target is treated
            # as a constant) and overwrite the taken action's Q-value
            target_f = self.model(state).detach()
            target_f[action] = target

            # Compute the loss and update the network
            loss = nn.MSELoss()(q_values, target_f)

            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
        if self.epsilon > 0.01:
            self.epsilon *= self.epsilon_decay

We initialize the agent with some familiar and obvious parameters: the state (observation) dimension, the action dimension, the learning rate, a memory buffer (to store our action -> consequence items), our model, and an optimizer.

However, there are a few parameters which are unique to the RL process which we will go over.

Gamma is the discount factor that determines the importance of future rewards, with lower values prioritizing immediate rewards and higher values emphasizing long-term gains.

Epsilon is the probability of making a random choice instead of listening to the neural network. Training an RL agent is about balancing exploration vs exploitation. At the start of learning, our agent knows nothing, so we want it to explore as much as possible (epsilon starts at 1.0). Later on, we expect our agent to have learned a good policy, so we want to stick with that policy and explore less often. This is why we have an epsilon_decay that slowly decreases how often we explore over time.
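
To put rough numbers on these two knobs (an illustrative calculation, not project code): under gamma = 0.99, a reward 50 steps in the future keeps about 61% of its value, and with a decay factor of 0.9995 it takes roughly 1,400 decay steps for epsilon to fall below 0.5.

gamma, epsilon, epsilon_decay = 0.99, 1.0, 0.9995

print(gamma ** 50)  # ~0.605: a reward 50 steps away keeps ~60% of its value

steps = 0
while epsilon > 0.5:
    epsilon *= epsilon_decay
    steps += 1
print(steps)  # ~1386 decay steps to drop exploration below 50%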

When we act, we either choose an action at random (if exploring) or ask our neural net to predict the action we should take.

The replay function samples a minibatch from the agent's memory and updates the Q-values using the model's predictions for the current state and next state, adjusting them based on rewards and future expected rewards (discounted by gamma). It computes the loss between the predicted Q-values and the target for the stat-change action that was taken, back-propagates that loss, and adjusts the model's parameters, while also decaying the exploration rate (epsilon).

Learning

Now our agent is ready to learn!

First we need to load up our gym environment.


import gymnasium as gym

from monster_gen_world import MonsterWorldGenEnv
from gymnasium.envs.registration import register

register(
    id="gym_examples/MonsterWorldGenEnv-v0",
    entry_point="monster_gen_world:MonsterWorldGenEnv",
    max_episode_steps=200,
)

We will run for 100 episodes. Each episode continues until we find an appropriate set of stats for our monster or we hit the max step count (200).


# Initialize environment and agent with Experience Replay Buffer
env = gym.make("gym_examples/MonsterWorldGenEnv-v0")
state_dim = 4
action_dim = env.action_space.n
agent = DQNAgent(state_dim, action_dim, lr=0.0001, gamma=0.99, epsilon=1.0, epsilon_decay=0.9995, buffer_size=10000)

# Train the DQN agent with Experience Replay Buffer
batch_size = 128
num_episodes = 100
for episode in range(num_episodes):
    state, _ = env.reset()
    done = False
    total_reward = 0
    step = 0
    while not done:
        action = agent.act(state["stats"])
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        agent.remember(state["stats"], action, reward, next_state["stats"], done)
        state = next_state
        total_reward += reward
        step += 1
    agent.replay(batch_size)
    print(f"Episode: {episode + 1}, Total Reward: {total_reward} Total Steps: {step} Agent Epsilon: {agent.epsilon}")

Now we can use the agent to synthesize stats for monsters which are the same strength as our target monster.


state, _ = env.reset()
done = False
total_reward = 0
reward = 0.0
agent.epsilon = 0.0  # pure exploitation: always follow the learned policy
while not done:
    action = agent.act(state["stats"])
    next_state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    state = next_state
    total_reward += reward
print(state)
print(reward)

The agent is able to find a viable set of stats much faster than at the beginning of the training process.

Summary

  • We use a neural network to learn a policy for our agent
  • We train our agent by playing episodes, extracting rewards for actions and then updating our network to pick choices which would maximize our future rewards

Conclusion

In this article, we learned how to build our own gym environment and teach an agent to learn within it, storing its brain in a neural network instead of a policy table. Normally, RL tasks focus on teaching an agent to play a game, but in this case we taught one to generate content for a game.
