Procedural Content Generation With Grid Based Neural Networks

Matthew MacFarquhar
11 min read · Sep 24, 2024

Introduction

I have been reading this book on Procedural Content Generation and its integration with ML. As I work through the book, I will try my best to implement the concepts we go over in small, easily implementable examples. In this chapter, we will explore grid-based networks which use CNNs as their backbone. We will create a Variational Autoencoder for Angry Birds levels, and then we will create a Pokémon GAN to generate silhouettes of new Pokémon.

The notebooks used in this article can be found in the repository below.

CNNs

Convolutional Neural Networks (CNNs) are deep learning models designed to process structured grid-like data, such as images, by applying convolutional layers that capture spatial hierarchies and patterns like edges, textures, and objects. They consist of multiple layers, including convolutional, pooling, and fully connected layers, which together enable feature extraction and classification.
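To make that concrete, here is a minimal sketch (not from the notebooks; it assumes a hypothetical 28x28 grayscale input and 10 output classes) of a Keras CNN classifier built from those three layer types:

import tensorflow as tf

# A minimal CNN classifier: convolution layers extract spatial features,
# pooling downsamples, and a dense layer performs the final classification.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, kernel_size=3, padding="same", activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Conv2D(32, kernel_size=3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.summary()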

Variational Auto-Encoder

We have already explored Auto-Encoders in Procedural Content Generation With Neural Networks. We hinted in that article that we could make our embedding layer better distributed by turning it into a VAE.

The key difference between a Variational Autoencoder (VAE) and a basic Autoencoder is that a VAE learns a continuous, probabilistic latent space by modeling a distribution over the data, whereas a basic Autoencoder maps each input directly to a single compressed latent representation. We do this by learning a mean and variance of Z for a given input instead of a single point in Z. We also add a KL-divergence loss on these learned values, punishing them for straying far from a standard normal.
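For a diagonal Gaussian with mean \mu and variance \sigma^2, this KL-divergence against a standard normal has a simple closed form, which is exactly the term we will add to our loss later:

D_{KL}\big(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, I)\big) = -\frac{1}{2} \sum_{i=1}^{d} \left(1 + \log \sigma_i^2 - \mu_i^2 - \sigma_i^2\right)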

Cleaning The Data

We will be learning to generate Angry Birds levels. The first step is to clean the data up a little bit and extract some class frequencies for the tile data.

import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
from PIL import Image

# Map each raw tile entry to an integer class in [0, 5]
def clean(entry):
    if str(entry) == "nan":
        return 0
    return min(5, int(str(entry)[0]))

vectorized_clean = np.vectorize(clean)

levels = np.array([])
key = {}
for i in range(1, 61):
    file_name = f'angry_birds_levels/level {i}.txt'
    level = pd.read_csv(file_name, sep='\t', header=None)
    cleaned_level = vectorized_clean(level)
    # Tally how often each tile class appears across all levels
    for row in range(cleaned_level.shape[0]):
        for col in range(cleaned_level.shape[1]):
            entry = cleaned_level[row][col]
            if entry in key:
                key[entry] += 1
            else:
                key[entry] = 1
    levels = np.append(levels, cleaned_level)

We load 60 levels, each a 12x15 grid, into numerical matrices with one class per tile (0 — EMPTY, 1 — ICE, 2 — STONE, 3 — WOOD, 4 — PIG, 5 — DYNAMITE).

print(key)

total_samples = sum(key.values())
num_classes = len(key)
class_weights = {cls: total_samples / (num_classes * freq) for cls, freq in key.items()}
weights_array = np.array([class_weights[cls] for cls in sorted(key.keys())])
weights_tensor = tf.constant(weights_array, dtype=tf.float32)

print(weights_tensor)
levels = levels.reshape(-1,12,15)
print(levels.shape)
print(levels[0])

Next, we extract class weights that we can use to skew training toward the classes that appear less frequently in our data set.
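Note that the VAE loss we define below uses a plain, unweighted cross-entropy; if you wanted to fold these weights in, a minimal sketch (assuming the weights_tensor above and targets of shape (batch, 12, 15, 1)) might look like this:

# Hypothetical weighted reconstruction loss: weight each tile's
# cross-entropy term by the rarity of its true class.
def weighted_reconstruction_loss(y_true, y_pred):
    ce = tf.keras.metrics.sparse_categorical_crossentropy(y_true, y_pred)
    per_tile_weights = tf.gather(weights_tensor, tf.cast(tf.squeeze(y_true, axis=-1), tf.int32))
    return tf.reduce_mean(ce * per_tile_weights)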

def display_level_image(tile_dim, level, tile_name_to_image):
    # tile_dim is (width, height); the canvas is cols * width by rows * height
    level_image = Image.new("RGB", (tile_dim[0] * len(level[0]), tile_dim[1] * len(level)))
    for row in range(len(level)):
        for col in range(len(level[0])):
            level_image.paste(tile_name_to_image[level[row][col]],
                              (col * tile_dim[0], row * tile_dim[1], (col + 1) * tile_dim[0], (row + 1) * tile_dim[1]))

    plt.figure()
    plt.imshow(level_image)
    plt.show()

We will create a function to display a given matrix of tiles.

tile_name_to_image = {}
tiles = ["empty", "ice", "stone", "wood", "pig", "tnt"]
for i in range(len(tiles)):
    tile_name_to_image[i] = Image.open(f'angry_birds_tiles/{tiles[i]}.png')

tile_dim = tile_name_to_image[0].size

display_level_image(tile_dim, levels[0], tile_name_to_image)

Below is an example of a level in our training set.

The Model

Our model will be a pretty basic U-Net-like design, which reduces the dimensionality of our data through a series of convolution layers down to a bottleneck and then grows it back to the input dimension using Conv2DTranspose layers.

num_classes = 6
latent_dim = 8
level_shape = (12,15,1)
dropout_rate = 0.5

# DOWN CONV with Dropout
encoder_input = tf.keras.Input(shape=level_shape)
conv_1 = tf.keras.layers.Conv2D(filters=16, kernel_size=3, padding="same", activation="relu")(encoder_input)
conv_1 = tf.keras.layers.Dropout(dropout_rate)(conv_1) # Add dropout
conv_2 = tf.keras.layers.Conv2D(filters=32, kernel_size=3, padding="same", activation="relu")(conv_1)
conv_2 = tf.keras.layers.Dropout(dropout_rate)(conv_2)
conv_3 = tf.keras.layers.Conv2D(filters=32, kernel_size=3, padding="same",activation="relu")(conv_2)
conv_3 = tf.keras.layers.Dropout(dropout_rate)(conv_3)

flatten = tf.keras.layers.Flatten()(conv_3)

# Mu and Sigma
encoder_output = tf.keras.layers.Dense(256, activation="relu")(flatten)
z_mu = tf.keras.layers.Dense(latent_dim)(encoder_output)
z_log_sigma = tf.keras.layers.Dense(latent_dim)(encoder_output)

# sample re-param trick (z_log_sigma is treated as the log-variance,
# matching the KL term in the loss below)
def sampling(args):
    z_mu, z_log_sigma = args
    epsilon = tf.random.normal(shape=(tf.shape(z_mu)[0], latent_dim), mean=0., stddev=1.)
    return z_mu + tf.exp(0.5 * z_log_sigma) * epsilon

z = tf.keras.layers.Lambda(sampling, output_shape=(latent_dim,))([z_mu, z_log_sigma])

# UP CONV with reduced complexity and Dropout
dense_1 = tf.keras.layers.Dense(256, activation="relu")(z)
dense_2 = tf.keras.layers.Dense(np.prod(np.shape(conv_3)[1:]), activation="relu")(dense_1)
reshape = tf.keras.layers.Reshape(np.shape(conv_3)[1:])(dense_2)

conv_4 = tf.keras.layers.Conv2DTranspose(filters=32, kernel_size=3, padding="same", activation="relu")(reshape)
conv_4 = tf.keras.layers.Dropout(dropout_rate)(conv_4)
conv_5 = tf.keras.layers.Conv2DTranspose(filters=32, kernel_size=3, padding="same", activation="relu")(conv_4)
conv_5 = tf.keras.layers.Dropout(dropout_rate)(conv_5)
conv_6 = tf.keras.layers.Conv2DTranspose(filters=16, kernel_size=3, padding="same", activation="relu")(conv_5)

decoder_output = tf.keras.layers.Conv2D(filters=num_classes, kernel_size=3, padding="same", activation="softmax")(conv_6)
flattened_output = tf.keras.layers.Reshape((12 * 15 * num_classes,))(decoder_output)

concatenated_output = tf.keras.layers.Concatenate()([flattened_output, z_mu, z_log_sigma])

def vae_loss(y_true, output, kl_loss_weight=0.1):
    # Split the concatenated output back into predictions, mu, and log-variance
    y_pred = output[:, 0:12*15*num_classes]
    y_pred = tf.reshape(y_pred, shape=(-1, 12, 15, num_classes))

    z_mu = output[:, 12*15*num_classes : 12*15*num_classes + latent_dim]
    z_log_sigma = output[:, 12*15*num_classes + latent_dim : 12*15*num_classes + latent_dim*2]

    reconstruction_loss = tf.reduce_mean(tf.keras.metrics.sparse_categorical_crossentropy(y_true, y_pred))
    kl_loss = -0.5 * tf.reduce_mean(1 + z_log_sigma - z_mu**2 - tf.exp(z_log_sigma), axis=-1)
    return kl_loss_weight * kl_loss + reconstruction_loss

# Model compilation
vae_cnn = tf.keras.Model(inputs=encoder_input, outputs=concatenated_output)
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0005)
vae_cnn.compile(optimizer=optimizer, loss=vae_loss)
vae_cnn.summary()

We have functions to sample from our learned mu and sigma to generate our Z latent dimension vectors — we picked 8 as the latent dimension.
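The sampling function is the standard reparameterization trick: instead of sampling z directly (which would not be differentiable), we sample noise \epsilon and shift and scale it with the learned parameters:

z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)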

We also created a custom loss function which combines the KL-divergence loss, pushing the learned distribution towards a standard normal, with the sparse categorical cross-entropy loss for actually predicting the tile classes.
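In other words, the total objective is

\mathcal{L} = \mathcal{L}_{\text{CE}} + \beta \, D_{KL}\big(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, I)\big)

with \beta = 0.1 (the kl_loss_weight above).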

I had to make a hacky Concatenate layer to get my mu and sigma out of the last layer, and then hard-code some slice bounds to recover the different outputs inside the loss. This is one limitation of TensorFlow's compile/fit workflow I was not a big fan of (PyTorch is much better at allowing us to return multiple output tensors from a model).
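As an aside, one way around this in Keras (a sketch, untested here and dependent on your TensorFlow version) is to attach the KL term with model.add_loss inside the model definition, so the model only outputs the reconstruction and the loss passed to compile stays a plain cross-entropy:

# Hypothetical alternative: attach the KL term directly to the model,
# so compile() only needs the reconstruction loss.
kl = -0.5 * tf.reduce_mean(1 + z_log_sigma - z_mu**2 - tf.exp(z_log_sigma))
vae_alt = tf.keras.Model(inputs=encoder_input, outputs=decoder_output)
vae_alt.add_loss(0.1 * kl)
vae_alt.compile(optimizer="adam", loss=tf.keras.losses.SparseCategoricalCrossentropy())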

We can now train our network.

x = levels.reshape((levels.shape[0], levels.shape[1], levels.shape[2], 1))

vae_cnn.fit(x=x, y=x, epochs=250, batch_size=16, validation_split=0.1)

Generate

We can generate examples to see how well we did. We will start with an end-to-end pass of a level through the network to see if it can be properly encoded in our 8-dimensional latent space.

test = x[0]
gen = vae_cnn.predict(test.reshape(1, 12, 15, 1))[0][:12*15*num_classes].reshape((12, 15, num_classes))
print(gen.shape)

generated_level = np.argmax(gen, axis=2)

display_level_image(tile_dim, generated_level, tile_name_to_image)

Our embedding is probabilistic, so repeated samples with the same input will produce different outputs. Even so, our reconstruction is pretty close to the original without excessive overfitting.

We will then split the network into two separate models: an encoder and a decoder.

decoder_input = tf.keras.Input(shape=(latent_dim,))
decoder_layer = vae_cnn.layers[-11](decoder_input)
decoder_layer = vae_cnn.layers[-10](decoder_layer)
decoder_layer = vae_cnn.layers[-9](decoder_layer)
decoder_layer = vae_cnn.layers[-8](decoder_layer)
decoder_layer = vae_cnn.layers[-7](decoder_layer)
decoder_layer = vae_cnn.layers[-6](decoder_layer)
decoder_layer = vae_cnn.layers[-5](decoder_layer)
decoder_layer = vae_cnn.layers[-4](decoder_layer)
decoder_layer = vae_cnn.layers[-3](decoder_layer)
decoder = tf.keras.Model(decoder_input, decoder_layer)

encoder = tf.keras.Model(encoder_input, z)

We can then interpolate between two examples.

vector_one = encoder.predict(np.array(x[0]).reshape(1, 12, 15, 1))[0]
vector_two = encoder.predict(np.array(x[1]).reshape(1, 12, 15, 1))[0]

a = 0.7
vector = vector_one * a + vector_two * (1-a)

gen = decoder.predict(np.array(vector).reshape(1,latent_dim))[0]
generated_level = np.argmax(gen, axis=2)
display_level_image(tile_dim, generated_level, tile_name_to_image)

Our training worked pretty well on cases seen during training; however, it did not generalize to new levels very well. We suffered from an extreme lack of training data (only 60 levels), and we probably could have done better with more.
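Because the KL term pulls the latent distribution towards a standard normal, we can also try sampling z directly from the prior to generate entirely new levels. A quick sketch using the decoder defined above:

# Sample a latent vector from the standard normal prior and decode it
z_sample = np.random.normal(size=(1, latent_dim))
gen = decoder.predict(z_sample)[0]
generated_level = np.argmax(gen, axis=2)
display_level_image(tile_dim, generated_level, tile_name_to_image)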

Summary

  • We can use CNNs on grid-based data to condense it down to a bottleneck layer and then upsample back to the original size.
  • Using the reparameterization trick, we can learn a distribution over the embedding layer and add a KL-divergence loss term to push that distribution to more closely match a standard normal.

GAN

A Generative Adversarial Network (GAN) is a type of machine learning model consisting of two neural networks: a generator that creates synthetic data, and a discriminator that evaluates the data’s authenticity. The two networks are trained together, with the generator aiming to produce increasingly realistic data while the discriminator tries to distinguish between real and generated data.
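Formally, the generator G and discriminator D play a minimax game over the value function:

\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]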

We will build a GAN to generate Pokémon silhouettes.

NOTE: this network is quite large, and you would definitely benefit from using a GPU.

Cleaning The Data

We will load the images of the Pokémon into an array. There are a couple of sprites which do not have the expected 42x56 pixel shape, so we will toss those out.

import os
from collections import defaultdict
from scipy.spatial.distance import cdist

images = []
folder_path = "pokemon/"
for filename in os.listdir(folder_path):
    if filename.endswith(".png"):
        img_path = os.path.join(folder_path, filename)
        img = Image.open(img_path).convert('RGB')
        img_array = np.array(img)
        # Skip sprites that do not match the expected 42x56 shape
        if img_array.shape[0] != 42 or img_array.shape[1] != 56:
            continue
        images.append(img_array)
print(len(images))

We will now need to pre-process the data, extracting the two most common colors and then setting each pixel to the nearest of those colors — creating a silhouette-like look.

pixel_color_frequency = defaultdict(int)

for image in images:
    for row in range(image.shape[0]):
        for col in range(image.shape[1]):
            pixel_color = tuple(image[row, col])  # (R, G, B)

            # Increment the count for this pixel color
            pixel_color_frequency[pixel_color] += 1

# Convert defaultdict back to a normal dictionary if needed
pixel_color_frequency = dict(pixel_color_frequency)

print(pixel_color_frequency)

k = 2
color_frequency_list = list(pixel_color_frequency.items())
color_frequency_list.sort(key=lambda x: x[1], reverse=True)
top_k_colors = [color for color, freq in color_frequency_list[:k]]

print(top_k_colors)
print(len(top_k_colors))
def replace_with_closest_colors(images, top_k_colors):
    num_images, height, width, _ = images.shape
    flattened_images = images.reshape(-1, 3)

    # Distance from every pixel to each of the top-k colors
    distances = cdist(flattened_images, top_k_colors, metric='euclidean')
    closest_color_indices = np.argmin(distances, axis=1)

    closest_colors = top_k_colors[closest_color_indices]
    new_images = closest_colors.reshape(num_images, height, width, 3)

    index_images = closest_color_indices.reshape(num_images, height, width, 1)

    return new_images, index_images

processed_images, index_images = replace_with_closest_colors(np.array(images), np.array(top_k_colors))
print(processed_images.shape)

index_images_squeezed = np.squeeze(index_images, axis=-1)
one_hot_encoded_images = np.eye(k)[index_images_squeezed]
print(one_hot_encoded_images.shape)

The Model

Our model will consist of a generator and a discriminator. The generator will take in a vector of noise and turn it into a 42x56 image, and the discriminator will take in a 42x56 image and output the probability that the given image is real rather than generated.

class MinibatchDiscrimination(tf.keras.layers.Layer):
    def __init__(self, num_kernels, kernel_dim):
        super(MinibatchDiscrimination, self).__init__()
        self.num_kernels = num_kernels
        self.kernel_dim = kernel_dim

    def build(self, input_shape):
        super(MinibatchDiscrimination, self).build(input_shape)
        # Create a tensor T, used for the minibatch discrimination
        self.T = self.add_weight(
            shape=(input_shape[-1], self.num_kernels * self.kernel_dim),
            initializer='random_normal',
            trainable=True,
            name='T'
        )

    def call(self, x):
        # Multiply the input with T
        M = tf.matmul(x, self.T)
        M = tf.reshape(M, (-1, self.num_kernels, self.kernel_dim))

        # Compute the L1 distances between the vectors in the batch
        M_diff = tf.expand_dims(M, 0) - tf.expand_dims(M, 1)
        abs_diff = tf.reduce_sum(tf.abs(M_diff), axis=3)

        # Compute the exponential of the negative of the L1 distance
        minibatch_features = tf.reduce_sum(tf.exp(-abs_diff), axis=1)

        return tf.concat([x, minibatch_features], axis=1)

class Generator(tf.keras.Model):
    def __init__(self):
        super(Generator, self).__init__()

        self.leaky_relu_1 = tf.keras.layers.LeakyReLU(alpha=0.2)
        self.leaky_relu_2 = tf.keras.layers.LeakyReLU(alpha=0.2)
        self.leaky_relu_3 = tf.keras.layers.LeakyReLU(alpha=0.2)
        self.dense_1 = tf.keras.layers.Dense(512)
        self.batch_norm_1 = tf.keras.layers.BatchNormalization()
        self.dense_2 = tf.keras.layers.Dense(42 * 56 * k)
        self.batch_norm_2 = tf.keras.layers.BatchNormalization()
        self.reshape = tf.keras.layers.Reshape((42, 56, k))
        self.convTrans_1 = tf.keras.layers.Conv2DTranspose(k, (4, 4), strides=(1, 1), padding='same')
        self.batch_norm_3 = tf.keras.layers.BatchNormalization()
        self.conv = tf.keras.layers.Conv2D(k, (3, 3), activation='softmax', padding='same')

    def call(self, inputs):
        x = self.dense_1(inputs)
        x = self.batch_norm_1(x)
        x = self.leaky_relu_1(x)
        x = self.dense_2(x)
        x = self.batch_norm_2(x)
        x = self.leaky_relu_2(x)
        x = self.reshape(x)
        x = self.convTrans_1(x)
        x = self.batch_norm_3(x)
        x = self.leaky_relu_3(x)
        return self.conv(x)

class Discriminator(tf.keras.Model):
    def __init__(self):
        super(Discriminator, self).__init__()

        self.leaky_relu_1 = tf.keras.layers.LeakyReLU(alpha=0.2)
        self.leaky_relu_2 = tf.keras.layers.LeakyReLU(alpha=0.2)
        self.leaky_relu_3 = tf.keras.layers.LeakyReLU(alpha=0.2)
        self.dropout_1 = tf.keras.layers.Dropout(0.6)
        self.dropout_2 = tf.keras.layers.Dropout(0.6)
        self.dropout_3 = tf.keras.layers.Dropout(0.6)
        self.dropout_4 = tf.keras.layers.Dropout(0.6)
        self.conv_1 = tf.keras.layers.Conv2D(k, (3, 3), padding='same')
        self.conv_2 = tf.keras.layers.Conv2D(32, (3, 3), strides=(2, 2), padding='same')
        self.conv_3 = tf.keras.layers.Conv2D(64, (3, 3), strides=(2, 2), padding='same')
        self.flatten = tf.keras.layers.Flatten()
        self.minibatch_discrimination = MinibatchDiscrimination(num_kernels=10, kernel_dim=5)
        self.out = tf.keras.layers.Dense(1, activation='sigmoid')

    def call(self, inputs):
        x = self.conv_1(inputs)
        x = self.leaky_relu_1(x)
        x = self.dropout_1(x)
        x = self.conv_2(x)
        x = self.leaky_relu_2(x)
        x = self.dropout_2(x)
        x = self.conv_3(x)
        x = self.leaky_relu_3(x)
        x = self.dropout_3(x)
        x = self.flatten(x)
        x = self.dropout_4(x)
        x = self.minibatch_discrimination(x)
        return self.out(x)

The MinibatchDiscrimination layer helps prevent mode collapse in GANs by introducing diversity into the discriminator’s decision-making process. It computes the L1 distances between feature vectors within a batch, encouraging the discriminator to consider how similar or different samples are from each other, rather than just evaluating them individually.
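In symbols, with learned tensor T, each sample's flattened features x_i are projected to M_i = x_i T and augmented with a similarity statistic over the batch of size B:

o(x_i) = \sum_{j=1}^{B} \exp\left(-\lVert M_i - M_j \rVert_1\right), \qquad \text{output} = [x_i, \, o(x_i)]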

Our GAN will combine the generator and discriminator. To train it, we punish the discriminator for incorrectly classifying the fake and real images, and we punish the generator for failing to fool the discriminator.

class GAN(tf.keras.Model):
    # Define the models
    def __init__(self, discriminator, generator, latent_dim):
        super(GAN, self).__init__()
        self.discriminator = discriminator
        self.generator = generator
        self.latent_dim = latent_dim

    # Define the compile step with separate optimizers and losses
    def compile(self, disc_optimizer, gen_optimizer, loss_fn, generator_loss, discriminator_loss):
        super(GAN, self).compile()
        self.disc_optimizer = disc_optimizer
        self.gen_optimizer = gen_optimizer
        self.generator_loss = generator_loss
        self.discriminator_loss = discriminator_loss
        self.loss_fn = loss_fn

    # @tf.function compiles the train step into a TensorFlow graph,
    # which is good for performance
    @tf.function
    def train_step(self, images):
        batch_size = tf.shape(images)[0]
        noise = tf.random.normal([batch_size, self.latent_dim])

        # Compute both losses under a persistent tape so we can take
        # two separate gradients from it
        with tf.GradientTape(persistent=True) as tape:
            generated_image_logits = self.generator(noise)

            real_output = self.discriminator(images)
            fake_output = self.discriminator(generated_image_logits)

            gen_loss = self.generator_loss(self.loss_fn, fake_output)
            disc_loss = self.discriminator_loss(self.loss_fn, real_output, fake_output)

        # Calculate gradients
        grad_disc = tape.gradient(disc_loss, self.discriminator.trainable_variables)
        grad_gen = tape.gradient(gen_loss, self.generator.trainable_variables)

        # Optimization step: update the weights of both networks
        self.disc_optimizer.apply_gradients(zip(grad_disc, self.discriminator.trainable_variables))
        self.gen_optimizer.apply_gradients(zip(grad_gen, self.generator.trainable_variables))

        return {"Gen Loss": gen_loss, "Disc Loss": disc_loss}

latent_dim = 512
epochs = 1000
batch_size = 64

gen_optimizer = tf.keras.optimizers.Adam(0.0001)
disc_optimizer = tf.keras.optimizers.Adam(0.00005)

def discriminator_loss(loss_object, real_output, fake_output):
    real_loss = loss_object(tf.ones_like(real_output), real_output)
    fake_loss = loss_object(tf.zeros_like(fake_output), fake_output)
    total_loss = real_loss + fake_loss
    return total_loss

# loss_object: Binary Crossentropy loss
# discriminator_probability: result from the Discriminator, 0 = Fake, 1 = Real
def generator_loss(loss_object, discriminator_probability):
    return loss_object(tf.ones_like(discriminator_probability), discriminator_probability)

disc = Discriminator()
gen = Generator()

gan = GAN(discriminator=disc, generator=gen, latent_dim=latent_dim)
gan.compile(
    disc_optimizer=disc_optimizer,
    gen_optimizer=gen_optimizer,
    loss_fn=tf.keras.losses.BinaryCrossentropy(),
    generator_loss=generator_loss,
    discriminator_loss=discriminator_loss
)

Training

We will now train. We will create a training callback so we can see how our GAN progresses.

class GANPredictionCallback(tf.keras.callbacks.Callback):
    def __init__(self, gen_model, latent_dim):
        super(GANPredictionCallback, self).__init__()
        self.latent_dim = latent_dim
        self.gen_model = gen_model

    def on_epoch_end(self, epoch, logs=None):
        if epoch % 10 == 0:
            print(f'\nEpoch {epoch + 1} Predictions:')
            fig, axes = plt.subplots(1, 5, figsize=(15, 5))

            for i in range(5):
                noise = tf.random.normal([1, self.latent_dim])
                logits = self.gen_model.predict(noise, verbose=0)[0]
                argmax_indices = np.argmax(logits, axis=-1).astype(int)
                # Map each predicted index back to its RGB color
                image = np.array([[top_k_colors[idx] for idx in row] for row in argmax_indices])

                # Display the image in the corresponding subplot
                axes[i].imshow(image)
                axes[i].axis('off')  # Hide axis

            # Show the plot
            plt.tight_layout()
            plt.show()

prediction_callback = GANPredictionCallback(
    gen_model=gen,
    latent_dim=latent_dim
)

gan.fit(one_hot_encoded_images * 0.9, epochs=epochs, batch_size=batch_size, callbacks=[prediction_callback])

We fuzz the real images by multiplying the one-hot encoded images by 0.9. This helps stabilize training, as our discriminator would otherwise often beat our generator quickly and make it hard for the generator to learn and recover.
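A common alternative (not used in the notebook; shown here as a sketch) is one-sided label smoothing, where we soften the discriminator's real labels instead of the inputs:

# Hypothetical one-sided label smoothing: real targets become 0.9
def discriminator_loss_smoothed(loss_object, real_output, fake_output):
    real_loss = loss_object(tf.ones_like(real_output) * 0.9, real_output)
    fake_loss = loss_object(tf.zeros_like(fake_output), fake_output)
    return real_loss + fake_loss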

At the first stage, the generated image is just a bunch of noise.

By the 20th epoch, our generator learns to make the silhouette focused around the center of the image.

Later in the generation process, we generate some shapes that look like they could be real Pokémon.

Training this GAN was difficult, since we had to balance the generator and discriminator losses; if one outpaced the other, it made it hard for the weaker one to recover. In general, GAN training is quite unstable, which is part of why diffusion models have become the preferred way to generate image content.

Summary

  • GANs consist of a discriminator and a generator network which fight each other and grow together.
  • GANs are unstable and difficult to train properly.

Conclusion

In this article, we explored CNNs and their use cases in content generation. CNNs are great at understanding grid-like relationships across multiple dimensions, which show up in tile-based levels and sprite images from games. We built a VAE backed by CNN layers and a GAN which used CNNs.
