Procedural Content Generation With Neural Networks

Matthew MacFarquhar
Sep 5, 2024

Introduction

I have been reading this book on Procedural Content Generation and its integration with ML. As I go through it, I will try my best to implement the concepts in small, self-contained examples. In this chapter, we create our first true neural networks for content generation using TensorFlow, covering basic deep neural networks for regression, classification and auto-encoding.

The Notebooks used in this article can be found in the below repository.

Predicting Levels

For our first Neural Network task, we will train on some example levels and generate new levels using our trained models. We will create two networks: a regression network which takes in two previous columns as context and spits out a predicted next column height and a classification network which takes in the tiles below and to the left of the current one and predicts the tile that it should be.

First, we will code up some setup required by both networks:

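from PIL import Image
import matplotlib.pyplot as plt

def display_level_image(tile_dim, level, tile_name_to_image):
    # tile_dim is (width, height) in pixels; level is indexed as level[row][col]
    tile_w, tile_h = tile_dim
    # the image width uses the tile width (the original used the tile height for both axes)
    level_image = Image.new("RGB", (tile_w * len(level[0]), tile_h * len(level)))
    for row in range(len(level)):
        for col in range(len(level[0])):
            # paste each tile image at its pixel offset
            level_image.paste(tile_name_to_image[level[row][col]], (col * tile_w, row * tile_h, (col + 1) * tile_w, (row + 1) * tile_h))

    plt.figure()
    plt.imshow(level_image)
    plt.show()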

This function takes in tile dimensions, a level — which is a 2D array of tile ids — and a tile id to image map.

import json

# load one image per tile id
tile_name_to_image = {}
tiles = ["A", "B", "C", "G", "K"]
for tile in tiles:
    tile_name_to_image[tile] = Image.open(f'tiles/{tile}.png')

tile_dim = tile_name_to_image["A"].size
with open("levels.json", 'r') as file:
    levels = json.load(file)

# pick any of the 4 levels to view
display_level_image(tile_dim, levels[0], tile_name_to_image)

This lets us display any of the 4 example levels we will train on.

Regression (Height predictor)

Our goal for the Height predictor network is to take in the two preceding columns as context (X) and predict the next column height (Y).

We will first turn our example levels into training data, where X has shape (num_samples, 2 * tiles_per_col) and Y has shape (num_samples, 1). Each sample is one sliding window of two columns, so a single level contributes many samples.

import numpy as np

embedding_map = {
    "A": 0,
    "B": 1,
    "C": 2,
    "G": 3,
    "K": 4
}

X = []
Y = []
column_context = 2

for level in levels:
    # convert the row-major level into a list of columns of tile ids
    columns = []
    for col in range(len(level[0])):
        columns.append([embedding_map[level[i][col]] for i in range(len(level))])

    for col in range(len(columns) - column_context):
        x_entry = [columns[col + i] for i in range(column_context)]
        X.append(x_entry)
        # index of the first non-sky tile from the top; a column of only sky has height 0
        first_non_empty = next((i for i, value in enumerate(columns[col + column_context]) if value != 0), len(columns[col + column_context]))
        height = len(columns[col + column_context]) - first_non_empty
        Y.append(height)

X = np.array(X)
X = X.reshape(-1, X.shape[1] * X.shape[2])

Y = np.array(Y)
Y = Y.reshape(-1,1)

print(X)
print(Y)

We map to numerical values for the Tile ids and synthesize our X and Y data using the ground truth extracted from the examples.
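To make the windowing concrete, here is a minimal pure-Python sketch on a made-up 4x4 level. The toy level and the helper names `column_height` and `make_height_samples` are assumptions for illustration and are not part of the notebook; as above, tile ID 0 is treated as empty sky.

```python
def column_height(column, empty_id=0):
    """Height of a column listed top to bottom: rows from the first non-empty tile down."""
    for i, value in enumerate(column):
        if value != empty_id:
            return len(column) - i
    return 0


def make_height_samples(level, context=2):
    """Slide a window of `context` columns over a row-major level of tile IDs.

    Returns (X, Y): each X entry is the flattened context columns,
    each Y entry is the height of the column that follows them.
    """
    num_rows, num_cols = len(level), len(level[0])
    columns = [[level[r][c] for r in range(num_rows)] for c in range(num_cols)]
    X, Y = [], []
    for c in range(num_cols - context):
        window = columns[c:c + context]
        X.append([tile for col in window for tile in col])
        Y.append(column_height(columns[c + context]))
    return X, Y


# Toy level, assumed for illustration: 0 = sky, 2 and 4 = solid tiles.
toy_level = [
    [0, 0, 0, 0],
    [0, 0, 2, 0],
    [2, 2, 4, 2],
    [4, 4, 4, 4],
]
X_toy, Y_toy = make_height_samples(toy_level)
print(X_toy)  # [[0, 0, 2, 4, 0, 0, 2, 4], [0, 0, 2, 4, 0, 2, 4, 4]]
print(Y_toy)  # [3, 2]
```

A level with N columns yields N - 2 samples, which is why four small levels are enough to produce a usable (if tiny) training set.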

import tensorflow as tf

hidden_layer_nodes = 16
# three hidden layers followed by a single linear output node for the height
height_model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(hidden_layer_nodes, activation='relu', input_shape=(X.shape[1],)),
    tf.keras.layers.Dense(hidden_layer_nodes, activation='relu'),
    tf.keras.layers.Dense(hidden_layer_nodes, activation='relu'),
    tf.keras.layers.Dense(1)
])

height_model.compile(optimizer='adam',
                     loss='mean_squared_error')

height_model.fit(X, Y, epochs=500)

Finally, we create a Neural Network with 16 hidden nodes per layer and 3 hidden layers. Our final layer has one node which will spit out a predicted height and we fit the model using MSE between the predicted height and ground truth height.

We are definitely overfitting on the training data. If we had a lot more examples, we could hold out a validation split to detect the overfitting and stop training before the network simply memorizes the examples.
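As a sketch of what a hold-out split could look like once more data is available, the plain-Python helper below shuffles the samples and sets a fraction aside. The name `train_val_split` and the 20% fraction are assumptions for illustration; Keras' `fit` also accepts a `validation_split` argument, which holds out the last fraction of the data it is given.

```python
import random


def train_val_split(X, Y, val_fraction=0.2, seed=0):
    """Shuffle sample indices and hold out `val_fraction` of them for validation."""
    if len(X) != len(Y):
        raise ValueError("X and Y must have the same number of samples")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    num_val = max(1, int(len(indices) * val_fraction))
    val_idx, train_idx = indices[:num_val], indices[num_val:]
    X_train = [X[i] for i in train_idx]
    Y_train = [Y[i] for i in train_idx]
    X_val = [X[i] for i in val_idx]
    Y_val = [Y[i] for i in val_idx]
    return (X_train, Y_train), (X_val, Y_val)


# Ten dummy samples standing in for the (context, height) pairs.
X_demo = [[i, i + 1] for i in range(10)]
Y_demo = [i % 3 for i in range(10)]
(train_X, train_Y), (val_X, val_Y) = train_val_split(X_demo, Y_demo)
print(len(train_X), len(val_X))  # 8 2
```

If the validation loss starts rising while the training loss keeps falling, the network is memorizing the examples instead of learning a pattern.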

Classifier (Tile Picker)

Now we will build a classifier model which will take in the tile ids from below and to the left as context and predict the tile that should be placed.

This time we one-hot encode the input and output tiles, which suits the categorical cross-entropy loss. We also add an extra class representing an out-of-bounds tile so that border tiles have a well-defined neighbour. X therefore has shape (num_samples, 12): 6 possible classes per neighbour, for the left and below neighbours. Since we never want to predict an out-of-bounds tile, Y has shape (num_samples, 5).
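A minimal sketch of this encoding for a single tile, in plain Python. The helper names `one_hot_with_border` and `one_hot_tile` are assumptions for illustration; the comprehension is the same one the training loop uses.

```python
NUM_TILES = 5          # tile IDs 0..4 ("A", "B", "C", "G", "K")
OUT_OF_BOUNDS = -1     # extra class for neighbours outside the level


def one_hot_with_border(tile_id):
    """6-way one-hot over [-1, 0, 1, 2, 3, 4]; slot 0 is the out-of-bounds class."""
    return [1 if i == tile_id else 0 for i in range(OUT_OF_BOUNDS, NUM_TILES)]


def one_hot_tile(tile_id):
    """5-way one-hot over the real tile IDs, used for the target."""
    return [1 if i == tile_id else 0 for i in range(NUM_TILES)]


# A tile in the leftmost column (no left neighbour) sitting on top of tile 2 ("C").
x = one_hot_with_border(OUT_OF_BOUNDS) + one_hot_with_border(2)
y = one_hot_tile(2)
print(x)  # [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
print(y)  # [0, 0, 1, 0, 0]
```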

X = []
Y = []

for level in levels:
    for row in range(len(level)):
        for col in range(len(level[0])):
            # -1 marks a neighbour that falls outside the level
            if row == len(level) - 1:
                below = -1
            else:
                below = embedding_map[level[row+1][col]]
            if col == 0:
                left = -1
            else:
                left = embedding_map[level[row][col-1]]

            x_left = [1 if i == left else 0 for i in range(-1, 5)]
            x_below = [1 if i == below else 0 for i in range(-1, 5)]
            x = x_left + x_below
            X.append(x)

            y = level[row][col]
            Y.append([1 if i == embedding_map[y] else 0 for i in range(5)])

X = np.array(X)

Y = np.array(Y)
Y = Y.reshape(-1,5)

print(X.shape)
print(Y.shape)

Now we will build our model and train it using categorical cross-entropy loss.

hidden_layer_nodes = 8
num_categories = 5
content_model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(hidden_layer_nodes, activation='relu', input_shape=(X.shape[1],)),
    tf.keras.layers.Dense(hidden_layer_nodes, activation='relu'),
    tf.keras.layers.Dense(num_categories, activation='softmax')
])

content_model.compile(optimizer='adam',
                      loss='categorical_crossentropy')

content_model.fit(X, Y, epochs=250)

Creating Levels

Our level creation pipeline will use our two networks in tandem. First, we will start with two columns as context. We will use our height prediction network to predict how tall the next column should be and then for each of those tiles in the column, we use our classifier to predict the content (leaving all content above the height as sky).

def generate_level(context_cols, num_cols_to_generate, column_height, height_model, content_model):
    # copy the list so the caller's context_cols is not mutated by append
    level = list(context_cols)

    for _ in range(num_cols_to_generate):
        next_height = 8
        #next_height = round(height_model.predict(np.array(context_cols).reshape(1,16), verbose=0)[0][0])
        next_col = [0] * column_height
        # fill the column from the bottom tile upwards
        for h in range(next_height):
            if h == 0:
                below = -1
            else:
                below = next_col[column_height - h]
            left = context_cols[1][column_height - 1 - h]
            x_left = [1 if i == left else 0 for i in range(-1, 5)]
            x_below = [1 if i == below else 0 for i in range(-1, 5)]
            x = x_left + x_below

            # the softmax output is a probability distribution over the 5 tile ids
            probs = content_model.predict(np.array(x).reshape(1,12), verbose=0)[0]
            # float32 probabilities may not sum to exactly 1, so renormalize before sampling
            probs = probs.astype("float64")
            probs /= probs.sum()
            content_pred = int(np.random.choice(5, p=probs))
            next_col[column_height - 1 - h] = content_pred
        context_cols = [context_cols[1], next_col]
        level.append(next_col)
    return level

def translate_level(level):
    reverse_embedding_map = ["A","B","C","G","K"]
    # the generator produces a list of columns; transpose it into rows for display
    transposed_level = [list(row) for row in zip(*level)]

    for row in range(len(transposed_level)):
        for col in range(len(transposed_level[0])):
            transposed_level[row][col] = reverse_embedding_map[transposed_level[row][col]]

    return transposed_level
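
Note that `generate_level` samples each tile from the classifier's predicted distribution instead of always taking the most likely tile, which keeps the generated columns varied. The plain-Python sketch below contrasts the two choices on a made-up distribution; the helper names `sample_tile` and `argmax_tile` are assumptions for illustration.

```python
import random


def sample_tile(probabilities, rng):
    """Draw a tile ID in proportion to its predicted probability."""
    tile_ids = list(range(len(probabilities)))
    return rng.choices(tile_ids, weights=probabilities, k=1)[0]


def argmax_tile(probabilities):
    """Always pick the single most likely tile ID."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])


# Made-up classifier output over the five tile IDs.
probs = [0.7, 0.2, 0.1, 0.0, 0.0]
rng = random.Random(0)  # seeded only so the demo is repeatable
samples = [sample_tile(probs, rng) for _ in range(1000)]

print(argmax_tile(probs))                       # always 0
print({t: samples.count(t) for t in range(5)})  # mostly 0, some 1 and 2, never 3 or 4
```

With arg-max, the same context would always produce the same tile, so every generated level from a given pair of starter columns would be identical.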

We can then generate levels after providing some starter columns:

context_cols = [[0, 0, 0, 0, 0, 0, 2, 4],
                [0, 0, 0, 0, 0, 0, 2, 4]]
level = translate_level(generate_level(context_cols, 15, 8, height_model, content_model))
display_level_image(tile_dim, level, tile_name_to_image)

There is one caveat: the generator deviates from our original goal (see the commented-out line). Using the height from the first network resulted in bad levels where dirt touched sky (example below).

This happened quite a bit since we artificially constrained the content predictor to only predict up to a certain height and fill the rest with sky. Ideally we might have taken in column height as a parameter during the training of our classifier so that we could pipe that value in during sample time.
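A sketch of that idea: append the target column height, scaled to [0, 1], to the classifier input, so the feature vector grows from 12 to 13 entries. The function name `build_height_aware_features` and the scaling choice are assumptions for illustration; the classifier would have to be retrained with a 13-wide input on features built this way, using each column's true height during training.

```python
def build_height_aware_features(left, below, target_height, column_height=8):
    """Concatenate one-hot left/below neighbours with a normalized target height.

    `left` and `below` are tile IDs in 0..4, or -1 for out of bounds.
    """
    if not 0 <= target_height <= column_height:
        raise ValueError("target_height must be between 0 and column_height")
    x_left = [1 if i == left else 0 for i in range(-1, 5)]
    x_below = [1 if i == below else 0 for i in range(-1, 5)]
    return x_left + x_below + [target_height / column_height]


# Bottom tile (nothing below) next to tile 4, in a column that should be 2 tiles tall.
features = build_height_aware_features(left=4, below=-1, target_height=2)
print(len(features))  # 13
print(features[-1])   # 0.25
```

At sample time, the value predicted by the height network would be passed in as `target_height`, so the classifier knows how close each tile is to the top of its column.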

Setting the height to a constant 8 generated much nicer levels (like below).

Summary

  • We extracted X and Y data from our example content to train regression and classification models
  • When we tried to combine the networks, we constrained the classifier with a height it was unaware of, which produced poor output (we should have passed the height to the classifier as a feature)

Auto-Encoder

An Auto-encoder attempts to learn a representation of content in a lower-dimensional latent space. This allows us to compress content, sample the latent space for new content and interpolate between two pieces of content.

In our example, we will be creating an 8-dimensional latent space for our 136 tile (8 rows * 17 cols) dimension levels.

Our first step is to encode our levels as training samples X

import json
import numpy as np

embedding_map = {
    "A": 0,
    "B": 1,
    "C": 2,
    "G": 3,
    "K": 4
}

levels = []
for i in range(16):
    with open(f"generated_levels/{i}.json", 'r') as file:
        level = json.load(file)
    # replace each tile name with its numeric id
    for row in range(len(level)):
        for col in range(len(level[0])):
            level[row][col] = embedding_map[level[row][col]]
    levels.append(level)

# flatten each level into one row of tile ids
X = np.array(levels).reshape(len(levels), -1)
print(X)
print(X.shape)

X will be of shape (num_levels, 136) where each entry is the tile ID.

We will put this data into our auto-encoder and train

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Reshape
from tensorflow.keras.models import Model

inputLevel = Input(shape=(X.shape[1],))
encoded = Dense(128, activation='relu')(inputLevel)
encoded = Dense(64, activation='relu')(encoded)
encoded = Dense(32, activation='relu')(encoded)
encoded = Dense(16, activation='relu')(encoded)
encoded = Dense(8, activation='relu')(encoded)

decoded = Dense(16, activation='relu')(encoded)
decoded = Dense(32, activation='relu')(decoded)
decoded = Dense(64, activation='relu')(decoded)
decoded = Dense(128, activation='relu')(decoded)
# no activation here: a softmax before the reshape would normalize across all 680 outputs instead of per tile
decoded = Dense(X.shape[1] * 5)(decoded)
decoded = Reshape((X.shape[1], 5))(decoded)

autoencoder = Model(inputLevel, decoded)

# from_logits=True applies the softmax over the last axis (the 5 tile ids) inside the loss
autoencoder.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

autoencoder.fit(X, X, epochs=500)

The auto-encoder has layers that encode the content into an 8-dimensional vector and then decode it back into a (136, 5) matrix holding one score (logit) per tile for each of the five tile IDs. The loss applies a softmax over those five scores and compares the resulting distribution with the actual tile ID. In this case our training data X is the same as our ground truth X, which is why an auto-encoder falls into the unsupervised family of neural network architectures.
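For intuition, the loss for each tile is a softmax over the five scores followed by the negative log of the probability assigned to the true tile ID, averaged over all tiles. The sketch below works this out by hand in plain Python with made-up scores; the helper names are illustrative and are not Keras functions.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # subtract the max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(probs_per_tile, true_ids):
    """Mean of -log(p[true_id]) over all tiles."""
    losses = [-math.log(probs[t]) for probs, t in zip(probs_per_tile, true_ids)]
    return sum(losses) / len(losses)


# Two tiles with made-up scores over the five tile IDs.
scores = [
    [4.0, 0.0, 0.0, 0.0, 0.0],   # confident the tile is ID 0
    [0.0, 0.0, 0.0, 0.0, 0.0],   # no idea: uniform over the five IDs
]
probs = [softmax(s) for s in scores]
loss_right = sparse_categorical_crossentropy(probs, [0, 3])
loss_wrong = sparse_categorical_crossentropy(probs, [1, 3])
print(loss_right, loss_wrong)
```

The confident and correct first tile contributes almost nothing to `loss_right`, while the same confidence on the wrong ID dominates `loss_wrong`. The labels stay as plain integers, which is what "sparse" refers to.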

Latent space

The goal of this network is really to learn a good latent space (the model as a whole just spits out what you put in and would not be helpful to use to predict anything).

We will use the trained layers to create two new models: an Encoder which takes our level and outputs a vector in our latent space and a Decoder which takes a latent space vector and outputs a level.

encoded_input = Input(shape=(8,))
decoder_layer = autoencoder.layers[-6](encoded_input)
decoder_layer = autoencoder.layers[-5](decoder_layer)
decoder_layer = autoencoder.layers[-4](decoder_layer)
decoder_layer = autoencoder.layers[-3](decoder_layer)
decoder_layer = autoencoder.layers[-2](decoder_layer)
decoder_layer = autoencoder.layers[-1](decoder_layer)
decoder = Model(encoded_input, decoder_layer)

encoder = Model(inputLevel, encoded)

Entries in the latent space are just vectors so we can do normal vector operations like interpolate between the encodings of two levels.

vector_one = encoder.predict(np.array(levels[0]).reshape(1, 8*17))[0]
vector_two = encoder.predict(np.array(levels[1]).reshape(1, 8*17))[0]

a = 0.5
vector = [vector_one[i] * a + vector_two[i] * (1-a) for i in range(8)]

We can then decode the interpolated vector:

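gen = decoder.predict(np.array(vector).reshape(1,8))[0]

rows = 8
cols = 17
id_to_tile = ["A","B","C","G","K"]

out_level = []
for row in range(rows):
    level_row = []
    for col in range(cols):
        # pick the most likely tile id for this position
        content_pred = np.argmax(gen[row * cols + col])
        # map the id straight back to its tile name; assuming the training levels were stored
        # row-major (8 rows x 17 cols), out_level is already row-major and must not go through
        # translate_level, which transposes a list of columns
        level_row.append(id_to_tile[content_pred])
    out_level.append(level_row)

display_level_image(tile_dim, out_level, tile_name_to_image)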

Below are interpolations between levels 0 and 1 at the weights (1, 0), (0.75, 0.25), (0.5, 0.5), (0.25, 0.75) and (0, 1).

As we can see, some of these levels are probably not valid. That is because our latent space may have holes and be very spread out. A common fix (and extension to this system) is to add a KL-divergence loss that pushes the distribution of encodings towards a standard normal. That way, when we interpolate between two points, we have a higher chance of staying within the valid region of the latent space.

This architecture is called a Variational Auto-Encoder.
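
As a rough sketch of that extra loss term: in a Variational Auto-Encoder the encoder outputs a mean and a log-variance per latent dimension, and the KL divergence to a standard normal has a closed form. The function below is the standard formula written in plain Python, not code from the notebook; the name `kl_to_standard_normal` and the example values are assumptions for illustration.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1 for m, lv in zip(mu, log_var))


# An encoding that already matches the standard normal costs nothing...
print(kl_to_standard_normal([0.0] * 8, [0.0] * 8))  # 0.0
# ...while one far from the origin is penalized.
print(kl_to_standard_normal([3.0] * 8, [0.0] * 8))  # 36.0
```

Adding this term to the reconstruction loss pulls encodings towards the origin and discourages them from collapsing to isolated points, which is what makes points between two encodings more likely to decode into something sensible.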

Summary

  • We can learn a lower-dimensional latent space for our content, to which vector math can be applied
  • Often the learned latent space is not closed under some vector operations; extending our system to a Variational Auto-Encoder could be a good next step to address this

Conclusion

In this article, we took our first go at building real neural networks for content generation. We learned about DNNs for regression and classification tasks, and built an Auto-Encoder to extract a latent space for our content. We also learned how to turn our content into training data that can be fed into our models. We will build upon this mindset of treating our content as X and Y data in the following articles, where we venture into RNNs and CNNs.


Written by Matthew MacFarquhar

I am a software engineer working for Amazon living in SF/NYC.
