Procedural Content Generation With Sequence-Based Neural Networks
Introduction
I have been reading this book on Procedural Content Generation and its integration with ML. As I go through this book, I will try my best to implement the concepts we cover in small, self-contained examples. In this chapter, we will explore sequence-based networks — namely Auto-Regressive LSTMs, Seq2Seq LSTMs and Transformers — in order to model and generate sequential content.
Sequential content — in this context — refers to ordered data where each element depends on or is influenced by the preceding elements, such as time-series data, text, or audio. We can massage lots of data into sequential content (e.g., assign a token to each of a level's columns and feed the resulting sequence through a sequential network). For our running example in this article, we will be synthesizing Yu-Gi-Oh playing cards.
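As a small aside, here is a toy sketch (my own illustration, not code from the book) of that level-to-sequence idea: each column of a tile-based level becomes a "word", and each unique column gets a token id.
# A minimal sketch with a hypothetical 3x4 tile level ('-' empty, 'X' ground, 'E' enemy)
level = [
    "----",
    "-E--",
    "XXXX",
]
# Read the level column by column, top to bottom
columns = ["".join(row[i] for row in level) for i in range(len(level[0]))]
vocab = {col: idx for idx, col in enumerate(sorted(set(columns)))}
sequence = [vocab[col] for col in columns]
print(columns)   # ['--X', '-EX', '--X', '--X']
print(sequence)  # [0, 1, 0, 0]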
The Notebooks used in this article can be found in the below repository.
RNNs
LSTM (Long Short-Term Memory) networks and GRU (Gated Recurrent Unit) networks are both types of RNNs (Recurrent Neural Networks). The vanilla RNN is pretty terrible at remembering context and is rarely used outside of toy examples. GRUs and LSTMs are wired differently internally, but the book states that
“a GRU with approximately 33% more cells than an LSTM will perform roughly the same (they will take up similar amounts of memory and computation time and produce similar results).”
So they can be thought of as close, interchangeable siblings. GRUs and LSTMs essentially introduce gated paths that let memory from previous time steps flow through the network, mitigating the vanishing and exploding gradient problems.
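As a quick sketch of that interchangeability (illustrative unit counts I picked, not values from the book), swapping one for the other in Keras is a one-line change, with the GRU given roughly 33% more units:
from tensorflow.keras.layers import LSTM, GRU

lstm_layer = LSTM(96)   # an LSTM layer with 96 units...
gru_layer = GRU(128)    # ...is roughly comparable to a GRU layer with ~33% more units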
Auto-Regressive LSTM
Our first model will be an Auto-Regressive LSTM. An autoregressive LSTM predicts the output at each time step from the outputs that came before it, and at generation time it feeds its own predictions back in as input for the next step. This structure lets the model generate entire sequences by using its prior predictions as part of the input for future predictions.
Processing Data
We will be using this Yu-Gi-Oh trading card data set I found on Kaggle https://www.kaggle.com/datasets/ioexception/yugioh-cards.
Note: if you do have access to a GPU, now would be a good time to use it (or run this code in a Colab notebook). Alternatively, you could reduce the number of entries we take from the dataframe so your network does not need to train over as much data.
df = pd.read_csv("cards.csv", usecols=['name', 'type', 'atk', 'def', 'level', 'race'])
df['atk'] = df['atk'].fillna(0)
df['def'] = df['def'].fillna(0)
df['level'] = df['level'].fillna(0)
DELIM_TOKEN = "DELIM"
df['text'] = df.apply(lambda row: f"{DELIM_TOKEN} {row['name']} {DELIM_TOKEN} {row['type']} {DELIM_TOKEN} ATK {int(row['atk'])} {DELIM_TOKEN} DEF {int(row['def'])} {DELIM_TOKEN} Level {int(row['level'])} {DELIM_TOKEN} {row['race']} {DELIM_TOKEN}", axis=1)
print (len(df))
entries = df['text']
print(entries[:10])
We read in the cards.csv data and fill the empty fields in the atk, def and level columns with 0s (these fields are empty for Spell and Trap cards).
Then, we are going to transform our card data into a sentence structure which will feed into our LSTM.
Now that we have our sentences, we tokenize the entire vocabulary to associate each word with a number.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(entries)
total_words = len(tokenizer.word_index) + 1
print(total_words) # VOCAB SIZE
print(tokenizer.to_json())
Now we will construct our training data. We turn each sentence into a list of numerical tokens using our tokenizer. Then, we create sequences from the tokens: if a sentence has 25 tokens, we create 24 sequences of lengths 2, 3, 4, …, 25.
input_sequences = []
for entry in entries:
    token_list = tokenizer.texts_to_sequences([entry])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
print(len(input_sequences))
print(input_sequences[:10])
max_sequence_len = max([len(x) for x in input_sequences])
print(max_sequence_len)
We do need all our inputs to be the same length for our model, so we prepend padding until every sequence is as long as the max_sequence_len we found.
Then we create X and Y data, where X is the tokens [0, N-1) and Y is the last token. Essentially, we are given all the preceding tokens and trying to predict the next one.
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
predictors, label = input_sequences[:,:-1], input_sequences[:,-1]
label = to_categorical(label, num_classes=total_words)
The Model
Our LSTM model is a simple network: an embedding layer turning each input token into a 16-dimensional vector, a single LSTM layer with 100 units, and a dense softmax output layer that spits out a probability distribution over our vocabulary for the next token.
model = Sequential()
model.add(Embedding(total_words, 16, input_shape=(max_sequence_len-1,)))  # predictors have length max_sequence_len-1
model.add(LSTM(100))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()
We will train for 50 epochs
model.fit(predictors, label, epochs=50, batch_size=64)
And when we are finished, we can create some sample results. We do this by growing our text one predicted token at a time until we have seen 7 "delim"s. I like to sample the next token from the predicted distribution instead of taking the argmax because it yields different cards each time for the same starting text (although there is a chance we get some nonsensical cards as well).
text = "delim conjuring"
delims_seen = 2
while delims_seen < 7:
    token_list = tokenizer.texts_to_sequences([text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    predicted = model.predict(token_list, verbose=0)[0]
    predicted_index = np.random.choice(len(predicted), p=predicted)
    out_word = ""
    for word, index in tokenizer.word_index.items():
        if index == predicted_index:
            out_word = word
            break
    if out_word == "delim":
        delims_seen += 1
    text += " " + out_word
cleaned_text = text.replace("delim", "|")
print(cleaned_text)
Below is a completely new card I generated.
| conjuring blustering magician | effect monster | atk 1500 | def 2100 | level 6 |
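If you want a knob for this variety-versus-coherence trade-off, one optional extension (my own addition, not from the book's code) is temperature sampling, which sharpens or flattens the predicted distribution before drawing from it:
# Optional helper: temperature < 1 pushes sampling toward the argmax,
# temperature > 1 makes it more random.
def sample_with_temperature(probs, temperature=1.0):
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(probs + 1e-9) / temperature
    scaled = np.exp(logits - np.max(logits))
    scaled /= scaled.sum()
    return np.random.choice(len(scaled), p=scaled)

# e.g. in the loop above: predicted_index = sample_with_temperature(predicted, 0.8)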
Summary
- Tokenizing text is an important preprocessing step for any sequence-based task
- Auto-Regressive LSTMs predict the next token given a list of previous tokens and can be run in a loop to synthesize entire sequences from a starting point
Sequence2Sequence
Sometimes we do not want to just generate content sequentially, but instead to perform some translation task; for this we use networks like Seq2Seq.
A sequence-to-sequence (Seq2Seq) model consists of an encoder that processes an input sequence into a fixed-size context vector and a decoder that generates an output sequence based on this context. It is commonly used for tasks like machine translation or summarization, where the input and output sequences can have different lengths.
We will train a model that takes in a card with only some fields filled out and generates a completed card. This is useful for content repair and for fill-in-the-blank interactive content generation systems.
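To make the task concrete, here is a hypothetical input/target pair (a made-up card, not one from the dataset) in the sentence format we will build below; the input has some fields replaced with a mask token and the target is the complete card.
# A hypothetical example pair (not a real card from the dataset)
masked_card   = "<START> Cursed <UNK> Golem DELIM Effect <UNK> DELIM 1800 DELIM <UNK> DELIM 4 DELIM rock <END>"
complete_card = "<START> Cursed Stone Golem DELIM Effect Monster DELIM 1800 DELIM 2000 DELIM 4 DELIM rock <END>"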
Processing Data
Like in the Auto-Regressive LSTM, we will clean and restructure our data.
df = pd.read_csv("cards.csv", usecols=['name', 'type', 'atk', 'def', 'level', 'race'])
df['atk'] = df['atk'].fillna(0)
df['def'] = df['def'].fillna(0)
df['level'] = df['level'].fillna(0)
START_TOKEN = "<START>"
END_TOKEN = "<END>"
MASK_TOKEN = "<UNK>"
SPECIAL_TOKENS = [START_TOKEN, END_TOKEN]
df['text'] = df.apply(lambda row: f"{START_TOKEN} {row['name']} DELIM {row['type']} DELIM {int(row['atk'])} DELIM {int(row['def'])} DELIM {int(row['level'])} DELIM {row['race']} {END_TOKEN}", axis=1)
print(len(df['text']))
input_entries = df['text']
target_entries = input_entries.copy()
print(len(input_entries))
print(input_entries[:10])
We then mask some of the input words to simulate corrupted/missing data.
masking_prob = 0.3
masked_entries = []
for entry in input_entries:
    tokens = entry.split(' ')
    masked_entry = []
    for token in tokens:
        if token not in SPECIAL_TOKENS and np.random.rand() < masking_prob:
            masked_entry.append(MASK_TOKEN)
        else:
            masked_entry.append(token)
    masked_entries.append(" ".join(masked_entry))
masked_entries = pd.Series(masked_entries)
print(masked_entries[:10])
Then, we tokenize
tokenizer = Tokenizer(oov_token=MASK_TOKEN, filters="")
tokenizer.fit_on_texts(input_entries)
total_words = len(tokenizer.word_index) + 1
print(total_words)
print(tokenizer.to_json())
input_sequences_encoded = tokenizer.texts_to_sequences([ t for t in masked_entries])
target_sequences_encoded = tokenizer.texts_to_sequences([ t for t in target_entries])
max_sequence_len = max([len(x) for x in input_sequences_encoded])
print(max_sequence_len)
and pad
encoder_sequences = np.array(pad_sequences(input_sequences_encoded, maxlen=max_sequence_len, padding='post'))[:,1:]
target_sequences_padded = np.array(pad_sequences(target_sequences_encoded, maxlen=max_sequence_len, padding='post'))
decoder_input_sequences = target_sequences_padded[:,:-1]
decoder_output_sequences = target_sequences_padded[:,1:]
print(encoder_sequences.shape)
print(decoder_input_sequences.shape)
print(decoder_output_sequences.shape)
max_sequence_len = max_sequence_len - 1
In this case we pad at the end (post-padding), since the encoder consumes the whole sequence at once rather than building up to a final prediction like the Auto-Regressive LSTM.
We also extract 3 data vectors (see the short sketch below):
- X_encoder: the masked input sequences, starting after the first token (<START>)
- X_decoder: the unmasked target sequence minus its final token (the tokens the decoder has produced so far)
- Y_decoder: the unmasked target sequence shifted one step left (the true next token at each position)
Our network will create an encoder-generated context vector and use it alongside the previous N-1 target tokens to predict the Nth token.
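To make that shift concrete, here is a minimal sketch with toy tokens (not the real card data):
# How one target sequence is split into decoder input and decoder output
target = ["<START>", "cursed", "golem", "DELIM", "rock", "<END>"]
decoder_input = target[:-1]    # what the decoder has produced so far
decoder_output = target[1:]    # the true "next token" at each position
# At step i the decoder sees decoder_input[:i+1] plus the encoder context
# and is trained to predict decoder_output[i]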
The Model
latent_dim = 256
embedding_dim = 128
# ENCODER
encoder_inputs = Input(shape=(max_sequence_len,))
encoder_embedding = Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=embedding_dim)(encoder_inputs)
encoder_embedding = LayerNormalization()(encoder_embedding)
encoder_embedding = Dropout(0.5)(encoder_embedding)
encoder_lstm = LSTM(latent_dim, return_state=True, recurrent_dropout=0.5)
_, state_h, state_c = encoder_lstm(encoder_embedding)
encoder_states = [state_h, state_c]
#DECODER
decoder_inputs = Input(shape=(max_sequence_len,))
decoder_embedding = Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=embedding_dim)(decoder_inputs)
decoder_embedding = LayerNormalization()(decoder_embedding)
decoder_embedding = Dropout(0.5)(decoder_embedding)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True, recurrent_dropout=0.5)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)
decoder_outputs = LayerNormalization()(decoder_outputs)
decoder_outputs = Dropout(0.5)(decoder_outputs)
decoder_dense = Dense(len(tokenizer.word_index) + 1, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-4)
model.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer)
model.summary()
Our model has a latent dimension of 256 and an embedding size of 128.
Our encoder takes in our masked data and spits out 2 vectors of length 256, which form the context used by the decoder:
- state_h (Hidden State): the short-term memory that holds information about the previous output at a given time step. It is used to generate the output at each time step and is passed to the next LSTM step in the sequence.
- state_c (Cell State): the long-term memory that stores information over longer time periods. It helps the LSTM retain important information across the entire sequence and regulates what to remember or forget.
Our decoder takes in the previous target tokens and passes them, along with the 256-dimensional context vectors, into the decoder LSTM to generate a next-token prediction.
Training
I wanted to see how the model progresses during training, so for this model I added a callback.
class Seq2SeqPredictionCallback(tf.keras.callbacks.Callback):
    def __init__(self, input_sequences, target_sequences, target_tokenizer, encoder_model, decoder_model, sample_size=5):
        super().__init__()
        self.input_sequences = input_sequences
        self.target_sequences = target_sequences
        self.target_tokenizer = target_tokenizer
        self.encoder_model = encoder_model
        self.decoder_model = decoder_model
        self.sample_size = sample_size

    def decode_sequence(self, input_seq):
        # Encode the input as state vectors
        states_value = self.encoder_model.predict(input_seq, verbose=0)
        # Generate empty target sequence of length 1
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = self.target_tokenizer.word_index['<start>']
        # Sampling loop
        stop_condition = False
        decoded_sentence = ''
        while not stop_condition:
            output_tokens, h, c = self.decoder_model.predict([target_seq] + states_value, verbose=0)
            # Sample a token (index 0 is the padding index)
            sampled_token_index = np.argmax(output_tokens[0, -1, :])
            if sampled_token_index != 0:
                sampled_word = self.target_tokenizer.index_word[sampled_token_index]
            else:
                sampled_word = 'pad'
            decoded_sentence += ' ' + sampled_word
            # Exit condition: either hit max length or find stop token
            if sampled_word == '<end>' or len(decoded_sentence) > 500:
                stop_condition = True
            # Update the target sequence and states
            target_seq = np.zeros((1, 1))
            target_seq[0, 0] = sampled_token_index
            states_value = [h, c]
        return decoded_sentence

    def on_epoch_end(self, epoch, logs=None):
        if epoch % 10 == 0:
            print(f'\nEpoch {epoch + 1} Predictions:')
            for i in range(self.sample_size):
                choice = random.randint(0, len(self.input_sequences) - 1)
                input_seq = self.input_sequences[choice:choice+1]
                decoded_sentence = self.decode_sequence(input_seq)
                actual_sentence = ' '.join([self.target_tokenizer.index_word[index] if index > 0 else "pad" for index in self.target_sequences[choice]])
                print(f'Input {choice}: {input_seq}')
                print(f'Predicted: {decoded_sentence}')
                print(f'Actual: {actual_sentence}\n')
This callback just synthesizes some example card generations every 10 epochs.
We need to pass a separate encoder and decoder into the callback for sampling, since at generation time we do not have ground-truth decoder input sequences (our own generations are used as the decoder input as we create them).
These two networks share layers with the model we train. The encoder is easy: we can just reuse the layers already in the model. For the decoder, we need to create Inputs for what would have been fed in by the encoder states, and then we can reuse the decoder embedding, LSTM and dense layers to produce a new model.
encoder_model = Model(encoder_inputs, encoder_states)
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_state_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_lstm_output, state_h, state_c = decoder_lstm(decoder_embedding, initial_state=decoder_state_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_lstm_output)
decoder_model = Model(
    [decoder_inputs] + decoder_state_inputs,
    [decoder_outputs] + decoder_states
)
We can instantiate our callback and fit our model as before.
prediction_callback = Seq2SeqPredictionCallback(
    input_sequences=encoder_sequences,
    target_sequences=decoder_input_sequences,
    target_tokenizer=tokenizer,
    encoder_model=encoder_model,
    decoder_model=decoder_model,
    sample_size=3  # Number of samples to display
)
model.fit([encoder_sequences, decoder_input_sequences], decoder_output_sequences, epochs=30, batch_size=64, callbacks=[prediction_callback], validation_split=0.1)
Generation
The generation is basically the same as what we did in the callback function. The only difference is that we provide our own input instead of sampling from the training data.
test_example = ["evil <UNK> king delim <UNK> monster delim 5000 delim 2000 delim <UNK> delim fiend"]
encoded_test_example = tokenizer.texts_to_sequences([ t for t in test_example])
input_test_example = np.array(pad_sequences(encoded_test_example, maxlen=max_sequence_len, padding='post'))
states_value = encoder_model.predict(input_test_example, verbose=0)
target_seq = np.zeros((1,1))
target_seq[0,0] = tokenizer.word_index[START_TOKEN.lower()]
stop_condition = False
decoded_card = ''
words = 0
while not stop_condition:
output_tokens, h, c = decoder_model.predict([target_seq] + states_value, verbose=0)
max_index = np.argmax(output_tokens[0, 0])
sampled_token_index = np.argmax(output_tokens[0, 0])
if sampled_token_index != 0:
sampled_word = tokenizer.index_word[sampled_token_index]
decoded_card += " " + sampled_word
if sampled_word == '<end>' or words > 50:
stop_condition = True
states_value = [h, c]
target_seq = np.zeros((1,1))
target_seq[0,0] = sampled_token_index
words += 1
print(decoded_card.replace("delim", "|"))
Below is a result I generated.
evil eye | fusion monster | 2000 | 1200 | 6 | fiend <end>
As you can see, we did not do a great job. This is the difficulty with RNNs: they often forget information that was given in the distant past. For example, this model forgot that I had given the card a title of "evil __ king": it remembered "evil" but seemed to throw away "king", and it also disregarded the attack and defense I gave the card.
Summary
- We can train a Seq2Seq model to fill in blanks for given data
- Recurrent systems often have issues remembering details from time steps far in the past
Transformer
The Transformer network is the successor to the recurrence-based Seq2Seq model. It introduces attention layers, allowing each generation step to "attend" to pieces of the previous text. The Transformer architecture also removes the recurrent layers and instead uses positional encodings to embed positional information into its input. This allows the network to be trained in parallel, which was not possible with RNNs, which must consume their data sequentially.
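To give a feel for what an attention layer computes, here is a minimal numpy sketch of standard scaled dot-product attention (the core idea behind the MultiHeadAttention layers we use below, not the exact Keras implementation):
import numpy as np

def scaled_dot_product_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])                   # (seq, seq) similarity between positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ v                                        # each position is a weighted sum of values

q = k = v = np.random.rand(5, 8)  # 5 tokens, depth 8 (self-attention)
print(scaled_dot_product_attention(q, k, v).shape)            # (5, 8)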
The Model
The Transformer network uses the same inputs and outputs as the Seq2Seq model, so we will skip the pre-processing step. You can see we have encoding functions and layers that follow the standard Transformer structure: the encoder embeds the input and applies self-attention and fully connected layers, and the outputs then go into the decoder, which uses masked (causal) attention and cross-attention modules to finally generate output.
# Hyperparameters
num_layers = 2
embedding_dim = 256
num_heads = 12
ff_dim = 512
dropout_rate = 0.3
def positional_encoding(length, depth):
    # depth is the dimensionality of the encoding, length is the input sequence length
    depth = depth/2
    positions = np.arange(length)[:, np.newaxis]      # (length, 1)
    depths = np.arange(depth)[np.newaxis, :]/depth    # (1, depth)
    angle_rates = 1 / (10000**depths)                 # (1, depth)
    angle_rads = positions * angle_rates              # (pos, depth)
    pos_encoding = np.concatenate([np.sin(angle_rads), np.cos(angle_rads)], axis=1)
    return tf.cast(pos_encoding, dtype=tf.float32)
#Encoder Embedding
encoder_inputs = Input(shape=(max_sequence_len,))
context = tf.keras.layers.Embedding(len(tokenizer.word_index) + 1, embedding_dim)(encoder_inputs)
context += positional_encoding(len(tokenizer.word_index) + 1, embedding_dim)[tf.newaxis, :max_sequence_len, :]
# Encoder
context = tf.keras.layers.Dropout(dropout_rate)(context)
for i in range(num_layers):
    attention_output = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embedding_dim, dropout=dropout_rate)(query=context, key=context, value=context)
    context = tf.keras.layers.Add()([context, attention_output])
    context = tf.keras.layers.LayerNormalization()(context)
    sequential = tf.keras.Sequential([
        tf.keras.layers.Dense(ff_dim, activation='relu'),
        tf.keras.layers.Dense(embedding_dim),
        tf.keras.layers.Dropout(dropout_rate)
    ])
    context = tf.keras.layers.Add()([context, sequential(context)])
    context = tf.keras.layers.LayerNormalization()(context)
# Decoder Embedding
decoder_inputs = Input(shape=(max_sequence_len,))
x = tf.keras.layers.Embedding(len(tokenizer.word_index) + 1, embedding_dim)(decoder_inputs)
x = x + positional_encoding(len(tokenizer.word_index) + 1, embedding_dim)[tf.newaxis, :max_sequence_len, :]
#Decoder
x = tf.keras.layers.Dropout(dropout_rate)(x)
for i in range(num_layers):
    causal_attention_output = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embedding_dim, dropout=dropout_rate)(query=x, key=x, value=x, use_causal_mask=True)
    x = tf.keras.layers.Add()([x, causal_attention_output])
    x = tf.keras.layers.LayerNormalization()(x)
    cross_attention_output = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embedding_dim, dropout=dropout_rate)(query=x, key=context, value=context)
    x = tf.keras.layers.Add()([x, cross_attention_output])
    x = tf.keras.layers.LayerNormalization()(x)
    sequential = tf.keras.Sequential([
        tf.keras.layers.Dense(ff_dim, activation='relu'),
        tf.keras.layers.Dense(embedding_dim),
        tf.keras.layers.Dropout(dropout_rate)
    ])
    x = tf.keras.layers.Add()([x, sequential(x)])
    x = tf.keras.layers.LayerNormalization()(x)
# Final Out
out = tf.keras.layers.Dense(len(tokenizer.word_index) + 1, activation='softmax')(x)
model = Model([encoder_inputs, decoder_inputs], out)
optimizer = Adam(learning_rate=0.0001)
model.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer)
model.summary()
Training
To train, we will inject a callback to see how we are progressing.
class TransformerPredictionCallback(tf.keras.callbacks.Callback):
    def __init__(self, input_sequences, target_sequences, target_tokenizer, transformer_model, max_sequence_len, sample_size=5):
        super().__init__()
        self.input_sequences = input_sequences
        self.target_sequences = target_sequences
        self.target_tokenizer = target_tokenizer
        self.transformer_model = transformer_model
        self.max_sequence_len = max_sequence_len
        self.sample_size = sample_size

    def decode_sequence(self, input_seq):
        # Generate empty target sequence of length 1
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = self.target_tokenizer.word_index['<start>']
        # Sampling loop
        decoded_sentence = ''
        for i in range(self.max_sequence_len):
            output_tokens = self.transformer_model([input_seq, np.array(pad_sequences(target_seq, maxlen=self.max_sequence_len, padding='post'))], training=False)
            # Sample a token (index 0 is the padding index)
            sampled_token_index = np.argmax(output_tokens[0, i, :])
            if sampled_token_index != 0:
                sampled_word = self.target_tokenizer.index_word[sampled_token_index]
            else:
                sampled_word = 'pad'
            decoded_sentence += ' ' + sampled_word
            # Exit condition: either hit max length or find stop token
            if sampled_word == '<end>':
                break
            new_value = np.zeros((1, 1))
            new_value[0, 0] = sampled_token_index
            target_seq = np.concatenate([target_seq, new_value], axis=1)
        return decoded_sentence

    def on_epoch_end(self, epoch, logs=None):
        if epoch % 10 == 0:
            print(f'\nEpoch {epoch + 1} Predictions:')
            for i in range(self.sample_size):
                choice = random.randint(0, len(self.input_sequences) - 1)
                input_seq = self.input_sequences[choice:choice+1]
                decoded_sentence = self.decode_sequence(input_seq)
                actual_sentence = ' '.join([self.target_tokenizer.index_word[index] if index > 0 else "pad" for index in self.target_sequences[choice]])
                print(f'Input {choice}: {input_seq}')
                print(f'Predicted: {decoded_sentence}')
                print(f'Actual: {actual_sentence}\n')
and then fit on our data
prediction_callback = TransformerPredictionCallback(
    input_sequences=encoder_sequences,
    target_sequences=decoder_output_sequences,
    target_tokenizer=tokenizer,
    transformer_model=model,
    max_sequence_len=max_sequence_len,
    sample_size=3  # Number of samples to display
)
model.fit([encoder_sequences, decoder_input_sequences], decoder_output_sequences, epochs=30, batch_size=32, validation_split=0.1, callbacks=[prediction_callback])
Generating
And now we can generate
test_example = ["powerful <UNK> skeleton DELIM Ritual Effect <UNK> DELIM 1000 DELIM 3000 DELIM <UNK> DELIM fiend"]
encoded_test_example = tokenizer.texts_to_sequences([t for t in test_example])
input_test_example = np.array(pad_sequences(encoded_test_example, maxlen=max_sequence_len, padding='post'))
target_seq = np.zeros((1, 1))
target_seq[0, 0] = tokenizer.word_index['<start>']
# Sampling loop
decoded_sentence = ''
for i in range(max_sequence_len):
    output_tokens = model([input_test_example, np.array(pad_sequences(target_seq, maxlen=max_sequence_len, padding='post'))], training=False)
    # Sample a token (index 0 is the padding index)
    sampled_token_index = np.argmax(output_tokens[0, i, :])
    if sampled_token_index != 0:
        sampled_word = tokenizer.index_word[sampled_token_index]
    else:
        sampled_word = 'pad'
    decoded_sentence += ' ' + sampled_word
    # Exit condition: stop when we find the stop token
    if sampled_word == '<end>':
        break
    new_value = np.zeros((1, 1))
    new_value[0, 0] = sampled_token_index
    target_seq = np.concatenate([target_seq, new_value], axis=1)
print(decoded_sentence)
A generated example looks like this
<start> salamangreat talker delim ritual monster delim 1000 delim 3000 delim 12 delim fiend <end> pad pad pad pad pad pad pad
It looks like our Transformer may be too small to fully capture the language structure of the Yu-Gi-Oh cards. We may need to increase the model's parameter count and find more training data in order to get better results.
Alternatively — something I will explore in an upcoming article — we could experiment with fine-tuning an already-trained GPT to fill in card fields. This would let us leverage a far more capable Transformer, trained with the GPU resources and data available to large companies.
Summary
- Transformers remove recurrent layers and use positional embeddings to encode the relative positions of tokens
- Transformers add attention mechanisms to allow token generations to attend to different parts of the input
- We need lots of parameters and training data to make an effective Transformer
Conclusion
In this article, we learned some approaches for modeling sequential generation tasks. We started with a basic auto-regressive network to generate content token by token, then explored Seq2Seq models and Transformers to learn the task of filling in masked content — these two models may need some more tuning or additional training data to really start working for us.