Procedural Content Generation with Transformers and Diffusion

Matthew MacFarquhar

Introduction

I have been reading a book on Procedural Content Generation and its integration with ML. This final article in the series covers two pieces of extra content not featured in the book that I thought would make good additions. First, we will expand on the work done in Procedural Content Generation With Sequence Based Neural Network and fine-tune GPT-2 to create Yu-Gi-Oh cards. Then, we will use a diffusion model to generate Pokémon silhouettes (as we did with our GAN in Procedural Content Generation With Grid Based Neural Networks).

The Notebooks used in this article can be found in the below repository.

GPT-2 Fine Tuning

In this first section, we are going to take the pre-trained GPT-2 model (which is specialized in sentence completion) and use LoRA to fine-tune it on our Yu-Gi-Oh card generation task.

Background

GPT is the decoder side of the architecture from the original Transformer paper. Later research showed that the encoder and decoder each work well on their own, creating two branches of transformer-based research.

LoRA (Low-Rank Adaptation) is a technique for fine-tuning large models like GPT-2 by adding trainable low-rank matrices to specific layers, while freezing the original parameters. This approach reduces the number of parameters that need updating, making fine-tuning more efficient in terms of memory and computational resources. LoRA allows for faster adaptation to new tasks without the need for extensive retraining of the entire model.
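To make the idea concrete, here is a minimal LoRA layer sketched in PyTorch. This is illustrative only (the class name and initialization are my own); in the rest of this section the peft library handles all of this for us.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update: W x + (alpha / r) * B A x."""
    def __init__(self, linear: nn.Linear, r=16, alpha=32):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad = False  # Freeze the pre-trained weights
        self.A = nn.Parameter(torch.randn(r, linear.in_features) * 0.01)  # Low-rank factor A
        self.B = nn.Parameter(torch.zeros(linear.out_features, r))  # B starts at zero, so we begin from the pre-trained behavior
        self.scale = alpha / r

    def forward(self, x):
        return self.linear(x) + self.scale * (x @ self.A.T @ self.B.T)

Because only A and B are trainable, the number of updated parameters scales with the rank r rather than with the full weight matrix.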

Preparing Data

We will need to load in and transform our Yu-Gi-Oh data into a sequence of words to feed into GPT-2 and generate input and target entries.

import pandas as pd
import numpy as np

df = pd.read_csv("cards.csv", usecols=['name', 'type', 'atk', 'def', 'level', 'race'])
df['atk'] = df['atk'].fillna(0)
df['def'] = df['def'].fillna(0)
df['level'] = df['level'].fillna(0)

MASK_TOKEN = "<UNK>"

df['text'] = df.apply(lambda row: f"{row['name']} | {row['type']} | {int(row['atk'])} | {int(row['def'])} | {int(row['level'])} | {row['race']}", axis=1)

print(len(df['text']))
input_entries = df['text']
target_entries = input_entries.copy()
print(len(input_entries))

We will then mask our input entries.

masking_prob = 0.2
masked_entries = []
for entry in input_entries:
    tokens = entry.split(' ')
    masked_entry = []
    for token in tokens:
        # Replace each token with the mask token with probability masking_prob
        if np.random.rand() < masking_prob:
            masked_entry.append(MASK_TOKEN)
        else:
            masked_entry.append(token)
    masked_entries.append(" ".join(masked_entry))

masked_entries = pd.Series(masked_entries)
print(masked_entries[:10])

Since GPT-2 was trained on sentence completion, we will need to massage our masked_entry -> target_entry task to look like sentence completion.

X = np.array([masked_entries[i] + " -> " + target_entries[i] + "<END>" for i in range(len(masked_entries))])
print(X.shape)
print(X[0])

Now, our model will hopefully learn that when it sees the ‘->’, it should try to generate a target entry out of the masked input entry.

Loading the Pre-trained Model

We will now load GPT-2 and its pre-trained weights (which we freeze), along with its tokenizer, which we will use to tokenize our data. We will initialize our LoRA config to generate low-rank adaptation matrices of rank 16 and apply it to our loaded GPT-2 model.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    device_map='auto',
).to(device)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Freeze the pre-trained weights
for param in model.parameters():
    param.requires_grad = False

# LoRA
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM")

model = get_peft_model(model, config).to(device)
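As a quick sanity check (not from the original notebook), peft can report how few parameters are actually trainable after wrapping the model:

model.print_trainable_parameters()
# Prints the trainable vs. total parameter counts; with r=16 the LoRA
# parameters should come out to well under 1% of GPT-2's weights.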

Training

To train, we first need to set up a custom Dataset class for our data.

from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    def __init__(self, X, tokenizer, max_length=128):
        """
        Args:
        - X: A list of input sequences (strings) for next-token prediction.
        - tokenizer: The tokenizer for processing the text.
        - max_length: The maximum length of the tokenized input/output.
        """
        self.X = X
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        # Get the input sequence
        x = self.X[idx]

        # Tokenize the input sequence
        tokenized = self.tokenizer(x.strip(), truncation=True, padding="max_length", max_length=self.max_length, return_tensors="pt")

        input_ids = tokenized['input_ids'].squeeze()  # Remove batch dimension
        attention_mask = tokenized['attention_mask'].squeeze()

        # Labels for next-token prediction match input_ids; the model shifts them internally
        labels = input_ids.clone()
        labels[labels == self.tokenizer.pad_token_id] = -100  # Ignore padding in the loss computation

        return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

We take the X data we created earlier (masked_entry -> target_entry) and run it through GPT-2’s tokenizer to extract the input_ids, the attention mask, and the next-token prediction labels.

Now, we can create a data loader using our new TextDataset.

dataset = TextDataset(X, tokenizer)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
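Before kicking off training, a quick sanity check (my addition, not from the notebook) confirms each dataset item has the fixed max_length shape we expect:

sample = dataset[0]
print(sample['input_ids'].shape)       # torch.Size([128])
print(sample['attention_mask'].shape)  # torch.Size([128])
print(sample['labels'].shape)          # torch.Size([128])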

And we can train our model.

from tqdm import tqdm

model.train()
epochs = 3
for epoch in range(epochs):
    loop = tqdm(dataloader, leave=True)
    for batch in loop:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # Forward pass
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss

        # Backward pass and optimization
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # Update progress bar
        loop.set_description(f'Epoch {epoch+1}')
        loop.set_postfix(loss=loss.item())

Generating

With training complete, we can test our model.

model.eval()

input_text = "Dark <UNK> | <UNK> | 3600 | <UNK> | <UNK> | fiend ->"
inputs = tokenizer(input_text, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(
        inputs["input_ids"],
        max_length=100,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Text: ", generated_text)

Below is an example generation:

Generated Text: Dark <UNK> | <UNK> | 3600 | <UNK> | <UNK> | fiend -> Dark Angel | Effect Monster | 3600 | 2400 | 6 | Fiend<END> -> Dark Angel | Effect Monster | 3600 | 2400 | 6 | Fiend<END><END><END><END><END><END><END><END><END><END><END><END><END>

It couldn’t really figure out when to stop, but the first card generated after our initial masked input looks quite good! The model was able to pull in context and attend to different parts of the masked input pretty well. For example, when I changed the type from fiend to dragon, we got a dark dragon instead of an angel as the name.

Summary

  • Fine-tuning a GPT-2 model that was pre-trained on English text worked a lot better than our previous attempt at building our own transformer from scratch
  • LoRA lets us fine-tune much faster by reducing the number of parameters we train for our task relative to the total number of parameters in the model

Diffusion

In this section, we are going to generate silhouettes of Pokémon. Unlike in our previous article, we will not be using a GAN; instead, we will use a diffusion-based model.

Background

A lot of what we use in this part of the tutorial is based on an earlier article I wrote about diffusion models, so we will gloss over the in-depth details of the model here.

Diffusion in 2D image generation involves gradually adding noise to an image until it becomes random, then training a model to reverse this process by denoising step-by-step to recover the original image. This approach allows the model to generate new images by learning how to iteratively refine random noise into coherent visuals. This denoising task is significantly easier than the GAN’s job of turning noise directly into a coherent image; for this reason, diffusion models have performed much better than GANs in industry and are the new state of the art.
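Concretely, the forward (noising) process has a closed form. This is the standard DDPM formulation, which the scheduler we build below implements: given a clean image $x_0$ and a noise schedule $\beta_t$,

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s)$$

The model’s job is then to predict $\epsilon$ from $x_t$ and $t$.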

Loading Data

Since we are using a custom dataset, we will need to create a custom ImageDataset class.

import os
from PIL import Image
from torch.utils.data import Dataset

class ImageDataset(Dataset):
    def __init__(self, image_folder, transform=None):
        self.image_folder = image_folder
        self.transform = transform
        self.image_paths = [os.path.join(image_folder, f) for f in os.listdir(image_folder) if f.endswith(('.png', '.jpg', '.jpeg'))]

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img_path = self.image_paths[idx]
        image = Image.open(img_path).convert('RGB')  # Standardize to RGB; we convert to grayscale later via transforms
        if self.transform:
            image = self.transform(image)
        return image

In practice, this is quite simple: all we need to do is load all the images in a given directory and perform a given transformation on them.
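As a quick illustration (assuming the same pokemon/ folder we train on later), the dataset can be exercised like this:

from torchvision import transforms

trfms = transforms.Compose([
    transforms.Resize((28, 28)),
    transforms.Grayscale(num_output_channels=1),
    transforms.ToTensor(),
])
dataset = ImageDataset(image_folder="pokemon/", transform=trfms)
print(len(dataset))      # Number of images found in the folder
print(dataset[0].shape)  # torch.Size([1, 28, 28])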

Building The Model

The model architecture we will build is a U-Net with attention modules in the down, mid, and up blocks. We also inject a time embedding into the model so it knows how far along the denoising process the input is.

import torch
import torch.nn as nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'

def get_time_embedding(time_steps, t_emb_dim):
    # Sinusoidal embedding, as in the original Transformer positional encoding
    factor = 10000 ** (torch.arange(start=0, end=t_emb_dim // 2, device=time_steps.device) / (t_emb_dim // 2))

    t_emb = time_steps[:, None].repeat(1, t_emb_dim // 2) / factor
    return torch.cat([torch.sin(t_emb), torch.cos(t_emb)], dim=-1)


class DownBlock(nn.Module):
    def __init__(self, in_channels, out_channels, t_emb_dim, down_sample, num_heads):
        super().__init__()
        self.down_sample = down_sample

        self.resnet_conv_first = nn.Sequential(
            nn.GroupNorm(8, in_channels),
            nn.SiLU(),
            nn.Conv2d(in_channels, out_channels, kernel_size=(3, 3), stride=(1, 1), padding=1)
        )

        self.t_emb_layer = nn.Sequential(
            nn.SiLU(),
            nn.Linear(t_emb_dim, out_channels)
        )

        self.resnet_conv_second = nn.Sequential(
            nn.GroupNorm(8, out_channels),
            nn.SiLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=(3, 3), stride=(1, 1), padding=1)
        )

        self.attention_norm = nn.GroupNorm(8, out_channels)
        self.attention = nn.MultiheadAttention(out_channels, num_heads, batch_first=True)
        self.residual_input_conv = nn.Conv2d(in_channels, out_channels, kernel_size=(1, 1))
        self.down_sample_conv = nn.Conv2d(out_channels, out_channels, kernel_size=(4, 4), stride=(2, 2), padding=1) if self.down_sample else nn.Identity()

    def forward(self, x, t_emb):
        out = x

        # Resnet block
        resnet_input = out
        out = self.resnet_conv_first(out)
        out = out + self.t_emb_layer(t_emb)[:, :, None, None]
        out = self.resnet_conv_second(out)
        out = out + self.residual_input_conv(resnet_input)

        # Attention block
        batch_size, channels, h, w = out.shape
        in_attn = out.reshape(batch_size, channels, h*w)
        in_attn = self.attention_norm(in_attn)
        in_attn = in_attn.transpose(1, 2)
        out_attn, _ = self.attention(in_attn, in_attn, in_attn)
        out_attn = out_attn.transpose(1, 2).reshape(batch_size, channels, h, w)
        out = out + out_attn

        out = self.down_sample_conv(out)
        return out


class MidBlock(nn.Module):
    def __init__(self, in_channels, out_channels, t_emb_dim, num_heads):
        super().__init__()

        self.resnet_conv_first = nn.ModuleList([
            nn.Sequential(
                nn.GroupNorm(8, in_channels),  # 8 groups; all channel counts in this model are divisible by 8
                nn.SiLU(),
                nn.Conv2d(in_channels, out_channels, kernel_size=(3, 3), stride=(1, 1), padding=1)
            ),
            nn.Sequential(
                nn.GroupNorm(8, out_channels),
                nn.SiLU(),
                nn.Conv2d(out_channels, out_channels, kernel_size=(3, 3), stride=(1, 1), padding=1)
            )
        ])

        self.t_emb_layer = nn.ModuleList([
            nn.Sequential(
                nn.SiLU(),
                nn.Linear(t_emb_dim, out_channels)
            ),
            nn.Sequential(
                nn.SiLU(),
                nn.Linear(t_emb_dim, out_channels)
            )
        ])

        self.resnet_conv_second = nn.ModuleList([
            nn.Sequential(
                nn.GroupNorm(8, out_channels),
                nn.SiLU(),
                nn.Conv2d(out_channels, out_channels, kernel_size=(3, 3), stride=(1, 1), padding=1)
            ),
            nn.Sequential(
                nn.GroupNorm(8, out_channels),
                nn.SiLU(),
                nn.Conv2d(out_channels, out_channels, kernel_size=(3, 3), stride=(1, 1), padding=1)
            )
        ])

        self.attention_norm = nn.GroupNorm(8, out_channels)
        self.attention = nn.MultiheadAttention(out_channels, num_heads, batch_first=True)
        self.residual_input_conv = nn.ModuleList([
            nn.Conv2d(in_channels, out_channels, kernel_size=(1, 1)),
            nn.Conv2d(out_channels, out_channels, kernel_size=(1, 1))
        ])

    def forward(self, x, t_emb):
        out = x

        # First Resnet block
        resnet_input = out
        out = self.resnet_conv_first[0](out)
        out = out + self.t_emb_layer[0](t_emb)[:, :, None, None]
        out = self.resnet_conv_second[0](out)
        out = out + self.residual_input_conv[0](resnet_input)

        # Attention block
        batch_size, channels, h, w = out.shape
        in_attn = out.reshape(batch_size, channels, h*w)
        in_attn = self.attention_norm(in_attn)
        in_attn = in_attn.transpose(1, 2)
        out_attn, _ = self.attention(in_attn, in_attn, in_attn)
        out_attn = out_attn.transpose(1, 2).reshape(batch_size, channels, h, w)
        out = out + out_attn

        # Second Resnet block
        resnet_input = out
        out = self.resnet_conv_first[1](out)
        out = out + self.t_emb_layer[1](t_emb)[:, :, None, None]
        out = self.resnet_conv_second[1](out)
        out = out + self.residual_input_conv[1](resnet_input)

        return out


class UpBlock(nn.Module):
    def __init__(self, in_channels, out_channels, t_emb_dim, up_sample, num_heads):
        super().__init__()
        self.up_sample = up_sample

        self.resnet_conv_first = nn.Sequential(
            nn.GroupNorm(8, in_channels),
            nn.SiLU(),
            nn.Conv2d(in_channels, out_channels, kernel_size=(3, 3), stride=(1, 1), padding=1)
        )

        self.t_emb_layer = nn.Sequential(
            nn.SiLU(),
            nn.Linear(t_emb_dim, out_channels)
        )

        self.resnet_conv_second = nn.Sequential(
            nn.GroupNorm(8, out_channels),
            nn.SiLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=(3, 3), stride=(1, 1), padding=1)
        )

        self.attention_norm = nn.GroupNorm(8, out_channels)
        self.attention = nn.MultiheadAttention(out_channels, num_heads, batch_first=True)
        self.residual_input_conv = nn.Conv2d(in_channels, out_channels, kernel_size=(1, 1))
        # in_channels already accounts for the skip connection, so the transpose conv operates on in_channels // 2
        self.up_sample_conv = nn.ConvTranspose2d(in_channels // 2, in_channels // 2, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1)) if self.up_sample else nn.Identity()

    def forward(self, x, out_down, t_emb):
        x = self.up_sample_conv(x)
        x = torch.cat([x, out_down], dim=1)  # Concatenate the skip connection from the down path
        out = x

        # Resnet block
        resnet_input = out
        out = self.resnet_conv_first(out)
        out = out + self.t_emb_layer(t_emb)[:, :, None, None]
        out = self.resnet_conv_second(out)
        out = out + self.residual_input_conv(resnet_input)

        # Attention block
        batch_size, channels, h, w = out.shape
        in_attn = out.reshape(batch_size, channels, h*w)
        in_attn = self.attention_norm(in_attn)
        in_attn = in_attn.transpose(1, 2)
        out_attn, _ = self.attention(in_attn, in_attn, in_attn)
        out_attn = out_attn.transpose(1, 2).reshape(batch_size, channels, h, w)
        out = out + out_attn

        return out


class Unet(nn.Module):
    def __init__(self, im_channels):
        super().__init__()
        self.down_channels = [32, 64, 128, 256]
        self.mid_channels = [256, 256, 128]
        self.up_channels = [128, 64, 32, 16]
        self.t_emb_dim = 128
        self.down_sample = [True, True, False]
        self.up_sample = [False, True, True]

        self.t_proj = nn.Sequential(
            nn.Linear(self.t_emb_dim, self.t_emb_dim),
            nn.SiLU(),
            nn.Linear(self.t_emb_dim, self.t_emb_dim),
        )
        self.conv_in = nn.Conv2d(im_channels, self.down_channels[0], kernel_size=(3, 3), padding=1)

        self.downs = nn.ModuleList([])
        for i in range(len(self.down_channels) - 1):
            self.downs.append(DownBlock(self.down_channels[i], self.down_channels[i+1], self.t_emb_dim, down_sample=self.down_sample[i], num_heads=4))

        self.mids = nn.ModuleList([])
        for i in range(len(self.mid_channels) - 1):
            self.mids.append(MidBlock(self.mid_channels[i], self.mid_channels[i+1], self.t_emb_dim, num_heads=4))

        self.ups = nn.ModuleList([])
        for i in range(len(self.up_channels) - 1):
            # Input channels are doubled to account for the skip connection concatenation
            self.ups.append(UpBlock(self.up_channels[i] * 2, self.up_channels[i+1], self.t_emb_dim, up_sample=self.up_sample[i], num_heads=4))

        self.norm_out = nn.GroupNorm(8, 16)
        self.conv_out = nn.Conv2d(16, im_channels, kernel_size=(3, 3), padding=1)

    def forward(self, x, t):
        out = self.conv_in(x)
        t_emb = get_time_embedding(t, self.t_emb_dim)
        t_emb = self.t_proj(t_emb)

        down_outs = []
        for down in self.downs:
            down_outs.append(out)  # Save skip connections before each down block
            out = down(out, t_emb)

        for mid in self.mids:
            out = mid(out, t_emb)

        for up in self.ups:
            down_out = down_outs.pop()
            out = up(out, down_out, t_emb)

        out = self.norm_out(out)
        out = nn.SiLU()(out)
        out = self.conv_out(out)
        return out

Our model takes in a noisy image, embeds the timestep of the denoising process, and passes the image through our down, mid, and up blocks. The output is a tensor of the same shape as the input image, representing the noise that should be removed to step back toward the original un-noised image.
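A quick shape check (my addition, not from the notebook) confirms the output matches the input dimensions:

unet = Unet(im_channels=1)
x = torch.randn(2, 1, 28, 28)      # A batch of 2 noisy grayscale 28x28 images
t = torch.randint(0, 1000, (2,))   # A random timestep for each image
print(unet(x, t).shape)            # torch.Size([2, 1, 28, 28])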

Training

During training, we need to inject noise into our images so the model can learn to denoise them, and during sampling we need to remove the predicted noise to step back to the previous timestep (i.e. the image with less noise). We will create a noise scheduler to help with both tasks.

class LinearNoiseScheduler:
    def __init__(self, num_timesteps, beta_start, beta_end):
        self.num_timesteps = num_timesteps
        self.beta_start = beta_start
        self.beta_end = beta_end

        # Keep all schedule tensors on the same device as the model inputs
        self.betas = torch.linspace(beta_start, beta_end, num_timesteps).to(device)
        self.alphas = 1. - self.betas
        self.alpha_cum_prod = torch.cumprod(self.alphas, dim=0)
        self.sqrt_alpha_cum_prod = torch.sqrt(self.alpha_cum_prod)
        self.sqrt_one_minus_alpha_cum_prod = torch.sqrt(1. - self.alpha_cum_prod)

    def add_noise(self, original, noise, t):
        original_shape = original.shape
        batch_size = original_shape[0]

        sqrt_alpha_cum_prod = self.sqrt_alpha_cum_prod[t].reshape(batch_size)
        sqrt_one_minus_alpha_cum_prod = self.sqrt_one_minus_alpha_cum_prod[t].reshape(batch_size)

        # Broadcast the per-sample scalars across the channel and spatial dimensions
        for _ in range(len(original_shape)-1):
            sqrt_alpha_cum_prod = sqrt_alpha_cum_prod.unsqueeze(-1)
            sqrt_one_minus_alpha_cum_prod = sqrt_one_minus_alpha_cum_prod.unsqueeze(-1)

        return sqrt_alpha_cum_prod * original + sqrt_one_minus_alpha_cum_prod * noise

    def sample_prev_timestep(self, xt, noise_pred, t):
        # Estimate the fully denoised image x0 from the current sample and predicted noise
        x0 = (xt - self.sqrt_one_minus_alpha_cum_prod[t] * noise_pred) / self.sqrt_alpha_cum_prod[t]
        x0 = torch.clamp(x0, -1., 1.)

        mean = xt - ((self.betas[t] * noise_pred) / self.sqrt_one_minus_alpha_cum_prod[t])
        mean = mean / torch.sqrt(self.alphas[t])

        if t == 0:
            return mean, x0
        else:
            variance = (1 - self.alpha_cum_prod[t-1]) / (1 - self.alpha_cum_prod[t])
            variance = variance * self.betas[t]
            sigma = variance ** 0.5
            z = torch.randn(xt.shape).to(xt.device)
            return mean + sigma * z, x0

With this setup, we can now train our diffusion model.

import numpy as np
from torch.optim import Adam
from torch.utils.data import DataLoader
from torchvision import transforms
from tqdm import tqdm

def train(args):
    diffusion_config = args['diffusion_params']
    model_config = args['model_params']
    train_config = args['train_params']

    scheduler = LinearNoiseScheduler(num_timesteps=diffusion_config['num_timesteps'], beta_start=diffusion_config['beta_start'], beta_end=diffusion_config['beta_end'])

    trfms = transforms.Compose([
        transforms.Resize((28, 28)),
        transforms.Grayscale(num_output_channels=1),
        transforms.ToTensor(),
        transforms.Normalize((0.5,), (0.5,))  # Map pixel values from [0, 1] to [-1, 1]
    ])

    dataset = ImageDataset(image_folder="pokemon/", transform=trfms)
    data_loader = DataLoader(dataset, batch_size=train_config['batch_size'], shuffle=True)

    model = Unet(model_config['im_channels']).to(device)
    model.train()

    num_epochs = train_config['num_epochs']
    optimizer = Adam(model.parameters(), lr=train_config['lr'])
    criterion = torch.nn.MSELoss()

    for epoch_idx in range(num_epochs):
        losses = []
        for im in tqdm(data_loader):
            optimizer.zero_grad()
            im = im.float().to(device)

            # Sample random noise and a random timestep for each image in the batch
            noise = torch.randn_like(im).to(device)
            t = torch.randint(0, diffusion_config['num_timesteps'], (im.shape[0],)).to(device)

            # Noise the images to timestep t, then have the model predict that noise
            noisy_im = scheduler.add_noise(im, noise, t)
            noise_pred = model(noisy_im, t)

            loss = criterion(noise_pred, noise)
            losses.append(loss.item())
            loss.backward()
            optimizer.step()

        print('Finished epoch:{} | Loss: {:.4f}'.format(epoch_idx + 1, np.mean(losses)))
    return model

model = train({
    "diffusion_params": {
        "num_timesteps": 1000,
        "beta_start": 0.00001,
        "beta_end": 0.01
    },
    "model_params": {
        "im_channels": 1,
        "im_size": 28
    },
    "train_params": {
        "batch_size": 32,
        "num_epochs": 150,
        "num_samples": 10,
        "num_grid_rows": 10,
        "lr": 0.0001
    }
})

Our training process first transforms our input RGB images into 28x28 grayscale images, which we then normalize to the range [-1, 1]. Then, we instantiate our model and start the training loop.

For each image we train on, we first add some noise based on a randomly chosen timestep t. Then, we pass the noisy image to our model and let it predict the noise to be removed. Finally, we compute the loss between the predicted noise and the actual noise we added.
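Written out, this MSE objective is the standard DDPM noise-prediction loss, where $\epsilon_\theta$ is our U-Net:

$$\mathcal{L} = \mathbb{E}_{x_0,\,\epsilon,\,t}\Big[\big\lVert \epsilon - \epsilon_\theta(x_t, t)\big\rVert^2\Big], \qquad x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$$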

Generating

Now that we have trained our diffusion model to predict and remove noise, we can sample it to generate new images.

import torchvision
from torchvision.utils import make_grid

def sample(model, scheduler, train_config, model_config, diffusion_config):
    r"""
    Sample stepwise by going backward one timestep at a time.
    We save the x0 predictions along the way.
    """
    xt = torch.randn((train_config['num_samples'],
                      model_config['im_channels'],
                      model_config['im_size'],
                      model_config['im_size'])).to(device)
    for i in tqdm(reversed(range(diffusion_config['num_timesteps']))):
        # Get prediction of noise
        noise_pred = model(xt, torch.as_tensor(i).unsqueeze(0).to(device))

        # Use scheduler to get x0 and xt-1
        xt, _ = scheduler.sample_prev_timestep(xt, noise_pred, torch.as_tensor(i).to(device))

        # Rescale from [-1, 1] to [0, 1] and save a grid of the samples every 100 steps
        ims = torch.clamp(xt, -1., 1.).detach().cpu()
        ims = (ims + 1) / 2
        grid = make_grid(ims, nrow=train_config['num_grid_rows'])
        img = torchvision.transforms.ToPILImage()(grid)
        if not os.path.exists(os.path.join('out')):
            os.mkdir(os.path.join('out'))
        if i % 100 == 0:
            img.save(os.path.join('out', 'x0_{}.png'.format(i)))
        img.close()


def infer(args, model):
    diffusion_config = args['diffusion_params']
    model_config = args['model_params']
    train_config = args['train_params']

    model.eval()

    scheduler = LinearNoiseScheduler(num_timesteps=diffusion_config['num_timesteps'], beta_start=diffusion_config['beta_start'], beta_end=diffusion_config['beta_end'])

    with torch.no_grad():
        sample(model, scheduler, train_config, model_config, diffusion_config)

infer({
    "diffusion_params": {
        "num_timesteps": 1000,
        "beta_start": 0.00001,
        "beta_end": 0.01
    },
    "model_params": {
        "im_channels": 1,
        "im_size": 28
    },
    "train_params": {
        "batch_size": 32,
        "num_epochs": 150,
        "num_samples": 10,
        "num_grid_rows": 10,
        "lr": 0.0001
    }
}, model)

To draw a sample, we create random noise in the shape of our target images. Then, we predict the noise to remove and move one timestep backward in the process. We continue until we reach timestep 0, at which point all the noise has been removed.
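Each backward step applies the standard DDPM update that sample_prev_timestep implements (with the noise term dropped at $t = 0$):

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z, \qquad \sigma_t^2 = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t, \qquad z \sim \mathcal{N}(0, I)$$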

Below is an example progression of the diffusion sampling. Some of the results started to look a little like blurry Pokémon; others were not so great.

Summary

  • Diffusion models learn the task of removing noise from an image, which is much simpler than a GAN’s job of synthesizing an image from noise.
  • Diffusion synthesizes images backwards through time, starting with pure noise and gradually removing it until a clean image remains.

Conclusion

Diffusion models and Transformers are two examples of how newer models from industry can be applied to the PCGML techniques we have already learned. As the ML industry grows, so too does the ability of PCG pipelines to leverage that growth for their own endeavors.
