Maximizing Performance: Running .metal Code on Modern GPUs

Matthew MacFarquhar
Sep 19, 2023


Introduction

In the world of computing, the pursuit of speed and performance has always been a driving force. With the ever-increasing complexity of tasks in fields such as scientific computing, graphics rendering, artificial intelligence, and more, traditional central processing units (CPUs) alone may no longer suffice. This is where the Graphics Processing Unit (GPU) enters the scene as a game-changer.

The GPU, once designed primarily for rendering graphics, has evolved into a parallel processing powerhouse that can handle a wide range of computationally intensive tasks. Harnessing the full potential of GPUs requires specialized code that can tap into their immense parallelism and computational capabilities. This is where writing GPU code becomes not just a valuable skill but a necessity in today’s high-performance computing landscape.

In this article, we’ll embark on a journey into the world of GPU programming. We’ll explore what GPU code is, why it’s essential, and how it differs from traditional CPU programming.

GPU vs CPU

So what exactly is the difference between the GPU (Graphics Processing Unit) and the CPU (Central Processing Unit)?

  • CPUs have a few powerful cores optimized for sequential processing. They are designed for general-purpose computing and can handle a wide range of tasks with complex control flows and branching. CPU cores are great at single-threaded logic and have faster clock speeds than GPU cores, but since there are far fewer of them (around 8–12, versus 1,000–2,500+ for a GPU), they do not handle massive parallelism well.
  • GPUs have thousands of smaller, simpler cores optimized for parallel processing. They are designed for data-parallel and compute-intensive tasks, making them highly efficient for work that can be parallelized. However, their per-core clock speeds are slower, and a single GPU core is far less powerful than a CPU core.

A good analogy is to think of a CPU as a car and a GPU as a jet. A car is versatile and can get you lots of places: the grocery store, the gym, school, and so on. The jet, on the other hand, is great at getting you across the country (and a lot faster than a car), but it is probably not the right vehicle for driving down the road to pick up some milk.

How does writing code for the GPU work?

So how can we execute select, highly parallel code on the GPU? There are a few file formats used to run code on the GPU: .cu runs on NVIDIA CUDA GPUs, .cl is written for use on various GPU architectures via the OpenCL specification, and .metal is for Apple's Metal framework (used on iOS and macOS), which combines some functionality from OpenGL and OpenCL. We will be writing code using .metal in this article.

Once we have some code written in one of these formats, our CPU-side code can ask the GPU to run whatever function we have written for it. We do so by first getting the GPU device and compiling our GPU function, then putting data into buffers shared between the CPU and GPU, and finally executing the GPU code and grabbing the result back from the shared buffers.
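At a high level, that host-side flow looks like the following sketch. It is a condensed preview of the steps we will implement in full below; the kernel name "my_kernel" is a placeholder, not a function this article defines:

import Metal

// Condensed shape of a Metal compute dispatch ("my_kernel" is a placeholder).
let device = MTLCreateSystemDefaultDevice()!                          // 1. get the GPU
let kernel = device.makeDefaultLibrary()!.makeFunction(name: "my_kernel")!
let pipeline = try! device.makeComputePipelineState(function: kernel) // 2. compile the function

let input: [Float] = [1, 2, 3, 4]                                     // 3. fill a shared buffer
let buffer = device.makeBuffer(bytes: input,
                               length: MemoryLayout<Float>.stride * input.count,
                               options: .storageModeShared)!

let commandBuffer = device.makeCommandQueue()!.makeCommandBuffer()!   // 4. encode and execute
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(buffer, offset: 0, index: 0)
encoder.dispatchThreads(MTLSize(width: input.count, height: 1, depth: 1),
                        threadsPerThreadgroup: MTLSize(width: 4, height: 1, depth: 1))
encoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
let results = buffer.contents().bindMemory(to: Float.self, capacity: input.count) // 5. read back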

Threads, Blocks and Grids

These are the three most important building blocks for GPU executions.

  1. Thread: A thread represents the smallest unit of work in a GPU kernel. Each thread typically executes the same code but may process different data or follow different paths within the code. Threads are organized into grids and blocks.
  2. Thread Block: A thread block is a group of threads that can cooperate and synchronize with each other. Threads within the same block can communicate and share data through shared memory, making them suitable for collaborative tasks. Block size is defined during kernel launch and is often chosen based on the characteristics of the problem and the hardware. Common block sizes are 128, 256, or 512 threads, but the optimal block size depends on the specific GPU and problem.
  3. Grid: A grid is a collection of thread blocks. Threads within different blocks of the same grid cannot directly communicate or synchronize with each other. The grid size is also specified during kernel launch and defines the number of thread blocks needed to execute the kernel. The organization of threads into blocks and grids allows for efficient parallel execution across the entire GPU device (see the sizing sketch after this list).
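To make these numbers concrete, here is a small sketch of how a grid gets carved into blocks. The specific values (one million threads, 256 threads per block) are illustrative assumptions, not recommendations:

// Illustrative sizing arithmetic; the numbers here are assumptions.
let totalThreads = 1_000_000          // one thread per array element (the grid)
let threadsPerBlock = 256             // a common threadgroup ("block") size

// Round up so the grid has enough blocks to cover every element,
// even when totalThreads is not an exact multiple of the block size.
let blocksPerGrid = (totalThreads + threadsPerBlock - 1) / threadsPerBlock
print(blocksPerGrid)                  // 3907 blocks x 256 threads = 1,000,192 thread slots

Metal's dispatchThreads call, which we use later, handles this rounding for us under the hood; with the explicit dispatchThreadgroups API the rounding is our job.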

Code

Set up Code

import MetalKit

let count: Int = 50000000

let array1: [Float] = getRandomArray()
let array2: [Float] = getRandomArray()

print("REGULAR:")
regular(arr1: array1, arr2: array2)
print("COMPUTE:")
compute(arr1: array1, arr2: array2)

This code will first run our array addition using a basic single-threaded for loop, and then run it again using the GPU compute method to parallelize the computation.

func getRandomArray() -> [Float] {
    var result = [Float].init(repeating: 0.0, count: count)
    for i in 0..<count {
        result[i] = Float(arc4random_uniform(10))
    }
    return result
}

We use this code to generate a large list (of whatever size we set count to) of random numbers between 0 and 9.

Regular For Loop

func regular(arr1: [Float], arr2: [Float]) {
    let startTime = CFAbsoluteTimeGetCurrent()
    var result: [Float] = [Float].init(repeating: 0.0, count: count)

    for i in 0..<count {
        result[i] = arr1[i] + arr2[i]
    }

    let timeElapsed = CFAbsoluteTimeGetCurrent() - startTime
    print("Time Taken: \(String(format: "%.05f", timeElapsed)) seconds")
}

Nothing really special here: we just loop over our two arrays and sum them into one result array.

Time Taken: 44.99706 seconds

GPU Computing

#include <metal_stdlib>
using namespace metal;

kernel void addition_function(constant float *arr1 [[buffer(0)]],
                              constant float *arr2 [[buffer(1)]],
                              device float *resultArr [[buffer(2)]],
                              uint index [[thread_position_in_grid]]) {
    resultArr[index] = arr1[index] + arr2[index];
}

This is our compute.metal code, which will be executed on the GPU. GPU code is indicated by the kernel keyword, and we give our function the name addition_function. It takes in three float array buffers (two for the arrays to add together and one for the result), plus a value for the position of the thread executing this particular invocation, which also gives us the corresponding index into the two arrays we need to add together.
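One detail worth knowing: if we ever launch more threads than there are array elements (as can happen with the alternative dispatch API discussed later), the kernel needs a bounds check to avoid out-of-bounds writes. Here is a sketch of such a variant; the extra length argument at buffer(3) is an assumption of this sketch, not something the article's code passes:

kernel void addition_function_guarded(constant float *arr1 [[buffer(0)]],
                                      constant float *arr2 [[buffer(1)]],
                                      device float *resultArr [[buffer(2)]],
                                      constant uint &length [[buffer(3)]],
                                      uint index [[thread_position_in_grid]]) {
    // Threads beyond the end of the arrays simply do nothing.
    if (index >= length) {
        return;
    }
    resultArr[index] = arr1[index] + arr2[index];
}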

func compute(arr1: [Float], arr2: [Float]) {
    let startTime = CFAbsoluteTimeGetCurrent()

    //1. set up GPU and the metal function
    let gpuDevice = MTLCreateSystemDefaultDevice()
    let commandQueue = gpuDevice?.makeCommandQueue()
    let gpuFunctions = gpuDevice?.makeDefaultLibrary()
    let addFunction = gpuFunctions?.makeFunction(name: "addition_function")

    //2. set up pipeline for the function
    var additionPipelineState: MTLComputePipelineState!
    do {
        additionPipelineState = try gpuDevice?.makeComputePipelineState(function: addFunction!)
    } catch {
        print(error)
    }

    //3. create the shared buffers
    let arr1Buf = gpuDevice?.makeBuffer(bytes: arr1, length: MemoryLayout<Float>.size * count, options: .storageModeShared)
    let arr2Buf = gpuDevice?.makeBuffer(bytes: arr2, length: MemoryLayout<Float>.size * count, options: .storageModeShared)
    let resultBuf = gpuDevice?.makeBuffer(length: MemoryLayout<Float>.size * count, options: .storageModeShared)

    //4. create command encoder
    let commandBuffer = commandQueue?.makeCommandBuffer()
    let commandEncoder = commandBuffer?.makeComputeCommandEncoder()
    commandEncoder?.setComputePipelineState(additionPipelineState)
    commandEncoder?.setBuffer(arr1Buf, offset: 0, index: 0)
    commandEncoder?.setBuffer(arr2Buf, offset: 0, index: 1)
    commandEncoder?.setBuffer(resultBuf, offset: 0, index: 2)

    //5. set up threads
    let threadsPerGrid = MTLSize(width: count, height: 1, depth: 1)
    let maxThreadsPerThreadGroup = additionPipelineState.maxTotalThreadsPerThreadgroup
    let threadsPerThreadGroup = MTLSize(width: maxThreadsPerThreadGroup, height: 1, depth: 1)

    //6. execute
    commandEncoder?.dispatchThreads(threadsPerGrid, threadsPerThreadgroup: threadsPerThreadGroup)
    commandEncoder?.endEncoding()
    commandBuffer?.commit()
    commandBuffer?.waitUntilCompleted()

    //7. get results and print first 3
    // bindMemory's capacity is a number of Float instances, not a byte count
    var resultBufferPointer = resultBuf?.contents().bindMemory(to: Float.self, capacity: count)
    for i in 0..<3 {
        print("\(arr1[i]) + \(arr2[i]) = \(resultBufferPointer!.pointee)")
        resultBufferPointer = resultBufferPointer?.advanced(by: 1)
    }

    let timeElapsed = CFAbsoluteTimeGetCurrent() - startTime
    print("Time Taken: \(String(format: "%.05f", timeElapsed)) seconds")
}

To run this parallel addition, we have a lot more overhead.

The first thing we do is get our default GPU device and look up the addition_function that our .metal code declared.

Second, we set up our GPU pipeline which we will use to execute our addition function.

Next, we put our arrays in the shared buffer so that our GPU code can access the values our CPU created.

Then, we create our command objects, which we will use to tell the GPU to execute our code, and we map the float array buffers we just created to the kernel's argument indices.

We then create our thread grid: we have count threads in our grid, and we let the pipeline tell us how many threads we will have per threadgroup (a.k.a. how many threads per thread block).
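A note on the dispatch call used here: dispatchThreads lets Metal handle a grid whose size is not an exact multiple of the threadgroup size, but it requires hardware support for non-uniform threadgroup sizes. On devices without that support, a common alternative (a sketch reusing the variables from the code above) is to round up to whole threadgroups yourself:

// Alternative for hardware without non-uniform threadgroup support.
let width = min(additionPipelineState.maxTotalThreadsPerThreadgroup, count)
let groupsPerGrid = MTLSize(width: (count + width - 1) / width, height: 1, depth: 1)
// dispatchThreadgroups launches whole threadgroups, so the grid may contain
// a few more threads than array elements; the kernel must then bounds-check
// its index (see the guarded kernel variant sketched earlier).
commandEncoder?.dispatchThreadgroups(groupsPerGrid, threadsPerThreadgroup: MTLSize(width: width, height: 1, depth: 1))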

Now, we use the Command Encoder and Buffer to tell our threads to execute our add function and wait for them all to complete.

Once completed, we grab the result buffer from the shared buffer store; it now holds the result values, so we print the first 3.

0.0 + 4.0 = 4.0
9.0 + 7.0 = 16.0
8.0 + 7.0 = 15.0
Time Taken: 0.58681 seconds

As you can see, this executes a lot faster than our regular iterative for loop, and, thanks to parallelization, the time taken will not scale linearly with count the way the regular for loop does.
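One more note on reading results: if you want the entire result back as a regular Swift array rather than walking the pointer element by element, one option (a sketch reusing resultBuf and count from the code above) is to wrap the buffer's contents in an UnsafeBufferPointer:

// Copy the whole shared buffer back into an ordinary Swift [Float].
let typedPointer = resultBuf!.contents().bindMemory(to: Float.self, capacity: count)
let resultArray = Array(UnsafeBufferPointer(start: typedPointer, count: count))
print(resultArray[0..<3])   // the same first three sums printed above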

There is a caveat to using the GPU for problems like this. Since the GPU takes a decent amount of time to set up, parallel computation is often slower than single-threaded computation for small workloads. For instance, the output below was produced when adding two arrays of size 10,000. In this case, we are better off just doing all our computation on the CPU.

REGULAR:
Time Taken: 0.00905 seconds
COMPUTE:
Time Taken: 0.09852 seconds

Conclusion

In conclusion, GPU programming and parallel execution are very powerful when it comes to running lots and lots of independent operations at the same time. However, the GPU should be treated as a highly specialized, compute-intensive tool: it is not suitable for tasks that are single-threaded in nature (like control flow logic), or when the amount of parallelizable computation is relatively small (like adding two short lists of numbers together).

Written by Matthew MacFarquhar

I am a software engineer working for Amazon, living in SF/NYC.
