Kernel Development

Wafer provides a complete workflow for GPU kernel development—from understanding the baseline, through writing and testing custom kernels, to optimizing performance with AI assistance.

Workflow

1. Discover Baseline

Find out what kernel PyTorch dispatches for your operation.
wafer baseline run "torch.matmul(A, B)" --shape A=1024,1024 --shape B=1024,1024
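
You can also spot the dispatched kernel yourself with `torch.profiler` (a sketch; the exact operator names in the table vary by device, dtype, and PyTorch build):

```python
import torch
from torch.profiler import profile, ProfilerActivity

A = torch.randn(1024, 1024)
B = torch.randn(1024, 1024)

# Profile a single matmul and list the most expensive ops.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    torch.matmul(A, B)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```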
2. Write Your Kernel

Implement a custom kernel in CUDA, Triton, or HIP.
3. Evaluate Correctness

Test your kernel against the reference.
wafer evaluate gpumode --impl ./kernel.py --reference ./reference.py
4. Profile Performance

Measure and analyze kernel behavior.
wafer nvidia ncu analyze ./profile.ncu-rep
5. Optimize

Use AI assistance to improve performance.
wafer agent -t optimize-kernel --args kernel=./kernel.cu "Optimize for H100"
6. Analyze Roofline

Understand performance relative to hardware limits.
wafer roofline --gpu H100 --bytes 1e9 --flops 1e12 --time-ms 0.5
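
The arithmetic behind that command is simple enough to check by hand. A back-of-the-envelope sketch, using the illustrative `--bytes`/`--flops`/`--time-ms` values above; the peak figures are assumptions (approximate H100 SXM specs), not wafer output:

```python
bytes_moved = 1e9      # --bytes
flops = 1e12           # --flops
time_s = 0.5e-3        # --time-ms

achieved_flops = flops / time_s       # achieved throughput, FLOP/s
intensity = flops / bytes_moved       # arithmetic intensity, FLOP/byte

PEAK_BW = 3.35e12                     # HBM3 bandwidth, ~3.35 TB/s (assumed)
PEAK_FLOPS = 989e12                   # BF16 tensor-core peak (assumed)
ridge = PEAK_FLOPS / PEAK_BW          # intensity where the memory-bound and
                                      # compute-bound regimes meet

bound = "compute" if intensity > ridge else "memory"
print(f"{achieved_flops/1e12:.0f} TFLOP/s at {intensity:.0f} FLOP/byte ({bound}-bound)")
```

A kernel left of the ridge point is limited by memory bandwidth; right of it, by compute throughput.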

Tools

Evaluate

Test kernel correctness and benchmark performance.

Baseline

Discover what kernels PyTorch dispatches.

Roofline

Analyze performance against hardware limits.

Corpus

Download GPU documentation for reference.

Kernel Formats

Wafer supports two kernel formats:

GPUMode Format

Simple function-based format:
# reference.py
def ref_kernel(x):
    return x * 2

# kernel.py
import torch
import triton
import triton.language as tl

@triton.jit
def double_kernel(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x * 2, mask=mask)

def custom_kernel(x):
    out = torch.empty_like(x)
    n = x.numel()
    grid = ((n + 1023) // 1024,)  # one program per 1024-element block
    double_kernel[grid](x, out, n, BLOCK=1024)
    return out
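
Before handing the pair to `wafer evaluate`, you can sanity-check correctness locally. A minimal sketch (the `check` helper is illustrative, not part of wafer):

```python
import torch

def check(custom_kernel, ref_kernel, shape=(1 << 20,), device="cuda"):
    """Compare a custom kernel against the reference on random input."""
    x = torch.randn(*shape, device=device)
    torch.testing.assert_close(custom_kernel(x), ref_kernel(x),
                               rtol=1e-3, atol=1e-3)
    return True
```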

KernelBench Format

Class-based format compatible with KernelBench:
# reference.py
import torch

class Model(torch.nn.Module):
    def forward(self, x):
        return torch.softmax(x, dim=-1)

# solution.py
import torch

class ModelNew(torch.nn.Module):
    def forward(self, x):
        return custom_softmax(x)
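
Here `custom_softmax` stands in for your implementation. A numerically stable version in plain PyTorch (a starting-point sketch, before porting to Triton or CUDA) looks like:

```python
import torch

def custom_softmax(x: torch.Tensor) -> torch.Tensor:
    # Subtract the row max first so exp() never overflows.
    m = x.max(dim=-1, keepdim=True).values
    e = (x - m).exp()
    return e / e.sum(dim=-1, keepdim=True)
```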

AI-Assisted Development

Use the Wafer agent throughout your workflow:
# Ask about optimization
wafer agent "How do I reduce bank conflicts in shared memory?"

# Analyze a trace
wafer agent -t trace-analyze --args trace=./profile.ncu-rep "What's the bottleneck?"

# Optimize a kernel
wafer agent -t optimize-kernel --args kernel=./matmul.cu "Optimize for H100"

# Query documentation
wafer agent -c cuda "How do cooperative groups work?"

Quick Example

Complete workflow for optimizing a matrix multiplication:
# 1. Check baseline performance
wafer baseline run "torch.matmul(A, B)" \
  --shape A=4096,4096 \
  --shape B=4096,4096 \
  --hardware H100

# 2. Create template files
wafer evaluate make-template ./matmul

# 3. Edit kernel.py with your implementation
# ...

# 4. Test correctness
wafer evaluate gpumode \
  --impl ./matmul/kernel.py \
  --reference ./matmul/reference.py

# 5. Benchmark
wafer evaluate gpumode \
  --impl ./matmul/kernel.py \
  --reference ./matmul/reference.py \
  --benchmark

# 6. Profile for optimization opportunities
ncu --set full -o matmul python -c "import kernel; kernel.benchmark()"
wafer nvidia ncu analyze ./matmul.ncu-rep

# 7. Get AI optimization suggestions
wafer agent -t optimize-kernel --args kernel=./matmul/kernel.py "Improve occupancy"
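
Step 6 imports a `benchmark()` entry point from kernel.py. Wafer's template may provide one; a minimal hand-rolled version (an assumption, not wafer's actual template) looks like:

```python
import time
import torch

def benchmark(fn, *args, n_iters=100, warmup=5):
    """Average wall-clock time per call, in milliseconds."""
    for _ in range(warmup):       # warm up caches and JIT compilation
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # don't time queued, unfinished GPU work
    t0 = time.perf_counter()
    for _ in range(n_iters):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    ms = (time.perf_counter() - t0) * 1e3 / n_iters
    print(f"{ms:.3f} ms/iter")
    return ms
```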

Remote Execution

Run on cloud GPUs or your own hardware:
# On a workspace
wafer workspaces create --gpu H100 --name kernel-dev
wafer workspaces sync kernel-dev
wafer workspaces exec kernel-dev "wafer evaluate gpumode --impl kernel.py"

# On a configured target
wafer evaluate gpumode --impl kernel.py --target h100-box

Next Steps

Evaluate Kernels

Start testing your kernels.

AI Agent

Get AI assistance.

NVIDIA Profiling

Profile NVIDIA kernels.

AMD Profiling

Profile AMD kernels.