
Kernel Development

Wafer provides a complete workflow for GPU kernel development—from understanding the baseline, through writing and testing custom kernels, to optimizing performance with AI assistance.

Workflow

1. Discover Baseline

Find out which kernel PyTorch dispatches for your operation (a torch.profiler sketch below the workflow shows how to inspect this yourself).
wafer baseline run "torch.matmul(A, B)" --shape A=1024,1024 --shape B=1024,1024
2. Write Your Kernel

Implement a custom kernel in CUDA, Triton, or HIP (see Kernel Formats below for the expected file layout).
3. Evaluate Correctness

Test your kernel against the reference.
wafer evaluate gpumode --impl ./kernel.py --reference ./reference.py
4. Profile Performance

Measure and analyze kernel behavior. The .ncu-rep report comes from NVIDIA Nsight Compute (ncu); step 6 of the Quick Example below shows how to generate one.
wafer nvidia ncu analyze ./profile.ncu-rep
5. Optimize

Use AI assistance to improve performance.
wafer agent -t optimize-kernel --args kernel=./kernel.cu "Optimize for H100"
6. Analyze Roofline

Understand performance relative to hardware limits (the roofline arithmetic is sketched below the workflow).
wafer roofline --gpu H100 --bytes 1e9 --flops 1e12 --time-ms 0.5
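
For step 1, a minimal torch.profiler sketch for inspecting the dispatched kernels yourself (shapes match the baseline command above):

import torch

A = torch.randn(1024, 1024, device="cuda")
B = torch.randn(1024, 1024, device="cuda")

# The names of the CUDA kernels dispatched for this op appear in the table.
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CUDA]
) as prof:
    torch.matmul(A, B)
print(prof.key_averages().table(sort_by="cuda_time_total"))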
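
For step 6, the roofline model is a single formula: attainable throughput is the lesser of peak compute and peak bandwidth times arithmetic intensity. A Python sketch using the flag values above (the peak figures are illustrative placeholders, not official H100 specs):

# Roofline arithmetic for the flags above.
bytes_moved, flops, time_s = 1e9, 1e12, 0.5e-3

intensity = flops / bytes_moved        # arithmetic intensity: 1000 FLOP/byte
achieved = flops / time_s              # achieved throughput, FLOP/s

peak_flops = 989e12                    # placeholder peak compute, FLOP/s
peak_bw = 3.35e12                      # placeholder memory bandwidth, bytes/s

# Attainable performance is capped by whichever roof the kernel sits under.
attainable = min(peak_flops, peak_bw * intensity)
bound = "compute" if peak_bw * intensity >= peak_flops else "memory"
print(f"{intensity:.0f} FLOP/byte, {achieved:.3g} FLOP/s achieved, {bound}-bound")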

Kernel Formats

Wafer supports two kernel formats:

GPUMode Format

Simple function-based format:
# reference.py
def ref_kernel(x):
    return x * 2

# kernel.py
import torch
import triton
import triton.language as tl

@triton.jit
def double_kernel(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x * 2, mask=mask)

def custom_kernel(x):
    out = torch.empty_like(x)
    n = x.numel()
    grid = ((n + 1023) // 1024,)  # one program per 1024-element block
    double_kernel[grid](x, out, n, BLOCK=1024)
    return out
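
Before running wafer evaluate, a quick local check (assuming a CUDA device and both files on the import path) catches shape and masking bugs early:

import torch
from reference import ref_kernel
from kernel import custom_kernel

x = torch.randn(10_000, device="cuda")  # deliberately not a multiple of BLOCK
torch.testing.assert_close(custom_kernel(x), ref_kernel(x))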

KernelBench Format

Class-based format compatible with KernelBench:
# reference.py
import torch

class Model(torch.nn.Module):
    def forward(self, x):
        return torch.softmax(x, dim=-1)

# solution.py
import torch

class ModelNew(torch.nn.Module):
    def forward(self, x):
        return custom_softmax(x)  # your optimized implementation
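
custom_softmax above is a stand-in for your own code. One possible version, a minimal Triton row-wise softmax (a sketch assuming a contiguous 2D input where each row fits in a single block):

import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(x_ptr, out_ptr, n_cols, stride, BLOCK: tl.constexpr):
    row = tl.program_id(0)
    offs = tl.arange(0, BLOCK)
    mask = offs < n_cols
    x = tl.load(x_ptr + row * stride + offs, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)  # subtract the row max for numerical stability
    num = tl.exp(x)
    tl.store(out_ptr + row * stride + offs, num / tl.sum(num, axis=0), mask=mask)

def custom_softmax(x):
    out = torch.empty_like(x)
    n_rows, n_cols = x.shape
    BLOCK = triton.next_power_of_2(n_cols)  # one program handles one full row
    softmax_kernel[(n_rows,)](x, out, n_cols, x.stride(0), BLOCK=BLOCK)
    return out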

AI-Assisted Development

Use the Wafer agent throughout your workflow:
# Ask about optimization
wafer agent "How do I reduce bank conflicts in shared memory?"

# Analyze a trace
wafer agent -t trace-analyze --args trace=./profile.ncu-rep "What's the bottleneck?"

# Optimize a kernel
wafer agent -t optimize-kernel --args kernel=./matmul.cu "Optimize for H100"

# Query documentation
wafer agent -c cuda "How do cooperative groups work?"

Quick Example

Complete workflow for optimizing a matrix multiplication:
# 1. Check baseline performance
wafer baseline run "torch.matmul(A, B)" \
  --shape A=4096,4096 \
  --shape B=4096,4096 \
  --hardware H100

# 2. Create template files
wafer evaluate make-template ./matmul

# 3. Edit kernel.py with your implementation (a starting-point sketch follows below)
# ...

# 4. Test correctness
wafer evaluate gpumode \
  --impl ./matmul/kernel.py \
  --reference ./matmul/reference.py

# 5. Benchmark
wafer evaluate gpumode \
  --impl ./matmul/kernel.py \
  --reference ./matmul/reference.py \
  --benchmark

# 6. Profile for optimization opportunities
ncu --set full -o matmul python -c "import kernel; kernel.benchmark()"
wafer nvidia ncu analyze ./matmul.ncu-rep

# 7. Get AI optimization suggestions
wafer agent -t optimize-kernel --args kernel=./matmul/kernel.py "Improve occupancy"
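
For step 3, kernel.py needs a custom_kernel that stands in for torch.matmul (this assumes the reference is a ref_kernel(A, B) wrapping torch.matmul and float32 inputs). A minimal tiled Triton matmul as a starting point, not a tuned implementation; block sizes are illustrative:

import torch
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                  BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + rm[:, None] * stride_am + rk[None, :] * stride_ak
    b_ptrs = b_ptr + rk[:, None] * stride_bk + rn[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    # March the K dimension in BLOCK_K tiles, accumulating partial products.
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs, mask=(rm[:, None] < M) & (rk[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(rk[:, None] + k < K) & (rn[None, :] < N), other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn
    tl.store(c_ptrs, acc, mask=(rm[:, None] < M) & (rn[None, :] < N))

def custom_kernel(a, b):
    M, K = a.shape
    _, N = b.shape  # assumes a.shape[1] == b.shape[0]
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    matmul_kernel[grid](a, b, c, M, N, K,
                        a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                        c.stride(0), c.stride(1),
                        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
    return c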

Remote Execution

Run on cloud GPUs or your own hardware:
# On a workspace
wafer workspaces create --gpu H100 --name kernel-dev
wafer workspaces sync kernel-dev
wafer workspaces exec kernel-dev "wafer evaluate gpumode --impl kernel.py"

# On a configured target
wafer evaluate gpumode --impl kernel.py --target h100-box

Next Steps