
Kernel Development

Wafer provides a complete workflow for GPU kernel development—from understanding the baseline, through writing and testing custom kernels, to optimizing performance with AI assistance.

Workflow

1. Discover Baseline

Find out which kernel PyTorch dispatches for your operation (a torch.profiler sketch below the workflow shows how to inspect this yourself).
wafer baseline run "torch.matmul(A, B)" --shape A=1024,1024 --shape B=1024,1024
2. Write Your Kernel

Implement a custom kernel in CUDA, Triton, or HIP (see Kernel Formats below for the expected file layout).
3. Evaluate Correctness

Test your kernel against the reference.
wafer evaluate gpumode --impl ./kernel.py --reference ./reference.py
4. Profile Performance

Measure and analyze kernel behavior. The .ncu-rep report comes from NVIDIA Nsight Compute (ncu); step 6 of the Quick Example below shows how to generate one.
wafer nvidia ncu analyze ./profile.ncu-rep
5. Optimize

Use AI assistance to improve performance.
wafer agent -t optimize-kernel --args kernel=./kernel.cu "Optimize for H100"
6. Analyze Roofline

Understand performance relative to hardware limits (the roofline arithmetic is sketched below the workflow).
wafer roofline --gpu H100 --bytes 1e9 --flops 1e12 --time-ms 0.5
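
For step 1, a minimal torch.profiler sketch for inspecting the dispatched kernels yourself (shapes match the baseline command above):

import torch

A = torch.randn(1024, 1024, device="cuda")
B = torch.randn(1024, 1024, device="cuda")

# The names of the CUDA kernels dispatched for this op appear in the table.
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CUDA]
) as prof:
    torch.matmul(A, B)
print(prof.key_averages().table(sort_by="cuda_time_total"))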
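
For step 6, the roofline model is a single formula: attainable throughput is the lesser of peak compute and peak bandwidth times arithmetic intensity. A Python sketch using the flag values above (the peak figures are illustrative placeholders, not official H100 specs):

# Roofline arithmetic for the flags above.
bytes_moved, flops, time_s = 1e9, 1e12, 0.5e-3

intensity = flops / bytes_moved        # arithmetic intensity: 1000 FLOP/byte
achieved = flops / time_s              # achieved throughput, FLOP/s

peak_flops = 989e12                    # placeholder peak compute, FLOP/s
peak_bw = 3.35e12                      # placeholder memory bandwidth, bytes/s

# Attainable performance is capped by whichever roof the kernel sits under.
attainable = min(peak_flops, peak_bw * intensity)
bound = "compute" if peak_bw * intensity >= peak_flops else "memory"
print(f"{intensity:.0f} FLOP/byte, {achieved:.3g} FLOP/s achieved, {bound}-bound")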

Kernel Formats

Wafer supports two kernel formats:

GPUMode Format

Simple function-based format:
# reference.py
def ref_kernel(x):
    return x * 2

# kernel.py
import torch
import triton
import triton.language as tl

@triton.jit
def double_kernel(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x * 2, mask=mask)

def custom_kernel(x):
    out = torch.empty_like(x)
    n = x.numel()
    grid = ((n + 1023) // 1024,)  # one program per 1024-element block
    double_kernel[grid](x, out, n, BLOCK=1024)
    return out
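
Before running wafer evaluate, a quick local check (assuming a CUDA device and both files on the import path) catches shape and masking bugs early:

import torch
from reference import ref_kernel
from kernel import custom_kernel

x = torch.randn(10_000, device="cuda")  # deliberately not a multiple of BLOCK
torch.testing.assert_close(custom_kernel(x), ref_kernel(x))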

KernelBench Format

Class-based format compatible with KernelBench:
# reference.py
import torch

class Model(torch.nn.Module):
    def forward(self, x):
        return torch.softmax(x, dim=-1)

# solution.py
import torch

class ModelNew(torch.nn.Module):
    def forward(self, x):
        return custom_softmax(x)  # your optimized implementation
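
custom_softmax above is a stand-in for your own code. One possible version, a minimal Triton row-wise softmax (a sketch assuming a contiguous 2D input where each row fits in a single block):

import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(x_ptr, out_ptr, n_cols, stride, BLOCK: tl.constexpr):
    row = tl.program_id(0)
    offs = tl.arange(0, BLOCK)
    mask = offs < n_cols
    x = tl.load(x_ptr + row * stride + offs, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)  # subtract the row max for numerical stability
    num = tl.exp(x)
    tl.store(out_ptr + row * stride + offs, num / tl.sum(num, axis=0), mask=mask)

def custom_softmax(x):
    out = torch.empty_like(x)
    n_rows, n_cols = x.shape
    BLOCK = triton.next_power_of_2(n_cols)  # one program handles one full row
    softmax_kernel[(n_rows,)](x, out, n_cols, x.stride(0), BLOCK=BLOCK)
    return out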

AI-Assisted Development

Use the Wafer agent throughout your workflow:
# Ask about optimization
wafer agent "How do I reduce bank conflicts in shared memory?"

# Analyze a trace
wafer agent -t trace-analyze --args trace=./profile.ncu-rep "What's the bottleneck?"

# Optimize a kernel
wafer agent -t optimize-kernel --args kernel=./matmul.cu "Optimize for H100"

# Query documentation
wafer agent -c cuda "How do cooperative groups work?"

Quick Example

Complete workflow for optimizing a matrix multiplication:
# 1. Check baseline performance
wafer baseline run "torch.matmul(A, B)" \
  --shape A=4096,4096 \
  --shape B=4096,4096 \
  --hardware H100

# 2. Create template files
wafer evaluate make-template ./matmul

# 3. Edit kernel.py with your implementation (a starting-point sketch follows below)
# ...

# 4. Test correctness
wafer evaluate gpumode \
  --impl ./matmul/kernel.py \
  --reference ./matmul/reference.py

# 5. Benchmark
wafer evaluate gpumode \
  --impl ./matmul/kernel.py \
  --reference ./matmul/reference.py \
  --benchmark

# 6. Profile for optimization opportunities
ncu --set full -o matmul python -c "import kernel; kernel.benchmark()"
wafer nvidia ncu analyze ./matmul.ncu-rep

# 7. Get AI optimization suggestions
wafer agent -t optimize-kernel --args kernel=./matmul/kernel.py "Improve occupancy"
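
For step 3, kernel.py needs a custom_kernel that stands in for torch.matmul (this assumes the reference is a ref_kernel(A, B) wrapping torch.matmul and float32 inputs). A minimal tiled Triton matmul as a starting point, not a tuned implementation; block sizes are illustrative:

import torch
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                  BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + rm[:, None] * stride_am + rk[None, :] * stride_ak
    b_ptrs = b_ptr + rk[:, None] * stride_bk + rn[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    # March the K dimension in BLOCK_K tiles, accumulating partial products.
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs, mask=(rm[:, None] < M) & (rk[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(rk[:, None] + k < K) & (rn[None, :] < N), other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn
    tl.store(c_ptrs, acc, mask=(rm[:, None] < M) & (rn[None, :] < N))

def custom_kernel(a, b):
    M, K = a.shape
    _, N = b.shape  # assumes a.shape[1] == b.shape[0]
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    matmul_kernel[grid](a, b, c, M, N, K,
                        a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                        c.stride(0), c.stride(1),
                        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
    return c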

Remote Execution

Run on cloud GPUs or your own hardware:
# On a workspace
wafer workspaces create --gpu H100 --name kernel-dev
wafer workspaces sync kernel-dev
wafer workspaces exec kernel-dev "wafer evaluate gpumode --impl kernel.py"

# On a configured target
wafer evaluate gpumode --impl kernel.py --target h100-box

Next Steps