Kernel Evaluation

The wafer evaluate command tests your GPU kernel implementations for correctness and optionally benchmarks their performance. It supports two kernel formats: GPUMode and KernelBench.

Quick Start

# Evaluate a GPUMode kernel
wafer evaluate gpumode --impl ./my_kernel.py --reference ./reference.py

# Evaluate a KernelBench kernel
wafer evaluate kernelbench --impl ./solution.py --reference ./problem.py

Formats

GPUMode Format

In the GPUMode format, the reference file defines a ref_kernel function and your implementation defines a matching custom_kernel:
# reference.py
def ref_kernel(input_tensor):
    # Ground truth: elementwise doubling.
    return input_tensor * 2

# my_kernel.py
import torch
import triton
import triton.language as tl

@triton.jit
def kernel(input_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the input.
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final block
    x = tl.load(input_ptr + offsets, mask=mask)
    tl.store(output_ptr + offsets, x * 2, mask=mask)

def custom_kernel(input_tensor):
    output = torch.empty_like(input_tensor)
    n_elements = input_tensor.numel()
    # Launch one program per BLOCK_SIZE elements, rounded up.
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    kernel[grid](input_tensor, output, n_elements, BLOCK_SIZE=1024)
    return output
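
For intuition, you can reproduce the correctness check by hand: run both functions on the same input and compare outputs. This is a minimal sketch of the idea, not wafer's actual harness; the torch.allclose call and tolerance values are assumptions:
# check.py (illustrative only)
import torch
from reference import ref_kernel
from my_kernel import custom_kernel

x = torch.randn(1 << 20, device="cuda", dtype=torch.float16)
expected = ref_kernel(x)
actual = custom_kernel(x)
# Elementwise comparison within floating-point tolerances (assumed values).
assert torch.allclose(actual, expected, rtol=1e-3, atol=1e-3)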

KernelBench Format

KernelBench uses a ModelNew class that replaces Model:
# problem.py (reference)
import torch

class Model(torch.nn.Module):
    def forward(self, x):
        return torch.softmax(x, dim=-1)

# solution.py (your implementation)
import torch

class ModelNew(torch.nn.Module):
    def forward(self, x):
        # Your optimized implementation (see the sketch below)
        return custom_softmax(x)
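
The custom_softmax call above is left undefined; it stands in for your optimized kernel. As a rough placeholder, a numerically stable softmax over the last dimension could look like this in plain PyTorch (a sketch only; a competitive submission would typically fuse this into a Triton or CUDA kernel):
import torch

def custom_softmax(x: torch.Tensor) -> torch.Tensor:
    # Subtract the row max before exponentiating to avoid overflow.
    shifted = x - x.max(dim=-1, keepdim=True).values
    exp = shifted.exp()
    return exp / exp.sum(dim=-1, keepdim=True)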

Commands

wafer evaluate gpumode

wafer evaluate gpumode [OPTIONS]
Options:

  Option                   Short  Description
  --impl                   -i     Path to implementation kernel file
  --reference                     Path to reference kernel file
  --test-cases                    Path to test cases JSON file
  --target                 -t     GPU target name
  --benchmark                     Run performance benchmarks
  --profile                       Enable profiling
  --defense/--no-defense          Run reward hack defense checks (default: enabled)
  --gpu-id                        Override GPU ID

Example with benchmarking:
wafer evaluate gpumode \
  --impl ./my_kernel.py \
  --reference ./reference.py \
  --benchmark \
  --target local

wafer evaluate kernelbench

wafer evaluate kernelbench [OPTIONS]
Options: same as for gpumode, plus format-specific options.

Example:
wafer evaluate kernelbench \
  --impl ./solution.py \
  --reference ./23_Softmax.py \
  --benchmark

wafer evaluate make-template

Generate template files to get started:
# Create templates in current directory
wafer evaluate make-template

# Create templates in specific directory
wafer evaluate make-template ./my-kernel --force

Test Cases

Provide custom test cases via JSON:
{
  "test_cases": [
    {
      "name": "small",
      "inputs": {"x": {"shape": [32, 32], "dtype": "float16"}},
      "rtol": 1e-3,
      "atol": 1e-3
    },
    {
      "name": "large",
      "inputs": {"x": {"shape": [4096, 4096], "dtype": "float16"}},
      "rtol": 1e-3,
      "atol": 1e-3
    }
  ]
}
wafer evaluate gpumode --test-cases ./cases.json --impl ./kernel.py
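
Each case names its input tensors with a shape and dtype and sets per-case tolerances. The exact comparison wafer applies isn't documented here, but torch.allclose-style semantics are a reasonable mental model: a case passes when |impl - ref| <= atol + rtol * |ref| holds elementwise. As a sketch, a spec like the one above could be materialized into tensors roughly like this (hypothetical helper, not wafer's API):
# materialize.py (illustrative only)
import json
import torch

def materialize(spec):
    # Map {"shape": [...], "dtype": "float16"} to a random CUDA tensor.
    dtype = getattr(torch, spec["dtype"])
    return torch.randn(*spec["shape"], dtype=dtype, device="cuda")

with open("cases.json") as f:
    cases = json.load(f)["test_cases"]

for case in cases:
    inputs = {name: materialize(spec) for name, spec in case["inputs"].items()}
    # ... run both kernels on `inputs`, comparing with case["rtol"] / case["atol"]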

Defense Checks

By default, evaluation includes defense checks to detect potential reward hacking:
  • Verifies implementation doesn’t just copy the reference
  • Checks for meaningful computation
  • Validates output shapes and dtypes (sketched below)
Disable with --no-defense if needed for debugging.
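
Of these, the shape and dtype check is the easiest to picture: whatever the implementation computes, its outputs must structurally match the reference's before any numeric comparison. A conceptual sketch, not wafer's actual code:
import torch

def validate_output(actual: torch.Tensor, expected: torch.Tensor) -> None:
    # Fail fast on structural mismatches before comparing values.
    if actual.shape != expected.shape:
        raise ValueError(f"shape mismatch: {actual.shape} vs {expected.shape}")
    if actual.dtype != expected.dtype:
        raise ValueError(f"dtype mismatch: {actual.dtype} vs {expected.dtype}")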

Remote Evaluation

Run on remote GPUs using targets:
# List available targets
wafer config targets list

# Evaluate on a specific target
wafer evaluate gpumode --impl ./kernel.py --target my-h100
See Targets for setting up remote GPUs.

Output

Successful evaluation shows:
✓ Correctness: PASSED (all test cases)
  - small: max_diff=1.2e-5, rtol=1e-3 ✓
  - large: max_diff=3.4e-4, rtol=1e-3 ✓

Performance (--benchmark):
  - Implementation: 0.45ms (±0.02ms)
  - Reference: 1.23ms (±0.05ms)
  - Speedup: 2.73x
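
Speedup is the reference time divided by the implementation time: here 1.23 ms / 0.45 ms ≈ 2.73x.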

Next Steps

Baseline Discovery

See what kernels PyTorch dispatches to.

Roofline Analysis

Analyze performance against hardware limits.

AI Agent

Get AI help optimizing your kernels.

Profiling

Profile your kernels with NCU and nsys.