Kernel Evaluation

The wafer evaluate command tests your GPU kernel implementations for correctness and optionally benchmarks their performance. It supports two kernel formats: GPUMode and KernelBench.

Quick Start

# Evaluate a GPUMode kernel
wafer evaluate gpumode --impl ./my_kernel.py --reference ./reference.py

# Evaluate a KernelBench kernel
wafer evaluate kernelbench --impl ./solution.py --reference ./problem.py

Formats

GPUMode Format

In the GPUMode format, the reference file defines a ref_kernel function and your implementation file defines a custom_kernel function:
# reference.py
def ref_kernel(input_tensor):
    return input_tensor * 2

# my_kernel.py
import torch
import triton
import triton.language as tl

@triton.jit
def kernel(input_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(input_ptr + offsets, mask=mask)
    tl.store(output_ptr + offsets, x * 2, mask=mask)

def custom_kernel(input_tensor):
    output = torch.empty_like(input_tensor)
    n_elements = input_tensor.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    kernel[grid](input_tensor, output, n_elements, BLOCK_SIZE=1024)
    return output
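
Before handing the files to wafer evaluate, you can sanity-check them against each other directly. This is a minimal sketch, assuming the two files above are importable as reference and my_kernel and that a CUDA device is available; the tolerances mirror the example test cases later on this page:
# check.py -- local sanity check (illustrative; wafer evaluate does this for you)
import torch

from my_kernel import custom_kernel
from reference import ref_kernel

x = torch.randn(1 << 20, device="cuda", dtype=torch.float16)
torch.testing.assert_close(custom_kernel(x), ref_kernel(x), rtol=1e-3, atol=1e-3)
print("custom_kernel matches ref_kernel")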

KernelBench Format

KernelBench uses a ModelNew class that replaces Model:
# problem.py (reference)
import torch

class Model(torch.nn.Module):
    def forward(self, x):
        return torch.softmax(x, dim=-1)

# solution.py (your implementation)
import torch

class ModelNew(torch.nn.Module):
    def forward(self, x):
        # Your optimized implementation
        return custom_softmax(x)
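
custom_softmax above is just a placeholder; a solution file must be self-contained. A minimal sketch of a complete ModelNew, using plain PyTorch ops as a stand-in for a fused kernel, could look like:
# solution.py -- self-contained sketch (numerically stable softmax)
import torch

class ModelNew(torch.nn.Module):
    def forward(self, x):
        # Subtract the row max before exponentiating for numerical stability
        shifted = x - x.amax(dim=-1, keepdim=True)
        exp = shifted.exp()
        return exp / exp.sum(dim=-1, keepdim=True)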

Commands

wafer evaluate gpumode

wafer evaluate gpumode [OPTIONS]
Options:

  Option                   Short  Description
  --impl                   -i     Path to implementation kernel file
  --reference                     Path to reference kernel file
  --test-cases                    Path to test cases JSON file
  --target                 -t     GPU target name
  --benchmark                     Run performance benchmarks
  --profile                       Enable profiling
  --defense/--no-defense          Run reward hack defense checks (default: enabled)
  --gpu-id                        Override GPU ID

Example with benchmarking:
wafer evaluate gpumode \
  --impl ./my_kernel.py \
  --reference ./reference.py \
  --benchmark \
  --target local

wafer evaluate kernelbench

wafer evaluate kernelbench [OPTIONS]
Options: the same as wafer evaluate gpumode, plus KernelBench-specific options. Example:
wafer evaluate kernelbench \
  --impl ./solution.py \
  --reference ./23_Softmax.py \
  --benchmark

wafer evaluate make-template

Generate template files to get started:
# Create templates in current directory
wafer evaluate make-template

# Create templates in specific directory
wafer evaluate make-template ./my-kernel --force

Test Cases

Provide custom test cases via JSON:
{
  "test_cases": [
    {
      "name": "small",
      "inputs": {"x": {"shape": [32, 32], "dtype": "float16"}},
      "rtol": 1e-3,
      "atol": 1e-3
    },
    {
      "name": "large",
      "inputs": {"x": {"shape": [4096, 4096], "dtype": "float16"}},
      "rtol": 1e-3,
      "atol": 1e-3
    }
  ]
}
wafer evaluate gpumode --test-cases ./cases.json --impl ./kernel.py
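
Each entry names an input, and its shape and dtype describe the tensor the evaluator builds for it. As a rough mental model only (the actual input generation is handled by wafer), an entry corresponds to something like:
# Illustrative only: how one test-case entry might map to an input tensor
import torch

spec = {"shape": [32, 32], "dtype": "float16"}
x = torch.randn(*spec["shape"], dtype=getattr(torch, spec["dtype"]), device="cuda")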

Defense Checks

By default, evaluation includes defense checks to detect potential reward hacking:
  • Verifies implementation doesn’t just copy the reference
  • Checks for meaningful computation
  • Validates output shapes and dtypes
Disable with --no-defense if needed for debugging.
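
As an illustration of the shape/dtype check above, the validation amounts to something like the hypothetical sketch below (not wafer's actual code):
# Hypothetical sketch of an output shape/dtype check
import torch

def validate_output(actual: torch.Tensor, expected: torch.Tensor) -> None:
    if actual.shape != expected.shape:
        raise ValueError(f"shape mismatch: {tuple(actual.shape)} vs {tuple(expected.shape)}")
    if actual.dtype != expected.dtype:
        raise ValueError(f"dtype mismatch: {actual.dtype} vs {expected.dtype}")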

Remote Evaluation

Run on remote GPUs using targets:
# List available targets
wafer config targets list

# Evaluate on a specific target
wafer evaluate gpumode --impl ./kernel.py --target my-h100
See Targets for setting up remote GPUs.

Output

Successful evaluation shows:
✓ Correctness: PASSED (all test cases)
  - small: max_diff=1.2e-5, rtol=1e-3 ✓
  - large: max_diff=3.4e-4, rtol=1e-3 ✓

Performance (--benchmark):
  - Implementation: 0.45ms (±0.02ms)
  - Reference: 1.23ms (±0.05ms)
  - Speedup: 2.73x

Next Steps