Kernel Evaluation

The wafer evaluate command tests your GPU kernel implementations for correctness and optionally benchmarks their performance. It supports two kernel formats: GPUMode and KernelBench.

Quick Start

# Evaluate a GPUMode kernel
wafer evaluate gpumode --impl ./my_kernel.py --reference ./reference.py

# Evaluate a KernelBench kernel
wafer evaluate kernelbench --impl ./solution.py --reference ./problem.py

Formats

GPUMode Format

In the GPUMode format, the reference file defines a ref_kernel function and your implementation defines a matching custom_kernel:
# reference.py
def ref_kernel(input_tensor):
    # Ground truth: elementwise doubling.
    return input_tensor * 2

# my_kernel.py
import torch
import triton
import triton.language as tl

@triton.jit
def kernel(input_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the input.
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final block
    x = tl.load(input_ptr + offsets, mask=mask)
    tl.store(output_ptr + offsets, x * 2, mask=mask)

def custom_kernel(input_tensor):
    output = torch.empty_like(input_tensor)
    n_elements = input_tensor.numel()
    # Launch one program per BLOCK_SIZE elements, rounded up.
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    kernel[grid](input_tensor, output, n_elements, BLOCK_SIZE=1024)
    return output
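
For intuition, you can reproduce the correctness check by hand: run both functions on the same input and compare outputs. This is a minimal sketch of the idea, not wafer's actual harness; the torch.allclose call and tolerance values are assumptions:
# check.py (illustrative only)
import torch
from reference import ref_kernel
from my_kernel import custom_kernel

x = torch.randn(1 << 20, device="cuda", dtype=torch.float16)
expected = ref_kernel(x)
actual = custom_kernel(x)
# Elementwise comparison within floating-point tolerances (assumed values).
assert torch.allclose(actual, expected, rtol=1e-3, atol=1e-3)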

KernelBench Format

KernelBench uses a ModelNew class that replaces Model:
# problem.py (reference)
import torch

class Model(torch.nn.Module):
    def forward(self, x):
        return torch.softmax(x, dim=-1)

# solution.py (your implementation)
import torch

class ModelNew(torch.nn.Module):
    def forward(self, x):
        # Your optimized implementation (see the sketch below)
        return custom_softmax(x)
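
The custom_softmax call above is left undefined; it stands in for your optimized kernel. As a rough placeholder, a numerically stable softmax over the last dimension could look like this in plain PyTorch (a sketch only; a competitive submission would typically fuse this into a Triton or CUDA kernel):
import torch

def custom_softmax(x: torch.Tensor) -> torch.Tensor:
    # Subtract the row max before exponentiating to avoid overflow.
    shifted = x - x.max(dim=-1, keepdim=True).values
    exp = shifted.exp()
    return exp / exp.sum(dim=-1, keepdim=True)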

Commands

wafer evaluate gpumode

wafer evaluate gpumode [OPTIONS]
Options:

  Option                   Short  Description
  --impl                   -i     Path to implementation kernel file
  --reference                     Path to reference kernel file
  --test-cases                    Path to test cases JSON file
  --target                 -t     GPU target name
  --benchmark                     Run performance benchmarks
  --profile                       Enable profiling
  --defense/--no-defense          Run reward hack defense checks (default: enabled)
  --gpu-id                        Override GPU ID

Example with benchmarking:
wafer evaluate gpumode \
  --impl ./my_kernel.py \
  --reference ./reference.py \
  --benchmark \
  --target local

wafer evaluate kernelbench

wafer evaluate kernelbench [OPTIONS]
Options: same as for gpumode, plus format-specific options.

Example:
wafer evaluate kernelbench \
  --impl ./solution.py \
  --reference ./23_Softmax.py \
  --benchmark

wafer evaluate make-template

Generate template files to get started:
# Create templates in current directory
wafer evaluate make-template

# Create templates in specific directory
wafer evaluate make-template ./my-kernel --force

Test Cases

Provide custom test cases via JSON:
{
  "test_cases": [
    {
      "name": "small",
      "inputs": {"x": {"shape": [32, 32], "dtype": "float16"}},
      "rtol": 1e-3,
      "atol": 1e-3
    },
    {
      "name": "large",
      "inputs": {"x": {"shape": [4096, 4096], "dtype": "float16"}},
      "rtol": 1e-3,
      "atol": 1e-3
    }
  ]
}
wafer evaluate gpumode --test-cases ./cases.json --impl ./kernel.py
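
Each case names its input tensors with a shape and dtype and sets per-case tolerances. The exact comparison wafer applies isn't documented here, but torch.allclose-style semantics are a reasonable mental model: a case passes when |impl - ref| <= atol + rtol * |ref| holds elementwise. As a sketch, a spec like the one above could be materialized into tensors roughly like this (hypothetical helper, not wafer's API):
# materialize.py (illustrative only)
import json
import torch

def materialize(spec):
    # Map {"shape": [...], "dtype": "float16"} to a random CUDA tensor.
    dtype = getattr(torch, spec["dtype"])
    return torch.randn(*spec["shape"], dtype=dtype, device="cuda")

with open("cases.json") as f:
    cases = json.load(f)["test_cases"]

for case in cases:
    inputs = {name: materialize(spec) for name, spec in case["inputs"].items()}
    # ... run both kernels on `inputs`, comparing with case["rtol"] / case["atol"]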

Defense Checks

By default, evaluation includes defense checks to detect potential reward hacking:
  • Verifies implementation doesn’t just copy the reference
  • Checks for meaningful computation
  • Validates output shapes and dtypes (sketched below)
Disable with --no-defense if needed for debugging.
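
Of these, the shape and dtype check is the easiest to picture: whatever the implementation computes, its outputs must structurally match the reference's before any numeric comparison. A conceptual sketch, not wafer's actual code:
import torch

def validate_output(actual: torch.Tensor, expected: torch.Tensor) -> None:
    # Fail fast on structural mismatches before comparing values.
    if actual.shape != expected.shape:
        raise ValueError(f"shape mismatch: {actual.shape} vs {expected.shape}")
    if actual.dtype != expected.dtype:
        raise ValueError(f"dtype mismatch: {actual.dtype} vs {expected.dtype}")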

Remote Evaluation

Run on remote GPUs using targets:
# List available targets
wafer config targets list

# Evaluate on a specific target
wafer evaluate gpumode --impl ./kernel.py --target my-h100
See Targets for setting up remote GPUs.

Output

Successful evaluation shows:
✓ Correctness: PASSED (all test cases)
  - small: max_diff=1.2e-5, rtol=1e-3 ✓
  - large: max_diff=3.4e-4, rtol=1e-3 ✓

Performance (--benchmark):
  - Implementation: 0.45ms (±0.02ms)
  - Reference: 1.23ms (±0.05ms)
  - Speedup: 2.73x
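
Speedup is the reference time divided by the implementation time: here 1.23 ms / 0.45 ms ≈ 2.73x.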

Next Steps

Baseline Discovery

See what kernels PyTorch dispatches to.

Roofline Analysis

Analyze performance against hardware limits.

AI Agent

Get AI help optimizing your kernels.

Profiling

Profile your kernels with NCU and nsys.