Baseline Discovery

The wafer baseline command runs a PyTorch operation and traces which GPU kernels are dispatched for it, so you can establish a performance baseline before writing custom kernels.
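
Conceptually, this is similar to running the operation under torch.profiler and reading the CUDA kernel names out of the trace. The sketch below illustrates that idea only; it is an assumption about the mechanism, not wafer's actual implementation:

# Rough sketch of the idea behind baseline discovery (assumption: a
# profiler trace like this underlies it; not wafer's actual code).
import torch
from torch.profiler import profile, ProfilerActivity

A = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
B = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    torch.matmul(A, B)
    torch.cuda.synchronize()

# The table lists the CUDA kernels (e.g. a cuBLAS GEMM) actually dispatched.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))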

Quick Start

# Trace a matrix multiplication
wafer baseline run "torch.matmul(A, B)" --shape A=1024,1024 --shape B=1024,1024

# Trace with roofline analysis
wafer baseline run "torch.matmul(A, B)" --shape A=1024,1024 --shape B=1024,1024 --hardware H100

# List supported hardware
wafer baseline hardware

Commands

wafer baseline run

Execute an operation and trace the kernel dispatch:
wafer baseline run [OPTIONS] "<operation>"
Options:
Option        Short   Description
--shape       -s      Tensor shape: name=dim1,dim2,... (repeatable)
--dtype       -d      Data type for tensors (default: float16)
--hardware            Hardware name for roofline analysis
--target      -t      GPU target for execution
--workspace   -w      Workspace name
--warmup              Warmup iterations (default: 10)
--runs                Profiling runs (default: 100)
--no-cache            Skip cache, always run fresh
--json                Output as JSON
--verbose     -v      Show verbose output
--timeout             Timeout in seconds (default: 120)
Examples:
# Simple matmul
wafer baseline run "torch.matmul(A, B)" \
  --shape A=1024,1024 \
  --shape B=1024,1024

# Softmax with fp32
wafer baseline run "torch.softmax(X, dim=-1)" \
  --shape X=32,1024,1024 \
  --dtype float32

# Convolution
wafer baseline run "torch.nn.functional.conv2d(X, W)" \
  --shape X=32,64,224,224 \
  --shape W=128,64,3,3

wafer baseline hardware

List supported hardware for roofline analysis:
wafer baseline hardware
Output:
Supported Hardware:
  H100     - NVIDIA H100 SXM5 (80GB)
  H200     - NVIDIA H200 SXM (141GB)
  A100     - NVIDIA A100 SXM4 (80GB)
  B200     - NVIDIA B200 (next-gen)
  MI300X   - AMD Instinct MI300X
  MI250X   - AMD Instinct MI250X

Output

A baseline run reports the dispatched kernel, timing statistics, memory traffic, and (when --hardware is given) a roofline analysis:
Operation: torch.matmul(A, B)
Shapes: A=[1024, 1024], B=[1024, 1024]
Dtype: float16

Dispatched Kernel:
  Name: ampere_h16816gemm_256x128_ldg8_stages_32x3_nn
  Library: cuBLAS
  Grid: (8, 4, 1)
  Block: (256, 1, 1)

Performance:
  Median: 0.142ms
  Min: 0.138ms
  Max: 0.156ms
  Std: 0.004ms

Memory:
  Input bytes: 4.19 MB
  Output bytes: 2.10 MB
  Total: 6.29 MB

Roofline Analysis (H100):
  Arithmetic Intensity: 341.3 FLOP/byte
  Achieved TFLOPS: 15.1
  Peak TFLOPS: 989.4
  Efficiency: 1.5%
  Bottleneck: Memory bound (likely launch overhead)
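
The roofline figures follow directly from the shapes and median time above. A quick arithmetic sketch for the 1024x1024x1024 float16 matmul:

# Reproducing the roofline numbers above for a 1024x1024x1024 fp16 matmul.
M = N = K = 1024
bytes_per_elem = 2                                 # float16

flops = 2 * M * N * K                              # ~2.15 GFLOP
bytes_moved = (M * K + K * N + M * N) * bytes_per_elem   # ~6.29 MB

ai = flops / bytes_moved                           # ~341.3 FLOP/byte
achieved_tflops = flops / 0.142e-3 / 1e12          # median 0.142 ms -> ~15.1 TFLOPS
efficiency = achieved_tflops / 989.4               # vs. H100 peak -> ~1.5%

print(f"AI = {ai:.1f} FLOP/byte, {achieved_tflops:.1f} TFLOPS, {efficiency:.1%}")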

Use Cases

Understanding PyTorch Dispatch

See what kernel runs for your operation:
wafer baseline run "torch.add(A, B)" --shape A=1024,1024 --shape B=1024,1024

Establishing Performance Baseline

Before optimizing, know the current performance:
wafer baseline run "torch.matmul(A, B)" \
  --shape A=4096,4096 \
  --shape B=4096,4096 \
  --hardware H100 \
  --runs 1000
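
The --warmup and --runs flags follow the usual CUDA timing pattern: discard the first iterations, then time many runs and report statistics. A roughly equivalent manual measurement, shown only as a sketch rather than wafer's implementation:

# Manual equivalent of --warmup 10 / --runs 1000 (a sketch, not wafer's code).
import statistics
import torch

A = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
B = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

for _ in range(10):                      # warmup iterations
    torch.matmul(A, B)
torch.cuda.synchronize()

times_ms = []
for _ in range(1000):                    # profiling runs
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    torch.matmul(A, B)
    end.record()
    torch.cuda.synchronize()
    times_ms.append(start.elapsed_time(end))

print(f"median {statistics.median(times_ms):.3f} ms")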

Comparing Data Types

# FP16
wafer baseline run "torch.matmul(A, B)" --shape A=1024,1024 --shape B=1024,1024 --dtype float16

# FP32
wafer baseline run "torch.matmul(A, B)" --shape A=1024,1024 --shape B=1024,1024 --dtype float32

# BF16
wafer baseline run "torch.matmul(A, B)" --shape A=1024,1024 --shape B=1024,1024 --dtype bfloat16
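
For the same shapes, the dtype changes the memory traffic and therefore the arithmetic intensity: float32 moves twice the bytes per element of float16 or bfloat16. A quick sketch of that arithmetic:

# Effect of dtype on memory traffic for the same 1024x1024x1024 matmul.
M = N = K = 1024
flops = 2 * M * N * K
for dtype, bytes_per_elem in [("float16", 2), ("bfloat16", 2), ("float32", 4)]:
    bytes_moved = (M * K + K * N + M * N) * bytes_per_elem
    print(f"{dtype}: {bytes_moved / 1e6:.2f} MB moved, "
          f"AI = {flops / bytes_moved:.1f} FLOP/byte")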

Remote Execution

Run on remote GPUs:
wafer baseline run "torch.matmul(A, B)" \
  --shape A=1024,1024 \
  --shape B=1024,1024 \
  --target h100-box

Caching

Results are cached by default to speed up repeated queries:
# Use cached result
wafer baseline run "torch.matmul(A, B)" --shape A=1024,1024 --shape B=1024,1024

# Force fresh execution
wafer baseline run "torch.matmul(A, B)" --shape A=1024,1024 --shape B=1024,1024 --no-cache

Tensor Variable Names

Use any valid Python identifier for tensor names:
wafer baseline run "torch.einsum('bhqd,bhkd->bhqk', query, key)" \
  --shape query=2,8,1024,64 \
  --shape key=2,8,1024,64
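
For reference, that einsum computes batched attention scores and is equivalent to a batched matmul of query against key transposed on its last two dimensions. A small sketch of the equivalence:

# 'bhqd,bhkd->bhqk' is a batched matmul producing (batch, heads, q_len, k_len).
import torch

query = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
key = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)

scores = torch.einsum("bhqd,bhkd->bhqk", query, key)
assert torch.allclose(scores, query @ key.transpose(-2, -1), atol=1e-2)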

Next Steps