Baseline Discovery

The wafer baseline command runs a PyTorch operation and traces which GPU kernels are dispatched for it, so you can establish a performance baseline before writing custom kernels.
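
Conceptually, this is similar to running the operation under torch.profiler and reading the CUDA kernel names out of the trace. The sketch below illustrates that idea only; it is an assumption about the mechanism, not wafer's actual implementation:

# Rough sketch of the idea behind baseline discovery (assumption: a
# profiler trace like this underlies it; not wafer's actual code).
import torch
from torch.profiler import profile, ProfilerActivity

A = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
B = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    torch.matmul(A, B)
    torch.cuda.synchronize()

# The table lists the CUDA kernels (e.g. a cuBLAS GEMM) actually dispatched.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))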

Quick Start

# Trace a matrix multiplication
wafer baseline run "torch.matmul(A, B)" --shape A=1024,1024 --shape B=1024,1024

# Trace with roofline analysis
wafer baseline run "torch.matmul(A, B)" --shape A=1024,1024 --shape B=1024,1024 --hardware H100

# List supported hardware
wafer baseline hardware

Commands

wafer baseline run

Execute an operation and trace the kernel dispatch:
wafer baseline run [OPTIONS] "<operation>"
Options:
Option        Short   Description
--shape       -s      Tensor shape: name=dim1,dim2,... (repeatable)
--dtype       -d      Data type for tensors (default: float16)
--hardware            Hardware name for roofline analysis
--target      -t      GPU target for execution
--workspace   -w      Workspace name
--warmup              Warmup iterations (default: 10)
--runs                Profiling runs (default: 100)
--no-cache            Skip cache, always run fresh
--json                Output as JSON
--verbose     -v      Show verbose output
--timeout             Timeout in seconds (default: 120)
Examples:
# Simple matmul
wafer baseline run "torch.matmul(A, B)" \
  --shape A=1024,1024 \
  --shape B=1024,1024

# Softmax with fp32
wafer baseline run "torch.softmax(X, dim=-1)" \
  --shape X=32,1024,1024 \
  --dtype float32

# Convolution
wafer baseline run "torch.nn.functional.conv2d(X, W)" \
  --shape X=32,64,224,224 \
  --shape W=128,64,3,3

wafer baseline hardware

List supported hardware for roofline analysis:
wafer baseline hardware
Output:
Supported Hardware:
  H100     - NVIDIA H100 SXM5 (80GB)
  H200     - NVIDIA H200 SXM (141GB)
  A100     - NVIDIA A100 SXM4 (80GB)
  B200     - NVIDIA B200 (next-gen)
  MI300X   - AMD Instinct MI300X
  MI250X   - AMD Instinct MI250X

Output

A baseline run reports the dispatched kernel, timing statistics, memory traffic, and (when --hardware is given) a roofline analysis:
Operation: torch.matmul(A, B)
Shapes: A=[1024, 1024], B=[1024, 1024]
Dtype: float16

Dispatched Kernel:
  Name: ampere_h16816gemm_256x128_ldg8_stages_32x3_nn
  Library: cuBLAS
  Grid: (8, 4, 1)
  Block: (256, 1, 1)

Performance:
  Median: 0.142ms
  Min: 0.138ms
  Max: 0.156ms
  Std: 0.004ms

Memory:
  Input bytes: 4.19 MB
  Output bytes: 2.10 MB
  Total: 6.29 MB

Roofline Analysis (H100):
  Arithmetic Intensity: 341.3 FLOP/byte
  Achieved TFLOPS: 15.1
  Peak TFLOPS: 989.4
  Efficiency: 1.5%
  Bottleneck: Memory bound (likely launch overhead)
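
The roofline figures follow directly from the shapes and median time above. A quick arithmetic sketch for the 1024x1024x1024 float16 matmul:

# Reproducing the roofline numbers above for a 1024x1024x1024 fp16 matmul.
M = N = K = 1024
bytes_per_elem = 2                                 # float16

flops = 2 * M * N * K                              # ~2.15 GFLOP
bytes_moved = (M * K + K * N + M * N) * bytes_per_elem   # ~6.29 MB

ai = flops / bytes_moved                           # ~341.3 FLOP/byte
achieved_tflops = flops / 0.142e-3 / 1e12          # median 0.142 ms -> ~15.1 TFLOPS
efficiency = achieved_tflops / 989.4               # vs. H100 peak -> ~1.5%

print(f"AI = {ai:.1f} FLOP/byte, {achieved_tflops:.1f} TFLOPS, {efficiency:.1%}")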

Use Cases

Understanding PyTorch Dispatch

See what kernel runs for your operation:
wafer baseline run "torch.add(A, B)" --shape A=1024,1024 --shape B=1024,1024

Establishing Performance Baseline

Before optimizing, know the current performance:
wafer baseline run "torch.matmul(A, B)" \
  --shape A=4096,4096 \
  --shape B=4096,4096 \
  --hardware H100 \
  --runs 1000
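
The --warmup and --runs flags follow the usual CUDA timing pattern: discard the first iterations, then time many runs and report statistics. A roughly equivalent manual measurement, shown only as a sketch rather than wafer's implementation:

# Manual equivalent of --warmup 10 / --runs 1000 (a sketch, not wafer's code).
import statistics
import torch

A = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
B = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

for _ in range(10):                      # warmup iterations
    torch.matmul(A, B)
torch.cuda.synchronize()

times_ms = []
for _ in range(1000):                    # profiling runs
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    torch.matmul(A, B)
    end.record()
    torch.cuda.synchronize()
    times_ms.append(start.elapsed_time(end))

print(f"median {statistics.median(times_ms):.3f} ms")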

Comparing Data Types

# FP16
wafer baseline run "torch.matmul(A, B)" --shape A=1024,1024 --shape B=1024,1024 --dtype float16

# FP32
wafer baseline run "torch.matmul(A, B)" --shape A=1024,1024 --shape B=1024,1024 --dtype float32

# BF16
wafer baseline run "torch.matmul(A, B)" --shape A=1024,1024 --shape B=1024,1024 --dtype bfloat16
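
For the same shapes, the dtype changes the memory traffic and therefore the arithmetic intensity: float32 moves twice the bytes per element of float16 or bfloat16. A quick sketch of that arithmetic:

# Effect of dtype on memory traffic for the same 1024x1024x1024 matmul.
M = N = K = 1024
flops = 2 * M * N * K
for dtype, bytes_per_elem in [("float16", 2), ("bfloat16", 2), ("float32", 4)]:
    bytes_moved = (M * K + K * N + M * N) * bytes_per_elem
    print(f"{dtype}: {bytes_moved / 1e6:.2f} MB moved, "
          f"AI = {flops / bytes_moved:.1f} FLOP/byte")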

Remote Execution

Run on remote GPUs:
wafer baseline run "torch.matmul(A, B)" \
  --shape A=1024,1024 \
  --shape B=1024,1024 \
  --target h100-box

Caching

Results are cached by default to speed up repeated queries:
# Use cached result
wafer baseline run "torch.matmul(A, B)" --shape A=1024,1024 --shape B=1024,1024

# Force fresh execution
wafer baseline run "torch.matmul(A, B)" --shape A=1024,1024 --shape B=1024,1024 --no-cache

Tensor Variable Names

Use any valid Python identifier for tensor names:
wafer baseline run "torch.einsum('bhqd,bhkd->bhqk', query, key)" \
  --shape query=2,8,1024,64 \
  --shape key=2,8,1024,64
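
For reference, that einsum computes batched attention scores and is equivalent to a batched matmul of query against key transposed on its last two dimensions. A small sketch of the equivalence:

# 'bhqd,bhkd->bhqk' is a batched matmul producing (batch, heads, q_len, k_len).
import torch

query = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
key = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)

scores = torch.einsum("bhqd,bhkd->bhqk", query, key)
assert torch.allclose(scores, query @ key.transpose(-2, -1), atol=1e-2)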

Next Steps