Roofline Analysis

The roofline model helps you understand whether your kernel is compute-bound or memory-bound, and how close it is to hardware limits. Use wafer roofline to analyze performance against theoretical peaks.

Quick Start

# Analyze a kernel
wafer roofline --gpu H100 --bytes 1e9 --flops 1e12 --time-ms 0.5

# List available GPU specs
wafer roofline --list-gpus

Command

wafer roofline

wafer roofline [OPTIONS]
Options:
  Option       Short   Description
  --gpu        -g      GPU name (e.g., H100, A100, MI300X)
  --bytes      -b      Theoretical minimum bytes moved
  --flops      -f      Theoretical minimum FLOPs
  --time-ms    -t      Actual kernel time in milliseconds
  --dtype      -d      Data type for compute ceiling (default: fp16)
  --list-gpus          List available GPU specs
Example:
wafer roofline \
  --gpu H100 \
  --bytes 2147483648 \
  --flops 2199023255552 \
  --time-ms 0.45 \
  --dtype fp16
Output:
Roofline Analysis
=================

Hardware: NVIDIA H100 SXM5
  Memory Bandwidth: 3.35 TB/s
  FP16 Tensor Peak: 989.4 TFLOPS

Your Kernel:
  Bytes: 2.00 GB
  FLOPs: 2.00 TFLOP
  Time: 0.450 ms

Arithmetic Intensity: 1024.0 FLOP/byte
Ridge Point: 295.3 FLOP/byte

Performance:
  Achieved Throughput: 4.44 PFLOPS
  Achieved Bandwidth: 4.44 TB/s

  Compute Efficiency: 449.2% of peak (!)
  Memory Efficiency: 132.6% of peak (!)

Analysis:
  Status: COMPUTE BOUND
  Your kernel has high arithmetic intensity (1024 > ridge point 295).
  Focus on: reducing instruction count, improving occupancy, ILP.

  Note: Achieved throughput and bandwidth exceed theoretical peaks,
  which suggests your byte and FLOP estimates are higher than what
  the kernel actually did, the measured time is inaccurate, or
  caching effects are significant.

Understanding the Roofline

Key Concepts

  • Arithmetic Intensity (AI): FLOPs per byte of memory moved
  • Ridge Point: AI where compute and memory rooflines intersect
  • Compute Bound: AI > ridge point (limited by compute)
  • Memory Bound: AI < ridge point (limited by memory bandwidth)
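The first two concepts are plain ratios, so the classification is easy to reproduce by hand. A minimal sketch using the H100 figures and the example inputs from the output above (not wafer's own code):

```python
# Compute- vs memory-bound classification from arithmetic intensity.
# Peak numbers are the H100 specs quoted in the example output.
peak_flops = 989.4e12     # FP16 tensor peak, FLOP/s
peak_bw = 3.35e12         # memory bandwidth, bytes/s

flops = 2199023255552     # --flops from the example
bytes_moved = 2147483648  # --bytes from the example

ai = flops / bytes_moved      # arithmetic intensity, FLOP/byte
ridge = peak_flops / peak_bw  # ridge point, FLOP/byte

status = "COMPUTE BOUND" if ai > ridge else "MEMORY BOUND"
```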

The Roofline Model

  Perf
  (FLOP/s)
     |
     |            ________________  Compute Ceiling
     |           /
     |          /  <- Ridge Point
     |         /
     |        /
     |       /  Memory Ceiling (slope = bandwidth)
     |      /
     +---------------------------- Arithmetic Intensity (FLOP/byte)
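The diagram boils down to one formula: attainable performance at a given AI is the lower of the two ceilings, min(peak_compute, AI * peak_bandwidth). A sketch with the H100 numbers used on this page:

```python
# Roofline ceiling: attainable FLOP/s at a given arithmetic intensity.
PEAK_FLOPS = 989.4e12  # H100 FP16 tensor peak, FLOP/s
PEAK_BW = 3.35e12      # H100 memory bandwidth, bytes/s

def attainable(ai):
    """min(compute ceiling, memory ceiling) at ai FLOP/byte."""
    return min(PEAK_FLOPS, ai * PEAK_BW)

# Below the ridge point (~295 FLOP/byte) the sloped memory line wins;
# above it, the flat compute ceiling wins.
low = attainable(100.0)    # limited by bandwidth
high = attainable(1024.0)  # capped at the compute peak
```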

Available GPUs

wafer roofline --list-gpus
Output:
Available GPU Specifications:

NVIDIA:
  H100     FP16: 989.4 TFLOPS    BW: 3.35 TB/s
  H200     FP16: 989.4 TFLOPS    BW: 4.80 TB/s
  A100     FP16: 312.0 TFLOPS    BW: 2.04 TB/s
  B200     FP16: 2250.0 TFLOPS   BW: 8.00 TB/s
  RTX4090  FP16: 165.2 TFLOPS    BW: 1.01 TB/s

AMD:
  MI300X   FP16: 1307.4 TFLOPS   BW: 5.30 TB/s
  MI250X   FP16: 383.0 TFLOPS    BW: 3.28 TB/s

Calculating Inputs

Bytes (Memory Traffic)

Calculate the theoretical minimum bytes:
# Matrix multiplication: C = A @ B
# A: [M, K], B: [K, N], C: [M, N]
M = K = N = 1024
dtype_size = 2  # fp16 is 2 bytes per element

bytes_read = (M * K + K * N) * dtype_size  # read A and B
bytes_write = M * N * dtype_size           # write C
total_bytes = bytes_read + bytes_write
# (1024*1024 + 1024*1024 + 1024*1024) * 2 = 6,291,456 bytes ~= 6.29 MB

FLOPs (Floating Point Operations)

Calculate theoretical FLOPs:
# Matrix multiplication: 2*M*N*K operations
# (one multiply and one add per output element, over K terms)
M = N = K = 1024
flops = 2 * M * N * K
# 2 * 1024 * 1024 * 1024 = 2,147,483,648 ~= 2.15 GFLOP
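Putting the two counts together shows why this 1024-cubed fp16 GEMM is (barely) compute bound on an H100: its arithmetic intensity sits just above the ridge point. A quick check (H100 peaks as listed above; this mirrors the byte and FLOP formulas, not wafer's internals):

```python
# Arithmetic intensity of a 1024x1024x1024 fp16 GEMM, using the
# byte and FLOP counts derived in the two sections above.
M = N = K = 1024
dtype_size = 2  # fp16

total_bytes = (M * K + K * N + M * N) * dtype_size  # 6,291,456 bytes
flops = 2 * M * N * K                               # 2,147,483,648 FLOPs

ai = flops / total_bytes         # ~341.3 FLOP/byte
ridge_h100 = 989.4e12 / 3.35e12  # ~295.3 FLOP/byte
compute_bound = ai > ridge_h100  # True: just above the ridge point
```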

Time

Get kernel time from profiling:
# From NCU report
wafer nvidia ncu analyze ./profile.ncu-rep

# From nsys trace
wafer nvidia nsys analyze ./trace.nsys-rep --kernels

Interpretation Guide

  Situation                                What It Means              Action
  Memory efficiency < 50%                  Not saturating memory      Check coalescing, caching
  Compute efficiency < 50%, compute bound  Low utilization            Check occupancy, ILP
  Near 100% efficiency                     Optimal                    Hardware limited
  Bandwidth > peak                         Estimate wrong or caching  Refine byte calculation
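The table's logic can be sketched as a small decision function (purely illustrative; wafer does not expose such a helper):

```python
# Hypothetical helper mirroring the interpretation table above.
def interpret(compute_eff, memory_eff, compute_bound):
    """Efficiencies are fractions of peak (1.0 == 100%)."""
    if compute_eff > 1.0 or memory_eff > 1.0:
        return "Estimate wrong or caching: refine byte/FLOP calculation"
    if compute_bound and compute_eff < 0.5:
        return "Low utilization: check occupancy, ILP"
    if not compute_bound and memory_eff < 0.5:
        return "Not saturating memory: check coalescing, caching"
    return "Near hardware limits: optimal"
```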

Common Issues

Your bytes estimate may be too high. Consider:
  • Cache reuse reducing actual memory traffic
  • Broadcast/sharing between threads
  • Compressed data formats
Even for compute-bound kernels, check:
  • Occupancy limits
  • Instruction-level parallelism
  • Warp divergence
  • Register spilling
You may be bound by something else:
  • Launch overhead (small kernels)
  • Synchronization
  • PCIe transfers

Next Steps