Roofline Analysis

The roofline model helps you understand whether your kernel is compute-bound or memory-bound, and how close it is to hardware limits. Use wafer roofline to analyze performance against theoretical peaks.

Quick Start

# Analyze a kernel
wafer roofline --gpu H100 --bytes 1e9 --flops 1e12 --time-ms 0.5

# List available GPU specs
wafer roofline --list-gpus

Command

wafer roofline

wafer roofline [OPTIONS]
Options:
  Option       Short   Description
  --gpu        -g      GPU name (e.g., H100, A100, MI300X)
  --bytes      -b      Theoretical minimum bytes moved
  --flops      -f      Theoretical minimum FLOPs
  --time-ms    -t      Actual kernel time in milliseconds
  --dtype      -d      Data type for compute ceiling (default: fp16)
  --list-gpus          List available GPU specs
Example:
wafer roofline \
  --gpu H100 \
  --bytes 2147483648 \
  --flops 2199023255552 \
  --time-ms 0.45 \
  --dtype fp16
Output:
Roofline Analysis
=================

Hardware: NVIDIA H100 SXM5
  Memory Bandwidth: 3.35 TB/s
  FP16 Tensor Peak: 989.4 TFLOPS

Your Kernel:
  Bytes: 2.00 GB
  FLOPs: 2.00 TFLOP
  Time: 0.450 ms

Arithmetic Intensity: 1024.0 FLOP/byte
Ridge Point: 295.3 FLOP/byte

Performance:
  Achieved Throughput: 4.44 PFLOPS
  Achieved Bandwidth: 4.44 TB/s

  Compute Efficiency: 449.2% of peak (!)
  Memory Efficiency: 132.6% of peak (!)

Analysis:
  Status: COMPUTE BOUND
  Your kernel has high arithmetic intensity (1024 > ridge point 295).
  Focus on: reducing instruction count, improving occupancy, ILP.

  Note: Achieved throughput and bandwidth exceed theoretical peaks,
  which suggests your byte and FLOP estimates are higher than what
  the kernel actually did, the measured time is inaccurate, or
  caching effects are significant.

Understanding the Roofline

Key Concepts

  • Arithmetic Intensity (AI): FLOPs per byte of memory moved
  • Ridge Point: AI where compute and memory rooflines intersect
  • Compute Bound: AI > ridge point (limited by compute)
  • Memory Bound: AI < ridge point (limited by memory bandwidth)
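The first two concepts are plain ratios, so the classification is easy to reproduce by hand. A minimal sketch using the H100 figures and the example inputs from the output above (not wafer's own code):

```python
# Compute- vs memory-bound classification from arithmetic intensity.
# Peak numbers are the H100 specs quoted in the example output.
peak_flops = 989.4e12     # FP16 tensor peak, FLOP/s
peak_bw = 3.35e12         # memory bandwidth, bytes/s

flops = 2199023255552     # --flops from the example
bytes_moved = 2147483648  # --bytes from the example

ai = flops / bytes_moved      # arithmetic intensity, FLOP/byte
ridge = peak_flops / peak_bw  # ridge point, FLOP/byte

status = "COMPUTE BOUND" if ai > ridge else "MEMORY BOUND"
```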

The Roofline Model

  Perf
  (FLOP/s)
     |
     |            ________________  Compute Ceiling
     |           /
     |          /  <- Ridge Point
     |         /
     |        /
     |       /  Memory Ceiling (slope = bandwidth)
     |      /
     +---------------------------- Arithmetic Intensity (FLOP/byte)
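The diagram boils down to one formula: attainable performance at a given AI is the lower of the two ceilings, min(peak_compute, AI * peak_bandwidth). A sketch with the H100 numbers used on this page:

```python
# Roofline ceiling: attainable FLOP/s at a given arithmetic intensity.
PEAK_FLOPS = 989.4e12  # H100 FP16 tensor peak, FLOP/s
PEAK_BW = 3.35e12      # H100 memory bandwidth, bytes/s

def attainable(ai):
    """min(compute ceiling, memory ceiling) at ai FLOP/byte."""
    return min(PEAK_FLOPS, ai * PEAK_BW)

# Below the ridge point (~295 FLOP/byte) the sloped memory line wins;
# above it, the flat compute ceiling wins.
low = attainable(100.0)    # limited by bandwidth
high = attainable(1024.0)  # capped at the compute peak
```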

Available GPUs

wafer roofline --list-gpus
Output:
Available GPU Specifications:

NVIDIA:
  H100     FP16: 989.4 TFLOPS    BW: 3.35 TB/s
  H200     FP16: 989.4 TFLOPS    BW: 4.80 TB/s
  A100     FP16: 312.0 TFLOPS    BW: 2.04 TB/s
  B200     FP16: 2250.0 TFLOPS   BW: 8.00 TB/s
  RTX4090  FP16: 165.2 TFLOPS    BW: 1.01 TB/s

AMD:
  MI300X   FP16: 1307.4 TFLOPS   BW: 5.30 TB/s
  MI250X   FP16: 383.0 TFLOPS    BW: 3.28 TB/s

Calculating Inputs

Bytes (Memory Traffic)

Calculate the theoretical minimum bytes:
# Matrix multiplication: C = A @ B
# A: [M, K], B: [K, N], C: [M, N]
M = K = N = 1024
dtype_size = 2  # fp16 is 2 bytes per element

bytes_read = (M * K + K * N) * dtype_size  # read A and B
bytes_write = M * N * dtype_size           # write C
total_bytes = bytes_read + bytes_write
# (1024*1024 + 1024*1024 + 1024*1024) * 2 = 6,291,456 bytes ~= 6.29 MB

FLOPs (Floating Point Operations)

Calculate theoretical FLOPs:
# Matrix multiplication: 2*M*N*K operations
# (one multiply and one add per output element, over K terms)
M = N = K = 1024
flops = 2 * M * N * K
# 2 * 1024 * 1024 * 1024 = 2,147,483,648 ~= 2.15 GFLOP
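Putting the two counts together shows why this 1024-cubed fp16 GEMM is (barely) compute bound on an H100: its arithmetic intensity sits just above the ridge point. A quick check (H100 peaks as listed above; this mirrors the byte and FLOP formulas, not wafer's internals):

```python
# Arithmetic intensity of a 1024x1024x1024 fp16 GEMM, using the
# byte and FLOP counts derived in the two sections above.
M = N = K = 1024
dtype_size = 2  # fp16

total_bytes = (M * K + K * N + M * N) * dtype_size  # 6,291,456 bytes
flops = 2 * M * N * K                               # 2,147,483,648 FLOPs

ai = flops / total_bytes         # ~341.3 FLOP/byte
ridge_h100 = 989.4e12 / 3.35e12  # ~295.3 FLOP/byte
compute_bound = ai > ridge_h100  # True: just above the ridge point
```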

Time

Get kernel time from profiling:
# From NCU report
wafer nvidia ncu analyze ./profile.ncu-rep

# From nsys trace
wafer nvidia nsys analyze ./trace.nsys-rep --kernels

Interpretation Guide

  Situation                                What It Means              Action
  Memory efficiency < 50%                  Not saturating memory      Check coalescing, caching
  Compute efficiency < 50%, compute bound  Low utilization            Check occupancy, ILP
  Near 100% efficiency                     Optimal                    Hardware limited
  Bandwidth > peak                         Estimate wrong or caching  Refine byte calculation
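The table's logic can be sketched as a small decision function (purely illustrative; wafer does not expose such a helper):

```python
# Hypothetical helper mirroring the interpretation table above.
def interpret(compute_eff, memory_eff, compute_bound):
    """Efficiencies are fractions of peak (1.0 == 100%)."""
    if compute_eff > 1.0 or memory_eff > 1.0:
        return "Estimate wrong or caching: refine byte/FLOP calculation"
    if compute_bound and compute_eff < 0.5:
        return "Low utilization: check occupancy, ILP"
    if not compute_bound and memory_eff < 0.5:
        return "Not saturating memory: check coalescing, caching"
    return "Near hardware limits: optimal"
```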

Common Issues

Your bytes estimate may be too high. Consider:
  • Cache reuse reducing actual memory traffic
  • Broadcast/sharing between threads
  • Compressed data formats
Even for compute-bound kernels, check:
  • Occupancy limits
  • Instruction-level parallelism
  • Warp divergence
  • Register spilling
You may be bound by something else:
  • Launch overhead (small kernels)
  • Synchronization
  • PCIe transfers

Next Steps