Roofline Analysis
The roofline model shows whether a kernel is compute-bound or memory-bound, and how close it runs to the hardware limits. Use wafer roofline to compare a kernel's measured performance against the GPU's theoretical peaks.
Quick Start
Command
wafer roofline
| Option | Short | Description |
|---|---|---|
| --gpu | -g | GPU name (e.g., H100, A100, MI300X) |
| --bytes | -b | Theoretical minimum bytes moved |
| --flops | -f | Theoretical minimum FLOPs |
| --time-ms | -t | Actual kernel time in milliseconds |
| --dtype | -d | Data type for compute ceiling (default: fp16) |
| --list-gpus | | List available GPU specs |
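As a rough illustration of how the options in the table combine, the sketch below assembles a command line from Python. The helper name roofline_argv is hypothetical, the numbers are placeholders for a 4096 x 4096 x 4096 fp16 matrix multiply with a made-up kernel time, and passing values as plain integers and decimals is an assumption about what the CLI accepts.

```python
import shlex


def roofline_argv(gpu, bytes_moved, flops, time_ms, dtype="fp16"):
    """Build an argument list using only the flags listed in the options table.

    Hypothetical helper: it only assembles the arguments and does not run
    the tool. The numeric formatting is an assumption.
    """
    return [
        "wafer", "roofline",
        "--gpu", gpu,
        "--bytes", str(bytes_moved),
        "--flops", str(flops),
        "--time-ms", str(time_ms),
        "--dtype", dtype,
    ]


# Placeholder inputs: fp16 GEMM with M = N = K = 4096 and an invented timing.
argv = roofline_argv(
    "H100",
    bytes_moved=100_663_296,
    flops=137_438_953_472,
    time_ms=0.25,
)
print(shlex.join(argv))
```

If the tool is installed, the same list could be handed to subprocess.run(argv, check=True) instead of being printed.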
Understanding the Roofline
Key Concepts
- Arithmetic Intensity (AI): FLOPs per byte of memory moved
- Ridge Point: AI where compute and memory rooflines intersect
- Compute Bound: AI > ridge point (limited by compute)
- Memory Bound: AI < ridge point (limited by memory bandwidth)
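The definitions above can be sketched in a few lines of Python. The peak values here are round placeholders rather than the specs of any real GPU, and the function names are illustrative only.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs per byte of memory moved."""
    return flops / bytes_moved


def ridge_point(peak_flops_per_s, peak_bytes_per_s):
    """AI at which the compute and memory rooflines intersect."""
    return peak_flops_per_s / peak_bytes_per_s


def classify(ai, ridge):
    """Compute-bound above the ridge point, memory-bound below it.

    An AI exactly at the ridge point sits on both rooflines; this sketch
    arbitrarily reports it as memory-bound.
    """
    return "compute-bound" if ai > ridge else "memory-bound"


# Placeholder peaks (assumptions, not a real GPU spec).
PEAK_FLOPS_PER_S = 1000e12  # 1000 TFLOP/s
PEAK_BYTES_PER_S = 2e12     # 2 TB/s

ridge = ridge_point(PEAK_FLOPS_PER_S, PEAK_BYTES_PER_S)

# fp16 vector add over n elements: n FLOPs, 3 * n * 2 bytes (two reads, one write).
n = 1 << 20
add_ai = arithmetic_intensity(n, 3 * n * 2)

# fp16 GEMM with M = N = K = 4096: 2*M*N*K FLOPs, three matrices of 2-byte elements.
gemm_ai = arithmetic_intensity(2 * 4096**3, 3 * 4096**2 * 2)

print(f"ridge point: {ridge:.0f} FLOPs/byte")
print(f"vector add:  AI = {add_ai:.3f} -> {classify(add_ai, ridge)}")
print(f"GEMM:        AI = {gemm_ai:.1f} -> {classify(gemm_ai, ridge)}")
```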
The Roofline Model
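In the standard roofline formulation, attainable performance at a given arithmetic intensity is the smaller of the compute peak and AI times the memory bandwidth peak. The sketch below evaluates that bound for a few AI values. The peaks are placeholder numbers, not the specs wafer uses for any GPU.

```python
def attainable_flops_per_s(ai, peak_flops_per_s, peak_bytes_per_s):
    """Roofline bound: min(compute ceiling, AI * bandwidth ceiling)."""
    return min(peak_flops_per_s, ai * peak_bytes_per_s)


# Placeholder peaks (assumptions for illustration only).
PEAK_FLOPS_PER_S = 1000e12  # 1000 TFLOP/s
PEAK_BYTES_PER_S = 2e12     # 2 TB/s

for ai in (0.25, 8, 500, 4096):
    bound = attainable_flops_per_s(ai, PEAK_FLOPS_PER_S, PEAK_BYTES_PER_S)
    print(f"AI = {ai:>7} FLOPs/byte -> at most {bound / 1e12:.1f} TFLOP/s")
```

Below the ridge point the bound grows linearly with AI (the sloped memory roof). At and above it the bound stays flat at the compute peak.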
Available GPUs
Calculating Inputs
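As a worked sketch of the bytes and FLOPs inputs, the snippet below counts them for two common kernels. It assumes ideal memory traffic (every input read once, every output written once) and counts a multiply-add as two FLOPs. The function names and the dtype size table are illustrative assumptions.

```python
# Element sizes in bytes for a few common dtypes (assumed list, not exhaustive).
DTYPE_BYTES = {"fp32": 4, "fp16": 2, "bf16": 2}


def elementwise_add_inputs(n, dtype="fp16"):
    """c = a + b over n elements: read a, read b, write c; one add per element."""
    size = DTYPE_BYTES[dtype]
    bytes_moved = 3 * n * size
    flops = n
    return bytes_moved, flops


def gemm_inputs(m, n, k, dtype="fp16"):
    """C[m,n] = A[m,k] @ B[k,n]: read A and B once, write C once."""
    size = DTYPE_BYTES[dtype]
    bytes_moved = (m * k + k * n + m * n) * size
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    return bytes_moved, flops


add_bytes, add_flops = elementwise_add_inputs(1 << 20)
gemm_bytes, gemm_flops = gemm_inputs(4096, 4096, 4096)

print(f"vector add: bytes = {add_bytes}, flops = {add_flops}")
print(f"GEMM:       bytes = {gemm_bytes}, flops = {gemm_flops}")
```

These counts are lower bounds. A real kernel may move more bytes (for example, by re-reading tiles) or fewer from DRAM when caches absorb the traffic.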
Bytes (Memory Traffic)
Calculate the theoretical minimum bytes:
FLOPs (Floating Point Operations)
Calculate theoretical FLOPs:
Time
Get kernel time from profiling:
Interpretation Guide
| Situation | What It Means | Action |
|---|---|---|
| Memory efficiency < 50% | Not saturating memory bandwidth | Check coalescing, caching |
| Compute efficiency < 50%, compute bound | Low compute utilization | Check occupancy, ILP |
| Near 100% efficiency | Kernel is at the hardware limit | Further gains require moving fewer bytes or doing fewer FLOPs |
| Bandwidth > peak | Byte estimate is wrong, or caching reduces traffic | Refine byte calculation |
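The rows above can be checked by hand with a short calculation. This sketch assumes that efficiency means achieved rate divided by peak rate, which may differ from how wafer roofline reports it. The peaks and kernel numbers are placeholders.

```python
def roofline_report(bytes_moved, flops, time_ms, peak_flops_per_s, peak_bytes_per_s):
    """Derive achieved rates and efficiencies from the three kernel inputs."""
    seconds = time_ms / 1e3
    achieved_bw = bytes_moved / seconds
    achieved_flops = flops / seconds
    ai = flops / bytes_moved
    ridge = peak_flops_per_s / peak_bytes_per_s
    return {
        "bound": "compute" if ai > ridge else "memory",
        "memory_efficiency": achieved_bw / peak_bytes_per_s,
        "compute_efficiency": achieved_flops / peak_flops_per_s,
        # True when the byte estimate implies more bandwidth than the hardware has.
        "bandwidth_exceeds_peak": achieved_bw > peak_bytes_per_s,
    }


# Placeholder peaks and kernel measurements (assumptions for illustration).
PEAK_FLOPS_PER_S = 1000e12  # 1000 TFLOP/s
PEAK_BYTES_PER_S = 2e12     # 2 TB/s

report = roofline_report(
    bytes_moved=2_000_000,
    flops=1_000_000,
    time_ms=0.002,
    peak_flops_per_s=PEAK_FLOPS_PER_S,
    peak_bytes_per_s=PEAK_BYTES_PER_S,
)
for key, value in report.items():
    print(f"{key}: {value}")
```

For a memory-bound kernel the memory efficiency is the number to watch, and for a compute-bound kernel the compute efficiency. A True value for bandwidth_exceeds_peak corresponds to the last row of the table.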
Common Issues
Bandwidth exceeds peak
Your byte estimate may be higher than the memory traffic the kernel actually generates. Consider:
- Cache reuse reducing actual memory traffic
- Broadcast/sharing between threads
- Compressed data formats
Very low compute efficiency
Even for compute-bound kernels, check:
- Occupancy limits
- Instruction-level parallelism
- Warp divergence
- Register spilling
Neither bound seems right
You may be bound by something else:
- Launch overhead (small kernels)
- Synchronization
- PCIe transfers