TraceLens
TraceLens generates performance reports from GPU traces, compares multiple runs, and analyzes collective communication patterns. It helps you understand performance trends and validate optimizations.
Quick Start
Commands
wafer nvidia tracelens check
Verify that TraceLens is available.
wafer nvidia tracelens report
Generate a performance report from a trace.
| Option | Description |
|---|---|
| --output, -o | Output file path |
| --format | Output format: text, json, html |
| --top | Number of top items to show (default: 10) |
The report includes:
- Execution summary (duration, GPU utilization)
- Top kernels by time
- Memory transfer analysis
- Kernel launch overhead
- Recommendations
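For scripting, the report options above can be assembled programmatically. The following is a minimal sketch, not part of TraceLens: `build_report_command` is a hypothetical helper, `trace.json` is a placeholder file name, and passing the trace path as a positional argument is an assumption that this page does not confirm.

```python
import shlex

# Output formats listed in the options table for the report command.
ALLOWED_FORMATS = {"text", "json", "html"}


def build_report_command(trace_path, output=None, fmt=None, top=None):
    """Assemble an argv list for `wafer nvidia tracelens report`.

    Assumption: the trace path is given as a positional argument.
    Only the documented options (--output, --format, --top) are used.
    """
    argv = ["wafer", "nvidia", "tracelens", "report", trace_path]
    if output is not None:
        argv += ["--output", output]
    if fmt is not None:
        if fmt not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported format: {fmt!r}")
        argv += ["--format", fmt]
    if top is not None:
        argv += ["--top", str(top)]
    return argv


# Placeholder file names, for illustration only.
argv = build_report_command("trace.json", output="report.html", fmt="html", top=20)
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)`. Building a list rather than a single string avoids shell-quoting problems with paths that contain spaces.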
wafer nvidia tracelens compare
Compare two or more traces.
| Option | Description |
|---|---|
| --output, -o | Output file path |
| --format | Output format: text, json, html |
| --metric | Metric to compare: time, throughput, memory |
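To make concrete what comparing a metric across runs amounts to, here is a small sketch of a relative-change calculation against a baseline run. It illustrates the idea only and is not how TraceLens computes its comparison; the function names and all numbers are made up for the example.

```python
def relative_change(baseline, candidate):
    """Fractional change of `candidate` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline


def compare_runs(runs, metric):
    """Compare every run against the first run in `runs` for one metric.

    `runs` maps a run name to a dict of metric values; insertion order
    decides which run is the baseline.
    """
    names = list(runs)
    baseline = runs[names[0]][metric]
    return {name: relative_change(baseline, runs[name][metric]) for name in names[1:]}


# Made-up values, for illustration only.
runs = {
    "before": {"time": 120.0, "throughput": 850.0, "memory": 14.2},
    "after": {"time": 90.0, "throughput": 1130.0, "memory": 14.0},
}

for metric in ("time", "throughput", "memory"):
    for name, change in compare_runs(runs, metric).items():
        print(f"{metric:<10} {name}: {change:+.1%}")
```

When reading the result, note that the sign means different things per metric: a negative change is an improvement for time and memory, while a positive change is an improvement for throughput.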
wafer nvidia tracelens collective
Analyze collective communication patterns.
| Option | Description |
|---|---|
| --operation | Filter by operation: all_reduce, all_gather, etc. |
| --output, -o | Output file path |
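The effect of filtering by operation can be pictured as grouping communication events by type and totaling their time. The sketch below assumes a hypothetical event list of `(operation, duration_us)` pairs; it does not reflect TraceLens's internal data model or output format.

```python
from collections import defaultdict


def summarize_collectives(events, operation=None):
    """Total call count and time per collective operation.

    `events` is an iterable of (operation_name, duration_us) pairs.
    If `operation` is given, only events of that type are counted.
    """
    totals = defaultdict(lambda: {"count": 0, "total_us": 0.0})
    for op, duration_us in events:
        if operation is not None and op != operation:
            continue
        totals[op]["count"] += 1
        totals[op]["total_us"] += duration_us
    return dict(totals)


# Made-up events, for illustration only.
events = [
    ("all_reduce", 410.0),
    ("all_gather", 120.0),
    ("all_reduce", 395.0),
]

print(summarize_collectives(events))
print(summarize_collectives(events, operation="all_reduce"))
```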
Use Cases
Validating Optimizations
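Compare traces captured before and after an optimization with wafer nvidia tracelens compare.
Tracking Performance Over Time
Compare traces across multiple versions with wafer nvidia tracelens compare.
Distributed Training Analysis
Analyze multi-GPU communication with wafer nvidia tracelens collective.
Report Interpretation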
Key Metrics
- GPU Active Time: Time the GPU is executing kernels
- GPU Idle Time: Time waiting for CPU, memory, or synchronization
- Kernel Launch Overhead: Time between kernel submissions
- Memory Transfer Time: Time copying data between host and device
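One way to see how active and idle time relate is to compute them from raw kernel intervals. The sketch below is an illustration, not TraceLens's implementation: it assumes kernel start and end timestamps in microseconds that all fall inside the trace window, and the function names are hypothetical. Overlapping kernels (for example, on concurrent streams) are merged so that active time is not double-counted.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def gpu_time_breakdown(kernel_intervals, trace_start, trace_end):
    """Return active time, idle time, and utilization for a trace window."""
    active = sum(end - start for start, end in merge_intervals(kernel_intervals))
    total = trace_end - trace_start
    idle = total - active
    utilization = active / total if total > 0 else 0.0
    return {"active": active, "idle": idle, "utilization": utilization}


# Made-up kernel intervals in microseconds; the first two overlap.
kernels = [(0, 10), (5, 20), (30, 40)]
print(gpu_time_breakdown(kernels, trace_start=0, trace_end=50))
# → {'active': 30, 'idle': 20, 'utilization': 0.6}
```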
Common Bottlenecks
| Symptom | Likely Cause | Solution |
|---|---|---|
| Low GPU utilization | CPU bottleneck | Profile CPU, use async data loading |
| High transfer time | Excessive H2D/D2H copies | Batch transfers, use pinned memory |
| Many small kernels | Launch overhead | Fuse kernels, use CUDA graphs |
| Long sync time | Unnecessary synchronization | Remove explicit syncs |
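The table above can be turned into a simple triage helper. The sketch below is hypothetical: the metric names and every threshold are assumptions chosen for illustration, not values used or recommended by TraceLens, and they should be tuned per workload.

```python
# Rows of the Common Bottlenecks table: symptom -> (likely cause, solution).
BOTTLENECKS = {
    "Low GPU utilization": ("CPU bottleneck", "Profile CPU, use async data loading"),
    "High transfer time": ("Excessive H2D/D2H copies", "Batch transfers, use pinned memory"),
    "Many small kernels": ("Launch overhead", "Fuse kernels, use CUDA graphs"),
    "Long sync time": ("Unnecessary synchronization", "Remove explicit syncs"),
}


def flag_symptoms(metrics, low_util=0.5, high_transfer=0.2,
                  small_kernel_us=10.0, high_sync=0.1):
    """Return (symptom, cause, solution) rows whose threshold is crossed.

    All thresholds are illustrative assumptions. Fractions are relative
    to total trace time.
    """
    symptoms = []
    if metrics["utilization"] < low_util:
        symptoms.append("Low GPU utilization")
    if metrics["transfer_fraction"] > high_transfer:
        symptoms.append("High transfer time")
    if metrics["mean_kernel_us"] < small_kernel_us:
        symptoms.append("Many small kernels")
    if metrics["sync_fraction"] > high_sync:
        symptoms.append("Long sync time")
    return [(s, *BOTTLENECKS[s]) for s in symptoms]


# Made-up metrics, for illustration only.
metrics = {
    "utilization": 0.3,
    "transfer_fraction": 0.05,
    "mean_kernel_us": 4.0,
    "sync_fraction": 0.02,
}
for symptom, cause, solution in flag_symptoms(metrics):
    print(f"{symptom}: {cause} -> {solution}")
```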