TraceLens

TraceLens generates performance reports from GPU traces, compares multiple runs, and analyzes collective communication patterns. It helps you understand performance trends and validate optimizations.

Quick Start

# Check installation
wafer nvidia tracelens check

# Generate a performance report
wafer nvidia tracelens report ./profile.nsys-rep

# Compare two traces
wafer nvidia tracelens compare ./baseline.nsys-rep ./optimized.nsys-rep

# Analyze collective operations
wafer nvidia tracelens collective ./distributed-trace.nsys-rep

Commands

wafer nvidia tracelens check

Verify TraceLens is available:
wafer nvidia tracelens check

wafer nvidia tracelens report

Generate a performance report from a trace:
wafer nvidia tracelens report [OPTIONS] <trace-file>
Options:
Option          Description
--output, -o    Output file path
--format        Output format: text, json, html
--top           Number of top items to show (default: 10)
Example:
wafer nvidia tracelens report ./profile.nsys-rep --format html -o report.html
Report Contents:
  • Execution summary (duration, GPU utilization)
  • Top kernels by time
  • Memory transfer analysis
  • Kernel launch overhead
  • Recommendations
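
To produce reports for several traces at once, a simple loop over the report command works. A minimal sketch, assuming your traces live in ./traces/ (the paths and naming pattern are illustrative):

# Generate an HTML report for every trace in ./traces/ (illustrative paths)
mkdir -p reports
for trace in ./traces/*.nsys-rep; do
  name=$(basename "$trace" .nsys-rep)
  wafer nvidia tracelens report "$trace" --format html -o "reports/$name.html"
done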

wafer nvidia tracelens compare

Compare two or more traces:
wafer nvidia tracelens compare [OPTIONS] <trace1> <trace2> [<trace3>...]
Options:
Option          Description
--output, -o    Output file path
--format        Output format: text, json, html
--metric        Metric to compare: time, throughput, memory
Example:
wafer nvidia tracelens compare ./v1.nsys-rep ./v2.nsys-rep ./v3.nsys-rep
Output:
Trace Comparison
================

                    v1.nsys-rep    v2.nsys-rep    v3.nsys-rep
Total Duration      12.34s         10.21s         8.45s
GPU Active Time     8.12s          7.89s          7.23s
Speedup vs v1       1.00x          1.21x          1.46x

Kernel Comparison (top 5):
                        v1          v2          v3
volta_sgemm_128x64      5.58s       4.12s       3.21s  (-42%)
elementwise_kernel      2.85s       2.85s       2.01s  (-29%)
reduce_kernel           1.52s       1.23s       1.12s  (-26%)
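
The --format and --metric options make comparisons easier to archive or feed into other tooling. A minimal sketch using only the options documented above (file names are illustrative):

# Keep a human-readable and a machine-readable copy of the comparison (illustrative names)
wafer nvidia tracelens compare baseline.nsys-rep current.nsys-rep --format text -o compare.txt
wafer nvidia tracelens compare baseline.nsys-rep current.nsys-rep --format json -o compare.json

# Narrow the comparison to a single metric
wafer nvidia tracelens compare baseline.nsys-rep current.nsys-rep --metric memory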

wafer nvidia tracelens collective

Analyze collective communication patterns:
wafer nvidia tracelens collective [OPTIONS] <trace-file>
Options:
Option          Description
--operation     Filter by operation: all_reduce, all_gather, etc.
--output, -o    Output file path
Example:
wafer nvidia tracelens collective ./distributed.nsys-rep
Output:
Collective Communication Analysis
=================================

Total collective time: 2.34s (19% of GPU time)

Operation Breakdown:
  all_reduce       1.45s  (62%)  - 234 calls, avg 6.2ms
  all_gather       0.67s  (29%)  - 45 calls, avg 14.9ms
  broadcast        0.22s  (9%)   - 12 calls, avg 18.3ms

Recommendations:
  - Consider gradient bucketing to reduce all_reduce frequency
  - all_gather operations are larger than optimal (>1MB avg)
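
To follow up on recommendations like these, the --operation filter isolates one collective type at a time so you can inspect it in detail. A sketch (output file names are illustrative):

# Break the analysis down per collective type (illustrative output names)
wafer nvidia tracelens collective ./distributed.nsys-rep --operation all_reduce -o all_reduce.txt
wafer nvidia tracelens collective ./distributed.nsys-rep --operation all_gather -o all_gather.txt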

Use Cases

Validating Optimizations

Compare before and after optimization:
# Profile baseline
wafer nvidia nsys profile -o baseline.nsys-rep "python train.py"

# Make optimizations...

# Profile optimized version
wafer nvidia nsys profile -o optimized.nsys-rep "python train.py"

# Compare
wafer nvidia tracelens compare baseline.nsys-rep optimized.nsys-rep

Tracking Performance Over Time

Compare across multiple versions:
wafer nvidia tracelens compare \
  ./traces/v1.0.nsys-rep \
  ./traces/v1.1.nsys-rep \
  ./traces/v1.2.nsys-rep \
  --format html -o performance-history.html
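
If each release drops its trace into ./traces/, a glob keeps the command short as versions accumulate. A sketch that relies on lexical file ordering (the directory layout is illustrative):

# Rebuild the history report from every archived trace (lexical order: v1.0, v1.1, v1.2, ...)
wafer nvidia tracelens compare ./traces/*.nsys-rep \
  --format html -o performance-history.html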

Distributed Training Analysis

Analyze multi-GPU communication:
# Profile distributed training
wafer nvidia nsys profile -o distributed.nsys-rep \
  "torchrun --nproc_per_node=4 train.py"

# Analyze collectives
wafer nvidia tracelens collective distributed.nsys-rep

Report Interpretation

Key Metrics

  • GPU Active Time: Time the GPU is executing kernels
  • GPU Idle Time: Time waiting for CPU, memory, or synchronization
  • Kernel Launch Overhead: Time between kernel submissions
  • Memory Transfer Time: Time copying data between host and device
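
To track these metrics outside the report itself, the JSON format is easier to post-process than text. A sketch; the jq query uses a hypothetical field name, so inspect the file for the actual schema your TraceLens version emits:

# Export the report as JSON for dashboards or CI (field name below is hypothetical)
wafer nvidia tracelens report ./profile.nsys-rep --format json -o report.json
jq '.summary' report.json   # hypothetical key; open report.json to see the real structure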

Common Bottlenecks

Symptom                Likely Cause                  Solution
Low GPU utilization    CPU bottleneck                Profile CPU, use async data loading
High transfer time     Excessive H2D/D2H copies      Batch transfers, use pinned memory
Many small kernels     Launch overhead               Fuse kernels, use CUDA graphs
Long sync time         Unnecessary synchronization   Remove explicit syncs

Next Steps