TraceLens

TraceLens generates performance reports from GPU traces, compares multiple runs, and analyzes collective communication patterns. It helps you understand performance trends and validate optimizations.

Quick Start

# Check installation
wafer nvidia tracelens check

# Generate a performance report
wafer nvidia tracelens report ./profile.nsys-rep

# Compare two traces
wafer nvidia tracelens compare ./baseline.nsys-rep ./optimized.nsys-rep

# Analyze collective operations
wafer nvidia tracelens collective ./distributed-trace.nsys-rep

Commands

wafer nvidia tracelens check

Verify TraceLens is available:
wafer nvidia tracelens check

wafer nvidia tracelens report

Generate a performance report from a trace:
wafer nvidia tracelens report [OPTIONS] <trace-file>
Options:
Option          Description
--output, -o    Output file path
--format        Output format: text, json, html
--top           Number of top items to show (default: 10)
Example:
wafer nvidia tracelens report ./profile.nsys-rep --format html -o report.html
Report Contents:
  • Execution summary (duration, GPU utilization)
  • Top kernels by time
  • Memory transfer analysis
  • Kernel launch overhead
  • Recommendations
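
To produce reports for several traces at once, a simple loop over the report command works. A minimal sketch, assuming your traces live in ./traces/ (the paths and naming pattern are illustrative):

# Generate an HTML report for every trace in ./traces/ (illustrative paths)
mkdir -p reports
for trace in ./traces/*.nsys-rep; do
  name=$(basename "$trace" .nsys-rep)
  wafer nvidia tracelens report "$trace" --format html -o "reports/$name.html"
done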

wafer nvidia tracelens compare

Compare two or more traces:
wafer nvidia tracelens compare [OPTIONS] <trace1> <trace2> [<trace3>...]
Options:
Option          Description
--output, -o    Output file path
--format        Output format: text, json, html
--metric        Metric to compare: time, throughput, memory
Example:
wafer nvidia tracelens compare ./v1.nsys-rep ./v2.nsys-rep ./v3.nsys-rep
Output:
Trace Comparison
================

                    v1.nsys-rep    v2.nsys-rep    v3.nsys-rep
Total Duration      12.34s         10.21s         8.45s
GPU Active Time     8.12s          7.89s          7.23s
Speedup vs v1       1.00x          1.21x          1.46x

Kernel Comparison (top 5):
                        v1          v2          v3
volta_sgemm_128x64      5.58s       4.12s       3.21s  (-42%)
elementwise_kernel      2.85s       2.85s       2.01s  (-29%)
reduce_kernel           1.52s       1.23s       1.12s  (-26%)
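
The --format and --metric options make comparisons easier to archive or feed into other tooling. A minimal sketch using only the options documented above (file names are illustrative):

# Keep a human-readable and a machine-readable copy of the comparison (illustrative names)
wafer nvidia tracelens compare baseline.nsys-rep current.nsys-rep --format text -o compare.txt
wafer nvidia tracelens compare baseline.nsys-rep current.nsys-rep --format json -o compare.json

# Narrow the comparison to a single metric
wafer nvidia tracelens compare baseline.nsys-rep current.nsys-rep --metric memory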

wafer nvidia tracelens collective

Analyze collective communication patterns:
wafer nvidia tracelens collective [OPTIONS] <trace-file>
Options:
Option          Description
--operation     Filter by operation: all_reduce, all_gather, etc.
--output, -o    Output file path
Example:
wafer nvidia tracelens collective ./distributed.nsys-rep
Output:
Collective Communication Analysis
=================================

Total collective time: 2.34s (19% of GPU time)

Operation Breakdown:
  all_reduce       1.45s  (62%)  - 234 calls, avg 6.2ms
  all_gather       0.67s  (29%)  - 45 calls, avg 14.9ms
  broadcast        0.22s  (9%)   - 12 calls, avg 18.3ms

Recommendations:
  - Consider gradient bucketing to reduce all_reduce frequency
  - all_gather operations are larger than optimal (>1MB avg)
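
To follow up on recommendations like these, the --operation filter isolates one collective type at a time so you can inspect it in detail. A sketch (output file names are illustrative):

# Break the analysis down per collective type (illustrative output names)
wafer nvidia tracelens collective ./distributed.nsys-rep --operation all_reduce -o all_reduce.txt
wafer nvidia tracelens collective ./distributed.nsys-rep --operation all_gather -o all_gather.txt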

Use Cases

Validating Optimizations

Compare before and after optimization:
# Profile baseline
wafer nvidia nsys profile -o baseline.nsys-rep "python train.py"

# Make optimizations...

# Profile optimized version
wafer nvidia nsys profile -o optimized.nsys-rep "python train.py"

# Compare
wafer nvidia tracelens compare baseline.nsys-rep optimized.nsys-rep

Tracking Performance Over Time

Compare across multiple versions:
wafer nvidia tracelens compare \
  ./traces/v1.0.nsys-rep \
  ./traces/v1.1.nsys-rep \
  ./traces/v1.2.nsys-rep \
  --format html -o performance-history.html
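
If each release drops its trace into ./traces/, a glob keeps the command short as versions accumulate. A sketch that relies on lexical file ordering (the directory layout is illustrative):

# Rebuild the history report from every archived trace (lexical order: v1.0, v1.1, v1.2, ...)
wafer nvidia tracelens compare ./traces/*.nsys-rep \
  --format html -o performance-history.html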

Distributed Training Analysis

Analyze multi-GPU communication:
# Profile distributed training
wafer nvidia nsys profile -o distributed.nsys-rep \
  "torchrun --nproc_per_node=4 train.py"

# Analyze collectives
wafer nvidia tracelens collective distributed.nsys-rep

Report Interpretation

Key Metrics

  • GPU Active Time: Time the GPU is executing kernels
  • GPU Idle Time: Time waiting for CPU, memory, or synchronization
  • Kernel Launch Overhead: Time between kernel submissions
  • Memory Transfer Time: Time copying data between host and device
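
To track these metrics outside the report itself, the JSON format is easier to post-process than text. A sketch; the jq query uses a hypothetical field name, so inspect the file for the actual schema your TraceLens version emits:

# Export the report as JSON for dashboards or CI (field name below is hypothetical)
wafer nvidia tracelens report ./profile.nsys-rep --format json -o report.json
jq '.summary' report.json   # hypothetical key; open report.json to see the real structure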

Common Bottlenecks

Symptom                Likely Cause                  Solution
Low GPU utilization    CPU bottleneck                Profile CPU, use async data loading
High transfer time     Excessive H2D/D2H copies      Batch transfers, use pinned memory
Many small kernels     Launch overhead               Fuse kernels, use CUDA graphs
Long sync time         Unnecessary synchronization   Remove explicit syncs

Next Steps