Cross-Platform Comparison

The `wafer compare` command compares GPU traces captured on different platforms. Use it to compare NVIDIA and AMD implementations, align kernel executions between traces, and see where performance differs.

Quick Start

# Compare two traces
wafer compare analyze ./nvidia-trace.nsys-rep ./amd-trace.json

# Fusion comparison
wafer compare fusion ./baseline.json ./optimized.json

# Align and compare
wafer compare align ./trace-a.json ./trace-b.json

Commands

wafer compare analyze

Compare GPU traces from different sources:

wafer compare analyze [OPTIONS] <trace1> <trace2>

Options:

| Option | Description |
| --- | --- |
| `--output`, `-o` | Output file path |
| `--format` | Output format: `text`, `json`, `html` |
| `--metric` | Comparison metric: `time`, `throughput`, `efficiency` |

Example:

wafer compare analyze ./nvidia.nsys-rep ./amd.json --format html -o comparison.html

Output:

Cross-Platform Comparison
=========================

Trace A: nvidia.nsys-rep (NVIDIA H100)
Trace B: amd.json (AMD MI300X)

Overall:
                    NVIDIA H100    AMD MI300X    Difference
  Total Time        12.34s         14.21s        +15.2%
  GPU Active        10.12s         11.89s        +17.5%
  Kernel Count      1,234          1,234         -

Kernel Comparison (matched):
  matmul_kernel     5.23ms         4.89ms        -6.5% (AMD faster)
  attention         3.12ms         3.78ms        +21.2% (NVIDIA faster)
  softmax           1.02ms         0.98ms        -3.9%

Unmatched Kernels:
  NVIDIA-only: volta_sgemm_128x64 (cuBLAS)
  AMD-only: rocblas_sgemm_kernel
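
The Difference column in the sample output is consistent with a signed percentage change from Trace A to Trace B, relative to Trace A. The sketch below reproduces a few of those values under that assumption; `percent_difference` is a hypothetical helper written for illustration, not part of wafer.

```python
def percent_difference(time_a: float, time_b: float) -> float:
    """Signed change from Trace A to Trace B, as a percentage of Trace A.

    Assumption: this mirrors how the Difference column is derived.
    A positive value means Trace B took longer than Trace A.
    """
    if time_a <= 0:
        raise ValueError("time_a must be positive")
    return (time_b - time_a) / time_a * 100.0


# Values copied from the sample output (each pair uses the same unit).
rows = {
    "Total Time": (12.34, 14.21),
    "matmul_kernel": (5.23, 4.89),
    "attention": (3.12, 3.78),
}

for name, (a, b) in rows.items():
    print(f"{name:<14} {percent_difference(a, b):+.1f}%")
```

With these inputs the script prints +15.2%, -6.5%, and +21.2%, matching the corresponding rows of the report.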

wafer compare fusion

Compare fusion patterns between implementations:

wafer compare fusion [OPTIONS] <trace1> <trace2>

Options:

| Option | Description |
| --- | --- |
| `--output`, `-o` | Output file path |
| `--threshold` | Time threshold for fusion detection |

Example:

wafer compare fusion ./baseline.json ./optimized.json

Output:

Fusion Analysis
===============

Baseline: 23 separate kernel launches
Optimized: 8 fused kernels

Detected Fusions:
  1. softmax + dropout + scale → fused_attention_1
     Baseline: 3.2ms (3 kernels)
     Fused: 1.1ms (1 kernel)
     Speedup: 2.9x

  2. layernorm + add + gelu → fused_mlp_1
     Baseline: 2.8ms (3 kernels)
     Fused: 0.9ms (1 kernel)
     Speedup: 3.1x

Total fusion benefit: 4.2ms saved (18% of kernel time)
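
The per-fusion numbers in the report can be checked by hand: speedup is baseline time divided by fused time, and time saved is their difference. The following is a minimal sketch of that arithmetic using the two fusions listed; the function names are hypothetical and this is not wafer's implementation. The two listed fusions account for about 4.0 ms, so the 4.2 ms total presumably includes fusions the sample output does not show.

```python
def fusion_speedup(baseline_ms: float, fused_ms: float) -> float:
    """Ratio of unfused time to fused time; above 1.0 means fusion helped."""
    if fused_ms <= 0:
        raise ValueError("fused_ms must be positive")
    return baseline_ms / fused_ms


def time_saved_ms(baseline_ms: float, fused_ms: float) -> float:
    """Absolute time removed by fusing, in milliseconds."""
    return baseline_ms - fused_ms


# (fused kernel name, baseline ms, fused ms), copied from the sample output.
fusions = [
    ("fused_attention_1", 3.2, 1.1),
    ("fused_mlp_1", 2.8, 0.9),
]

for name, baseline, fused in fusions:
    print(
        f"{name}: {fusion_speedup(baseline, fused):.1f}x, "
        f"{time_saved_ms(baseline, fused):.1f}ms saved"
    )

total_saved = sum(time_saved_ms(b, f) for _, b, f in fusions)
print(f"Saved by listed fusions: {total_saved:.1f}ms")
```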

wafer compare align

Align kernel executions between traces:

wafer compare align [OPTIONS] <trace1> <trace2>

Options:

| Option | Description |
| --- | --- |
| `--output`, `-o` | Output file path |
| `--method` | Alignment method: `name`, `sequence`, `hybrid` |

Example:

wafer compare align ./run1.json ./run2.json --method sequence

Output:

Kernel Alignment
================

Alignment Method: sequence
Matched Kernels: 1,234 / 1,240 (99.5%)

Aligned Pairs:
  #    Trace A                    Trace B                    Time Diff
  1    matmul_kernel[0]           matmul_kernel[0]           +0.02ms
  2    softmax_kernel[0]          softmax_kernel[0]          -0.01ms
  3    attention_kernel[0]        attention_kernel[0]        +0.15ms
  ...

Unaligned (Trace A only):
  - debug_print_kernel (6 instances)

Unaligned (Trace B only):
  (none)
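
To make the difference between name-based and sequence-based alignment concrete, here is a small sketch of how each could work on lists of kernel names in launch order. This is an assumption about what the `name` and `sequence` methods mean, not a description of wafer's actual algorithm; both functions are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def align_by_name(trace_a, trace_b):
    """Pair the i-th launch of each kernel name in A with the i-th in B."""
    positions_b = defaultdict(list)
    for idx_b, name in enumerate(trace_b):
        positions_b[name].append(idx_b)

    used = defaultdict(int)  # how many launches of each name are paired so far
    pairs, only_a = [], []
    for idx_a, name in enumerate(trace_a):
        k = used[name]
        if k < len(positions_b[name]):
            pairs.append((idx_a, positions_b[name][k]))
            used[name] += 1
        else:
            only_a.append(idx_a)
    return pairs, only_a


def align_by_sequence(trace_a, trace_b):
    """Pair launches that fall in common in-order runs of both traces."""
    matcher = SequenceMatcher(None, trace_a, trace_b, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            pairs.append((block.a + offset, block.b + offset))
    matched_a = {i for i, _ in pairs}
    only_a = [i for i in range(len(trace_a)) if i not in matched_a]
    return pairs, only_a


# Trace A has an extra debug kernel that Trace B lacks.
trace_a = ["matmul_kernel", "debug_print_kernel", "softmax_kernel", "attention_kernel"]
trace_b = ["matmul_kernel", "softmax_kernel", "attention_kernel"]

print(align_by_name(trace_a, trace_b))
print(align_by_sequence(trace_a, trace_b))
```

On this input both approaches pair the three shared kernels and leave `debug_print_kernel` unaligned in Trace A. They diverge when launch order differs between traces: name matching ignores order, while sequence matching only pairs kernels that appear in the same relative order.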

Supported Trace Formats

| Format | Extension | Platform |
| --- | --- | --- |
| Nsight Systems | `.nsys-rep` | NVIDIA |
| NCU Report | `.ncu-rep` | NVIDIA |
| ROCprof Systems | `.json` | AMD |
| ROCprof Compute | `.csv` | AMD |
| Perfetto | `.perfetto`, `.pftrace` | Both |
| PyTorch Trace | `.json` | Both |
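
Because two of the listed formats share the `.json` extension, a file extension alone does not always identify the format. The sketch below maps a path to the candidate formats from the table; it is an illustrative helper with hypothetical names and says nothing about how wafer itself detects formats.

```python
from pathlib import Path

# Extension-to-format mapping transcribed from the table above.
FORMATS_BY_EXTENSION = {
    ".nsys-rep": ["Nsight Systems"],
    ".ncu-rep": ["NCU Report"],
    ".json": ["ROCprof Systems", "PyTorch Trace"],
    ".csv": ["ROCprof Compute"],
    ".perfetto": ["Perfetto"],
    ".pftrace": ["Perfetto"],
}


def candidate_formats(path: str) -> list[str]:
    """Return every listed format whose extension matches the path."""
    suffix = Path(path).suffix.lower()
    return list(FORMATS_BY_EXTENSION.get(suffix, []))


for example in ["./nvidia.nsys-rep", "./amd-output/trace.json", "./run.txt"]:
    print(example, "->", candidate_formats(example))
```

A `.json` path returns two candidates, so telling a ROCprof Systems trace from a PyTorch trace would require inspecting the file contents.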

Use Cases

NVIDIA vs AMD Comparison

Compare the same workload on different GPUs:

# Profile on NVIDIA
wafer nvidia nsys profile -o nvidia.nsys-rep "python train.py"

# Profile on AMD
wafer amd rocprof-systems run -o amd-output "python train.py"

# Compare
wafer compare analyze ./nvidia.nsys-rep ./amd-output/trace.json

Optimization Validation

Verify optimizations work across platforms:

# Baseline on both platforms
wafer compare analyze ./baseline-nvidia.nsys-rep ./baseline-amd.json -o baseline.html

# Optimized on both platforms
wafer compare analyze ./optimized-nvidia.nsys-rep ./optimized-amd.json -o optimized.html

Kernel Matching

Understand how similar operations map between platforms:

wafer compare align \
  --method hybrid \
  ./nvidia-trace.json \
  ./amd-trace.json \
  -o alignment.json

Interpretation

Performance Ratio

< 1.0: Trace B is faster
= 1.0: Equal performance
> 1.0: Trace A is faster
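
The ratio is not defined explicitly here. The thresholds read naturally if it is taken as Trace B's time divided by Trace A's time, which also agrees with the sign of the Difference column in the `analyze` output. Under that assumption, a minimal sketch of the interpretation looks like this (`performance_ratio` and `interpret_ratio` are hypothetical names):

```python
def performance_ratio(time_a: float, time_b: float) -> float:
    """Assumed definition: Trace B time divided by Trace A time."""
    if time_a <= 0 or time_b <= 0:
        raise ValueError("times must be positive")
    return time_b / time_a


def interpret_ratio(ratio: float, tolerance: float = 0.0) -> str:
    """Map a ratio to the verdicts listed above.

    `tolerance` widens the band treated as equal, which can help
    when run-to-run noise makes exact equality unlikely.
    """
    if abs(ratio - 1.0) <= tolerance:
        return "Equal performance"
    return "Trace B is faster" if ratio < 1.0 else "Trace A is faster"


# Timings taken from the sample analyze output.
print(interpret_ratio(performance_ratio(12.34, 14.21)))  # Total Time
print(interpret_ratio(performance_ratio(5.23, 4.89)))    # matmul_kernel
```

For the sample timings, the total-time ratio is above 1.0 (Trace A, the NVIDIA run, is faster overall) while the `matmul_kernel` ratio is below 1.0 (Trace B, the AMD run, is faster for that kernel).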

Common Differences

| Observation | Likely Cause |
| --- | --- |
| Different kernel counts | Library differences (cuBLAS vs rocBLAS) |
| Large time variance | Different fusion strategies |
| Unmatched kernels | Platform-specific optimizations |
| Memory time difference | Different memory subsystems |

Next Steps

NVIDIA Profiling

Capture NVIDIA traces.

AMD Profiling

Capture AMD traces.

TraceLens

Compare NVIDIA traces.

AI Agent

Analyze traces with AI.
