
Cross-Platform Comparison

The wafer compare command compares GPU traces captured on different platforms. Use it to compare NVIDIA and AMD implementations, align kernel executions between traces, and see where performance differs.

Quick Start

# Compare two traces
wafer compare analyze ./nvidia-trace.nsys-rep ./amd-trace.json

# Fusion comparison
wafer compare fusion ./baseline.json ./optimized.json

# Align and compare
wafer compare align ./trace-a.json ./trace-b.json

Commands

wafer compare analyze

Compare GPU traces from different sources:
wafer compare analyze [OPTIONS] <trace1> <trace2>
Options:
Option          Description
--output, -o    Output file path
--format        Output format: text, json, html
--metric        Comparison metric: time, throughput, efficiency
Example:
wafer compare analyze ./nvidia.nsys-rep ./amd.json --format html -o comparison.html
Output:
Cross-Platform Comparison
=========================

Trace A: nvidia.nsys-rep (NVIDIA H100)
Trace B: amd.json (AMD MI300X)

Overall:
                    NVIDIA H100    AMD MI300X    Difference
  Total Time        12.34s         14.21s        +15.2%
  GPU Active        10.12s         11.89s        +17.5%
  Kernel Count      1,234          1,234         -

Kernel Comparison (matched):
  matmul_kernel     5.23ms         4.89ms        -6.5% (AMD faster)
  attention         3.12ms         3.78ms        +21.2% (NVIDIA faster)
  softmax           1.02ms         0.98ms        -3.9%

Unmatched Kernels:
  NVIDIA-only: volta_sgemm_128x64 (cuBLAS)
  AMD-only: rocblas_sgemm_kernel
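
The Difference column in the sample output is consistent with the relative change of Trace B against Trace A, that is (B - A) / A. The sketch below reproduces the sample percentages under that assumption. The function name `percent_difference` is hypothetical and is not part of wafer.

```python
def percent_difference(time_a: float, time_b: float) -> float:
    """Relative change of Trace B versus Trace A, in percent.

    Assumption: this mirrors the sign convention of the sample output,
    where a positive value means Trace B took longer than Trace A.
    """
    if time_a <= 0:
        raise ValueError("time_a must be positive")
    return (time_b - time_a) / time_a * 100.0


# Values copied from the sample output (A = NVIDIA H100, B = AMD MI300X).
rows = {
    "Total Time": (12.34, 14.21),
    "GPU Active": (10.12, 11.89),
    "matmul_kernel": (5.23, 4.89),
    "attention": (3.12, 3.78),
    "softmax": (1.02, 0.98),
}

for name, (a, b) in rows.items():
    print(f"{name:<14} {percent_difference(a, b):+.1f}%")
```

With the time metric, a negative value means Trace B was faster, which matches the "(AMD faster)" annotation on matmul_kernel.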

wafer compare fusion

Compare fusion patterns between implementations:
wafer compare fusion [OPTIONS] <trace1> <trace2>
Options:
Option          Description
--output, -o    Output file path
--threshold     Time threshold for fusion detection
Example:
wafer compare fusion ./baseline.json ./optimized.json
Output:
Fusion Analysis
===============

Baseline: 23 separate kernel launches
Optimized: 8 fused kernels

Detected Fusions:
  1. softmax + dropout + scale → fused_attention_1
     Baseline: 3.2ms (3 kernels)
     Fused: 1.1ms (1 kernel)
     Speedup: 2.9x

  2. layernorm + add + gelu → fused_mlp_1
     Baseline: 2.8ms (3 kernels)
     Fused: 0.9ms (1 kernel)
     Speedup: 3.1x

Total fusion benefit: 4.2ms saved (18% of kernel time)
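
The per-fusion figures in the sample output can be checked by hand: the speedup matches baseline time divided by fused time, and the saving is their difference. The sketch below is a minimal illustration of that arithmetic; the `Fusion` class is hypothetical and does not reflect wafer's internal data model or output schema.

```python
from dataclasses import dataclass


@dataclass
class Fusion:
    name: str
    baseline_ms: float  # summed time of the separate baseline kernels
    fused_ms: float     # time of the single fused kernel

    @property
    def speedup(self) -> float:
        return self.baseline_ms / self.fused_ms

    @property
    def saved_ms(self) -> float:
        return self.baseline_ms - self.fused_ms


# Values copied from the two fusions listed in the sample output.
fusions = [
    Fusion("fused_attention_1", 3.2, 1.1),
    Fusion("fused_mlp_1", 2.8, 0.9),
]

for f in fusions:
    print(f"{f.name}: {f.speedup:.1f}x, {f.saved_ms:.1f}ms saved")

listed_saved_ms = sum(f.saved_ms for f in fusions)
print(f"Saved by listed fusions: {listed_saved_ms:.1f}ms")
```

The two listed fusions account for about 4.0ms, slightly below the 4.2ms total in the sample output; the sample may include fusions that are not listed.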

wafer compare align

Align kernel executions between traces:
wafer compare align [OPTIONS] <trace1> <trace2>
Options:
Option          Description
--output, -o    Output file path
--method        Alignment method: name, sequence, hybrid
Example:
wafer compare align ./run1.json ./run2.json --method sequence
Output:
Kernel Alignment
================

Alignment Method: sequence
Matched Kernels: 1,234 / 1,240 (99.5%)

Aligned Pairs:
  #    Trace A                    Trace B                    Time Diff
  1    matmul_kernel[0]           matmul_kernel[0]           +0.02ms
  2    softmax_kernel[0]          softmax_kernel[0]          -0.01ms
  3    attention_kernel[0]        attention_kernel[0]        +0.15ms
  ...

Unaligned (Trace A only):
  - debug_print_kernel (6 instances)

Unaligned (Trace B only):
  (none)
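
To make the idea of alignment concrete, here is a minimal sketch of one plausible name-based strategy: pair the i-th launch of each kernel name in Trace A with the i-th launch of the same name in Trace B. This is an assumption for illustration only. The page does not describe how wafer implements the name, sequence, or hybrid methods, and `align_by_name` is a hypothetical helper.

```python
from collections import defaultdict


def align_by_name(trace_a: list[str], trace_b: list[str]):
    """Pair the i-th occurrence of each kernel name in A with the i-th in B.

    Returns (pairs, only_a, only_b), all expressed as launch indices.
    """
    # Record where each kernel name occurs in Trace B, in launch order.
    positions_b = defaultdict(list)
    for idx_b, name in enumerate(trace_b):
        positions_b[name].append(idx_b)

    seen_a = defaultdict(int)  # occurrences of each name seen so far in A
    pairs, only_a, matched_b = [], [], set()
    for idx_a, name in enumerate(trace_a):
        occurrence = seen_a[name]
        seen_a[name] += 1
        candidates = positions_b.get(name, [])
        if occurrence < len(candidates):
            pairs.append((idx_a, candidates[occurrence]))
            matched_b.add(candidates[occurrence])
        else:
            only_a.append(idx_a)

    only_b = [i for i in range(len(trace_b)) if i not in matched_b]
    return pairs, only_a, only_b


# Toy traces: Trace A has one extra debug kernel that Trace B lacks.
trace_a = ["matmul_kernel", "debug_print_kernel", "softmax_kernel", "attention_kernel"]
trace_b = ["matmul_kernel", "softmax_kernel", "attention_kernel"]

pairs, only_a, only_b = align_by_name(trace_a, trace_b)
match_rate = len(pairs) / len(trace_a) * 100
print(pairs, only_a, only_b, f"{match_rate:.1f}%")
```

In the sample output, 1,234 matched out of 1,240 is 99.5%, and the gap of 6 equals the number of unaligned debug_print_kernel instances in Trace A.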

Supported Trace Formats

Format             Extension              Platform
Nsight Systems     .nsys-rep              NVIDIA
NCU Report         .ncu-rep               NVIDIA
ROCprof Systems    .json                  AMD
ROCprof Compute    .csv                   AMD
Perfetto           .perfetto, .pftrace    Both
PyTorch Trace      .json                  Both
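
Note that the .json extension appears twice in the table, so the extension alone does not identify the format. The sketch below encodes the table as a lookup and returns every candidate for a given path. It is only an illustration of that ambiguity: `candidate_formats` is a hypothetical helper, and how wafer itself distinguishes the two .json formats is not documented here.

```python
from pathlib import Path

# (format name, platform) candidates per extension, copied from the table.
FORMATS_BY_EXTENSION = {
    ".nsys-rep": [("Nsight Systems", "NVIDIA")],
    ".ncu-rep": [("NCU Report", "NVIDIA")],
    ".json": [("ROCprof Systems", "AMD"), ("PyTorch Trace", "Both")],
    ".csv": [("ROCprof Compute", "AMD")],
    ".perfetto": [("Perfetto", "Both")],
    ".pftrace": [("Perfetto", "Both")],
}


def candidate_formats(path: str) -> list[tuple[str, str]]:
    """Return all table entries whose extension matches the file name."""
    return FORMATS_BY_EXTENSION.get(Path(path).suffix.lower(), [])


for path in ["./nvidia.nsys-rep", "./amd-output/trace.json", "./notes.txt"]:
    print(path, candidate_formats(path))
```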

Use Cases

NVIDIA vs AMD Comparison

Compare the same workload on different GPUs:
# Profile on NVIDIA
wafer nvidia nsys profile -o nvidia.nsys-rep "python train.py"

# Profile on AMD
wafer amd rocprof-systems run -o amd-output "python train.py"

# Compare
wafer compare analyze ./nvidia.nsys-rep ./amd-output/trace.json

Optimization Validation

Verify optimizations work across platforms:
# Baseline on both platforms
wafer compare analyze ./baseline-nvidia.nsys-rep ./baseline-amd.json -o baseline.html

# Optimized on both platforms
wafer compare analyze ./optimized-nvidia.nsys-rep ./optimized-amd.json -o optimized.html

Kernel Matching

Understand how similar operations map between platforms:
wafer compare align \
  --method hybrid \
  ./nvidia-trace.json \
  ./amd-trace.json \
  -o alignment.json

Interpretation

Performance Ratio

  • < 1.0: Trace B is faster
  • = 1.0: Equal performance
  • > 1.0: Trace A is faster
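
The page does not define how the ratio is computed. One reading that fits the bullets above, and the sign convention of the Difference column in the analyze output, is Trace B time divided by Trace A time. The sketch below assumes that definition; both function names are hypothetical.

```python
def performance_ratio(time_a: float, time_b: float) -> float:
    """Assumed definition: Trace B time divided by Trace A time."""
    if time_a <= 0:
        raise ValueError("time_a must be positive")
    return time_b / time_a


def interpret(ratio: float) -> str:
    """Map a ratio to the interpretation listed above."""
    if ratio < 1.0:
        return "Trace B is faster"
    if ratio > 1.0:
        return "Trace A is faster"
    return "Equal performance"


# Times copied from the analyze sample output (A = NVIDIA, B = AMD).
total = performance_ratio(12.34, 14.21)
matmul = performance_ratio(5.23, 4.89)
print(f"Total Time:    {total:.3f} -> {interpret(total)}")
print(f"matmul_kernel: {matmul:.3f} -> {interpret(matmul)}")
```

In practice, measured times are rarely exactly equal, so a ratio close to 1.0 is better read as within run-to-run noise than as a clear win for either trace.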

Common Differences

Observation                Likely Cause
Different kernel counts    Library differences (cuBLAS vs rocBLAS)
Large time variance        Different fusion strategies
Unmatched kernels          Platform-specific optimizations
Memory time difference     Different memory subsystems

Next Steps