
ROCprofiler Systems

ROCprofiler Systems (formerly Omnitrace) provides system-wide profiling for AMD GPU applications. It captures CPU-GPU interaction, kernel timelines, memory transfers, and system events, filling the same role that NVIDIA Nsight Systems fills for CUDA applications.

Quick Start

# Check installation
wafer amd rocprof-systems check

# Profile an application
wafer amd rocprof-systems run "python train.py"

# Analyze results
wafer amd rocprof-systems analyze ./rocprof-systems-output

# Sample-based profiling
wafer amd rocprof-systems sample "python train.py"

Commands

wafer amd rocprof-systems check

Verify installation:
wafer amd rocprof-systems check
Output:
ROCprofiler Systems: installed
Version: 2.0.0
Path: /opt/rocm/bin/rocprof-sys
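
If the check fails, it can help to see which profiler executables are actually visible on your PATH. The following is a minimal sketch, not how `wafer` performs its check. It assumes only that the ROCprofiler Systems executables have names starting with `rocprof-sys`, as in the path shown above. The helper name `find_rocprof_sys_binaries` is hypothetical.

```python
import os
from pathlib import Path


def find_rocprof_sys_binaries(search_dirs=None):
    """Return executable files whose names start with 'rocprof-sys'.

    search_dirs defaults to the directories listed in the PATH variable.
    """
    if search_dirs is None:
        search_dirs = os.environ.get("PATH", "").split(os.pathsep)
    found = []
    for directory in search_dirs:
        if not directory:
            continue  # skip empty PATH entries
        path = Path(directory)
        if not path.is_dir():
            continue
        try:
            entries = sorted(path.iterdir())
        except OSError:
            continue  # unreadable directory: ignore it
        for entry in entries:
            if (
                entry.name.startswith("rocprof-sys")
                and entry.is_file()
                and os.access(entry, os.X_OK)
            ):
                found.append(entry)
    return found


if __name__ == "__main__":
    binaries = find_rocprof_sys_binaries()
    if binaries:
        for binary in binaries:
            print(binary)
    else:
        print("no rocprof-sys executables found on PATH")
```

An empty result usually means the ROCm bin directory is not on PATH.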

wafer amd rocprof-systems run

Profile an application:
wafer amd rocprof-systems run [OPTIONS] "<command>"
Options:
Option         Description
--output, -o   Output directory
--trace        Trace types: hip, hsa, marker, kernel
--profile      Profile types: cpu, gpu, memory
--duration     Maximum profiling duration
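
When profiling runs are launched from a script, the options above map onto an argument list. The sketch below only assembles that list. It uses the flags and the trace and profile types listed in the table and nothing else. The helper name `build_run_command` is hypothetical, and the duration value is passed through unchanged because this page does not state its unit.

```python
import shlex

# Values taken from the options table; anything else is rejected early.
TRACE_TYPES = {"hip", "hsa", "marker", "kernel"}
PROFILE_TYPES = {"cpu", "gpu", "memory"}


def build_run_command(command, trace=None, profile=None, duration=None, output=None):
    """Build the argv list for `wafer amd rocprof-systems run`."""
    for name, values, allowed in (
        ("trace", trace, TRACE_TYPES),
        ("profile", profile, PROFILE_TYPES),
    ):
        unknown = set(values or ()) - allowed
        if unknown:
            raise ValueError(f"unknown {name} type(s): {sorted(unknown)}")

    argv = ["wafer", "amd", "rocprof-systems", "run"]
    if output is not None:
        argv += ["--output", str(output)]
    if trace:
        argv += ["--trace", ",".join(trace)]
    if profile:
        argv += ["--profile", ",".join(profile)]
    if duration is not None:
        argv += ["--duration", str(duration)]
    argv.append(command)  # the profiled command is passed as one argument
    return argv


argv = build_run_command(
    "python train.py", trace=["hip", "kernel", "marker"], duration=60
)
print(shlex.join(argv))
# prints: wafer amd rocprof-systems run --trace hip,kernel,marker --duration 60 'python train.py'
```

The resulting list can be handed to `subprocess.run(argv, check=True)`. Passing a list rather than a shell string avoids quoting problems in the profiled command.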
Examples:
# Basic profiling
wafer amd rocprof-systems run "python train.py"

# Profile with specific traces
wafer amd rocprof-systems run \
  --trace hip,kernel,marker \
  "python train.py"

# Profile CPU and GPU
wafer amd rocprof-systems run \
  --profile cpu,gpu \
  "python train.py"

# Limit duration
wafer amd rocprof-systems run \
  --duration 60 \
  "python server.py"

wafer amd rocprof-systems analyze

Analyze profiling output:
wafer amd rocprof-systems analyze [OPTIONS] <output-dir>
Options:
Option        Description
--summary     Show summary statistics
--kernels     List kernel timings
--transfers   Memory transfer analysis
--json        Output as JSON
Example:
wafer amd rocprof-systems analyze ./rocprof-systems-output --summary
Output:
Profile Summary
===============

Duration: 45.23s
GPU Kernels: 12,345
HIP API Calls: 23,456
Memory Transfers: 1,234

Top Kernels by Time:
  1. gfx942::matmul_kernel     34.2% (15.47s)
  2. gfx942::attention_kernel  21.3% (9.63s)
  3. gfx942::softmax_kernel    12.1% (5.47s)

Memory Transfers:
  Host→Device: 4.5 GB (456 transfers)
  Device→Host: 2.1 GB (234 transfers)

CPU-GPU Sync Time: 3.21s (7.1%)
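
The percentages in this summary are shares of the total duration, which you can confirm from the numbers themselves. The sketch below parses an excerpt of the sample output above and recomputes each share. It assumes the text layout matches that sample exactly. The regular expressions and function names are illustrative and not part of the tool.

```python
import re

# Excerpt of the sample summary shown above.
SAMPLE_SUMMARY = """\
Duration: 45.23s

Top Kernels by Time:
  1. gfx942::matmul_kernel     34.2% (15.47s)
  2. gfx942::attention_kernel  21.3% (9.63s)
  3. gfx942::softmax_kernel    12.1% (5.47s)

CPU-GPU Sync Time: 3.21s (7.1%)
"""

KERNEL_LINE = re.compile(r"^\s*\d+\.\s+(\S+)\s+([\d.]+)%\s+\(([\d.]+)s\)\s*$")


def parse_duration(text):
    """Return the total duration in seconds, or None if it is absent."""
    match = re.search(r"^Duration:\s*([\d.]+)s", text, re.MULTILINE)
    return float(match.group(1)) if match else None


def parse_top_kernels(text):
    """Return (name, reported_percent, seconds) for each ranked kernel line."""
    kernels = []
    for line in text.splitlines():
        match = KERNEL_LINE.match(line)
        if match:
            name, percent, seconds = match.groups()
            kernels.append((name, float(percent), float(seconds)))
    return kernels


def share_of_duration(seconds, duration):
    """Percentage of the total duration, rounded to one decimal place."""
    return round(seconds / duration * 100, 1)


duration = parse_duration(SAMPLE_SUMMARY)
for name, reported, seconds in parse_top_kernels(SAMPLE_SUMMARY):
    print(f"{name}: reported {reported}%, recomputed {share_of_duration(seconds, duration)}%")
```

For the sample above, each recomputed share matches the reported one, and 3.21 s of sync time out of 45.23 s gives the 7.1% shown.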

wafer amd rocprof-systems sample

Sample-based CPU profiling with GPU correlation:
wafer amd rocprof-systems sample [OPTIONS] "<command>"
Options:
Option         Description
--frequency    Sampling frequency in Hz (default: 1000)
--output, -o   Output directory
Example:
wafer amd rocprof-systems sample --frequency 100 "python train.py"
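
The sampling frequency sets both the time resolution and the amount of data collected. As a rough illustration, the arithmetic below converts a frequency into the interval between samples and an upper bound on samples per thread. It assumes samples are taken periodically on each sampled thread, which is a simplification and not a statement about the profiler's internals.

```python
def sampling_interval_ms(frequency_hz):
    """Time between consecutive samples, in milliseconds."""
    if frequency_hz <= 0:
        raise ValueError("frequency must be positive")
    return 1000.0 / frequency_hz


def max_samples_per_thread(frequency_hz, duration_s):
    """Upper bound on samples for one thread over a run of duration_s seconds."""
    return round(frequency_hz * duration_s)


for hz in (1000, 100):
    print(
        f"{hz} Hz -> {sampling_interval_ms(hz):.1f} ms between samples, "
        f"up to {max_samples_per_thread(hz, 60)} samples per thread in 60 s"
    )
# prints:
# 1000 Hz -> 1.0 ms between samples, up to 60000 samples per thread in 60 s
# 100 Hz -> 10.0 ms between samples, up to 6000 samples per thread in 60 s
```

Dropping from the default 1000 Hz to 100 Hz, as in the example above, cuts the sample count by a factor of ten at the cost of coarser timing.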

wafer amd rocprof-systems instrument

Binary instrumentation for detailed tracing:
wafer amd rocprof-systems instrument [OPTIONS] "<command>"
Options:
Option        Description
--functions   Functions to instrument
--modules     Modules to instrument

wafer amd rocprof-systems query

Query profiling data:
wafer amd rocprof-systems query [OPTIONS] <output-dir>
Options:
Option     Description
--sql      SQL query for trace data
--kernel   Filter by kernel name

What Gets Captured

ROCprofiler Systems captures the following:
  • HIP API calls: hipMemcpy, hipLaunchKernel, etc.
  • HSA events: Low-level GPU runtime events
  • GPU kernels: Execution times, grid dimensions
  • Memory transfers: H2D, D2H, D2D copies
  • CPU activity: Thread scheduling, function calls
  • Markers: Custom annotations (roctx)
  • System events: Context switches, interrupts

Adding Markers

Annotate your code for better visibility:
import roctx

# Function annotation
@roctx.range("forward_pass")
def forward(model, x):
    return model(x)

# Context manager
with roctx.range("backward_pass"):
    loss.backward()

# Point marker (a single instantaneous event, not a push/pop range)
roctx.mark("checkpoint")
Then profile with markers:
wafer amd rocprof-systems run --trace marker,kernel "python train.py"
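
Annotated code should still run on machines without the profiler installed. One way to arrange that is a small fallback, sketched below. It assumes the `roctx.range` interface used in the example above. The name `profiling_range` is a hypothetical wrapper, not part of roctx.

```python
import contextlib

try:
    import roctx  # present only where the roctx Python bindings are installed

    profiling_range = roctx.range
except (ImportError, AttributeError):

    @contextlib.contextmanager
    def profiling_range(name):
        """No-op stand-in: accepts a range name and records nothing."""
        yield


# Works as a decorator...
@profiling_range("forward_pass")
def forward(x):
    return x * 2


# ...and as a context manager.
with profiling_range("backward_pass"):
    result = forward(21)

print(result)
```

The fallback works in both positions because context managers created with `contextlib.contextmanager` can also be used as decorators.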

Output Formats

ROCprofiler Systems generates multiple output formats:
Format     File     Use
Perfetto   .proto   Chrome trace viewer
JSON       .json    Custom analysis
Text       .txt     Quick summary
View Perfetto traces at ui.perfetto.dev.
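
To find the files from a run, you can group everything in the output directory by the extensions listed above. The sketch below searches recursively because the directory layout is not documented here; the layout is an assumption. The helper name `group_outputs` is hypothetical.

```python
from collections import defaultdict
from pathlib import Path

# Extension-to-format mapping taken from the table above.
FORMATS = {".proto": "Perfetto", ".json": "JSON", ".txt": "Text"}


def group_outputs(output_dir):
    """Map each format name to the sorted list of matching files under output_dir."""
    grouped = defaultdict(list)
    for path in sorted(Path(output_dir).rglob("*")):
        if path.is_file() and path.suffix in FORMATS:
            grouped[FORMATS[path.suffix]].append(path)
    return dict(grouped)


if __name__ == "__main__":
    for fmt, files in group_outputs("./rocprof-systems-output").items():
        print(f"{fmt}: {[str(f) for f in files]}")
```

The `.proto` files found this way are the ones to load in the Perfetto UI.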

Comparison with Nsys

Feature         ROCprof Systems   Nsys
GPU Runtime     HIP/HSA           CUDA
CPU Profiling   Yes               Yes
Markers         roctx             NVTX
Output          Perfetto/JSON     nsys-rep
Timeline View   Perfetto UI       Nsys UI

Troubleshooting

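If the profiler is not found, install it from the ROCm packages (or build it from source). If it is already installed, add the ROCm bin directory to your PATH:
sudo apt install rocprofiler-systems
# or, if it is already installed under /opt/rocm
export PATH=/opt/rocm/bin:$PATH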
If HIP calls or GPU kernels are missing from the trace, make sure HIP tracing is enabled:
wafer amd rocprof-systems run --trace hip,kernel "python train.py"
If the trace is too large, limit the trace scope or the duration:
wafer amd rocprof-systems run --duration 30 --trace kernel "python train.py"

Next Steps