
Nsight Systems

NVIDIA Nsight Systems (nsys) provides system-wide profiling to understand CPU-GPU interaction, kernel launches, memory transfers, and overall application behavior. Wafer integrates nsys for easy profiling and analysis.

Quick Start

# Check nsys installation
wafer nvidia nsys check

# Profile a command
wafer nvidia nsys profile "python train.py"

# Analyze an existing trace
wafer nvidia nsys analyze ./profile.nsys-rep

Commands

wafer nvidia nsys check

Verify nsys installation and version:
wafer nvidia nsys check
Output:
Nsight Systems: installed
Version: 2024.1.1
Path: /usr/local/cuda/bin/nsys

wafer nvidia nsys profile

Capture a system-wide profile:
wafer nvidia nsys profile [OPTIONS] "<command>"
Options:
Option            Description
--output, -o      Output file path (default: profile.nsys-rep)
--trace           What to trace: cuda, nvtx, osrt, cudnn, etc.
--duration        Maximum capture duration in seconds
--delay           Delay before starting capture
--sample          Enable CPU sampling
--target, -t      Run on a remote GPU target
Examples:
# Basic profile
wafer nvidia nsys profile "python train.py"

# Profile with NVTX markers and CUDA API
wafer nvidia nsys profile --trace cuda,nvtx "python train.py"

# Profile for 30 seconds max
wafer nvidia nsys profile --duration 30 "python server.py"

# Profile on remote target
wafer nvidia nsys profile --target h100-box "python train.py"

wafer nvidia nsys analyze

Analyze an nsys trace file:
wafer nvidia nsys analyze [OPTIONS] <trace-file>
Options:
OptionDescription
--summaryShow summary statistics
--kernelsList kernel execution times
--transfersShow memory transfer analysis
--jsonOutput as JSON
Example:
wafer nvidia nsys analyze ./profile.nsys-rep --summary
Output:
Profile Summary
===============
Duration: 12.34s
GPU Kernels: 1,234
Memory Transfers: 567

Top Kernels by Time:
  1. volta_sgemm_128x64_nn   45.2% (5.58s)
  2. elementwise_kernel      23.1% (2.85s)
  3. reduce_kernel           12.3% (1.52s)

Memory Transfers:
  Host→Device: 2.3 GB (234 transfers)
  Device→Host: 1.1 GB (123 transfers)
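
Pass --json to post-process the results in scripts. The JSON schema is not documented here, so the field names in this sketch ("kernels", "name", "time_pct") and the output path analysis.json are assumptions for illustration only:

import json

# Load a report saved from:
#   wafer nvidia nsys analyze ./profile.nsys-rep --json > analysis.json
# NOTE: "kernels", "name", and "time_pct" are assumed field names;
# check the actual schema emitted on your system.
with open("analysis.json") as f:
    report = json.load(f)

for kernel in report.get("kernels", []):
    print(f'{kernel["name"]}: {kernel["time_pct"]}% of GPU time')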

What Nsys Captures

Nsight Systems traces multiple aspects of your application:
  • CUDA API calls: cudaMemcpy, cudaMalloc, kernel launches
  • GPU kernels: Execution time, grid/block dimensions
  • Memory transfers: H2D, D2H, D2D copies
  • CPU activity: Thread scheduling, function calls
  • NVTX markers: Custom annotations in your code
  • cuDNN/cuBLAS: Library call timings
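
As a concrete illustration, even a few lines of PyTorch (hypothetical tensor sizes, shown only as a sketch) touch several of these categories: the .to("cuda") calls appear as Host→Device transfers, the matmul and ReLU launch GPU kernels, and the surrounding allocation and launch calls show up on the CUDA API rows of the timeline.

import torch

# Hypothetical tensor sizes, for illustration only.
x = torch.randn(1024, 1024)      # created on the CPU
w = torch.randn(1024, 1024)

x = x.to("cuda")                 # appears as a Host→Device memcpy
w = w.to("cuda")

y = torch.relu(x @ w)            # matmul + elementwise kernels on the GPU
y_cpu = y.cpu()                  # appears as a Device→Host memcpy
torch.cuda.synchronize()         # ensure all GPU work lands in the trace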

Adding NVTX Markers

Annotate your code for better trace visibility:
import torch
import nvtx  # pip install nvtx

@nvtx.annotate("forward_pass")
def forward(model, x):
    return model(x)

@nvtx.annotate("backward_pass")
def backward(loss):
    loss.backward()

# Or use context managers
with nvtx.annotate("data_loading"):
    batch = next(dataloader)
Then profile:
wafer nvidia nsys profile --trace cuda,nvtx "python train.py"
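
If installing the standalone nvtx package is not an option, PyTorch ships NVTX bindings under torch.cuda.nvtx. A minimal sketch of the same idea using range_push/range_pop (the model, batch, optimizer, and loss_fn names are placeholders):

import torch

def train_step(model, batch, optimizer, loss_fn):
    torch.cuda.nvtx.range_push("forward")    # open an NVTX range
    output = model(batch["x"])
    loss = loss_fn(output, batch["y"])
    torch.cuda.nvtx.range_pop()              # close it

    torch.cuda.nvtx.range_push("backward")
    loss.backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer")
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.nvtx.range_pop()

These ranges appear on the NVTX row of the timeline when profiling with --trace cuda,nvtx.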

Comparing with NCU

Aspect      Nsys                   NCU
Scope       System-wide            Single kernel
Overhead    Low                    High
Detail      Timeline, API calls    Hardware counters
Use case    Find bottlenecks       Optimize kernels
Workflow: Use nsys first to identify slow kernels, then NCU for deep analysis.

Remote Profiling

Profile on remote GPU targets:
# Profile on configured target
wafer nvidia nsys profile --target my-h100 "python train.py"

# Profile in workspace
wafer workspaces exec my-workspace "wafer nvidia nsys profile 'python train.py'"

Troubleshooting

If nsys is not found, install NVIDIA Nsight Systems:
  • Download it from NVIDIA Developer
  • Or install it with the CUDA Toolkit
  • Ensure nsys is on your PATH

If the capture fails with permission errors on Linux, you may need elevated privileges:
sudo wafer nvidia nsys profile "python train.py"
Or lower the kernel's perf_event paranoid level:
sudo sh -c 'echo 1 > /proc/sys/kernel/perf_event_paranoid'

If traces are too large or slow to open, limit the capture duration or scope:
wafer nvidia nsys profile --duration 10 --trace cuda "python train.py"

Next Steps