
Nsight Systems

NVIDIA Nsight Systems (nsys) provides system-wide profiling to understand CPU-GPU interaction, kernel launches, memory transfers, and overall application behavior. Wafer integrates nsys for easy profiling and analysis.

Quick Start

# Check nsys installation
wafer nvidia nsys check

# Profile a command
wafer nvidia nsys profile "python train.py"

# Analyze an existing trace
wafer nvidia nsys analyze ./profile.nsys-rep

Commands

wafer nvidia nsys check

Verify nsys installation and version:
wafer nvidia nsys check
Output:
Nsight Systems: installed
Version: 2024.1.1
Path: /usr/local/cuda/bin/nsys

wafer nvidia nsys profile

Capture a system-wide profile:
wafer nvidia nsys profile [OPTIONS] "<command>"
Options:
Option            Description
--output, -o      Output file path (default: profile.nsys-rep)
--trace           What to trace: cuda, nvtx, osrt, cudnn, etc.
--duration        Maximum capture duration in seconds
--delay           Delay before starting capture
--sample          Enable CPU sampling
--target, -t      Run on a remote GPU target
Examples:
# Basic profile
wafer nvidia nsys profile "python train.py"

# Profile with NVTX markers and CUDA API
wafer nvidia nsys profile --trace cuda,nvtx "python train.py"

# Profile for 30 seconds max
wafer nvidia nsys profile --duration 30 "python server.py"

# Profile on remote target
wafer nvidia nsys profile --target h100-box "python train.py"

wafer nvidia nsys analyze

Analyze an nsys trace file:
wafer nvidia nsys analyze [OPTIONS] <trace-file>
Options:
OptionDescription
--summaryShow summary statistics
--kernelsList kernel execution times
--transfersShow memory transfer analysis
--jsonOutput as JSON
Example:
wafer nvidia nsys analyze ./profile.nsys-rep --summary
Output:
Profile Summary
===============
Duration: 12.34s
GPU Kernels: 1,234
Memory Transfers: 567

Top Kernels by Time:
  1. volta_sgemm_128x64_nn   45.2% (5.58s)
  2. elementwise_kernel      23.1% (2.85s)
  3. reduce_kernel           12.3% (1.52s)

Memory Transfers:
  Host→Device: 2.3 GB (234 transfers)
  Device→Host: 1.1 GB (123 transfers)
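
Pass --json to post-process the results in scripts. The JSON schema is not documented here, so the field names in this sketch ("kernels", "name", "time_pct") and the output path analysis.json are assumptions for illustration only:

import json

# Load a report saved from:
#   wafer nvidia nsys analyze ./profile.nsys-rep --json > analysis.json
# NOTE: "kernels", "name", and "time_pct" are assumed field names;
# check the actual schema emitted on your system.
with open("analysis.json") as f:
    report = json.load(f)

for kernel in report.get("kernels", []):
    print(f'{kernel["name"]}: {kernel["time_pct"]}% of GPU time')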

What Nsys Captures

Nsight Systems traces multiple aspects of your application:
  • CUDA API calls: cudaMemcpy, cudaMalloc, kernel launches
  • GPU kernels: Execution time, grid/block dimensions
  • Memory transfers: H2D, D2H, D2D copies
  • CPU activity: Thread scheduling, function calls
  • NVTX markers: Custom annotations in your code
  • cuDNN/cuBLAS: Library call timings
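
As a concrete illustration, even a few lines of PyTorch (hypothetical tensor sizes, shown only as a sketch) touch several of these categories: the .to("cuda") calls appear as Host→Device transfers, the matmul and ReLU launch GPU kernels, and the surrounding allocation and launch calls show up on the CUDA API rows of the timeline.

import torch

# Hypothetical tensor sizes, for illustration only.
x = torch.randn(1024, 1024)      # created on the CPU
w = torch.randn(1024, 1024)

x = x.to("cuda")                 # appears as a Host→Device memcpy
w = w.to("cuda")

y = torch.relu(x @ w)            # matmul + elementwise kernels on the GPU
y_cpu = y.cpu()                  # appears as a Device→Host memcpy
torch.cuda.synchronize()         # ensure all GPU work lands in the trace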

Adding NVTX Markers

Annotate your code for better trace visibility:
import torch
import nvtx  # pip install nvtx

@nvtx.annotate("forward_pass")
def forward(model, x):
    return model(x)

@nvtx.annotate("backward_pass")
def backward(loss):
    loss.backward()

# Or use context managers
with nvtx.annotate("data_loading"):
    batch = next(dataloader)
Then profile:
wafer nvidia nsys profile --trace cuda,nvtx "python train.py"
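
If installing the standalone nvtx package is not an option, PyTorch ships NVTX bindings under torch.cuda.nvtx. A minimal sketch of the same idea using range_push/range_pop (the model, batch, optimizer, and loss_fn names are placeholders):

import torch

def train_step(model, batch, optimizer, loss_fn):
    torch.cuda.nvtx.range_push("forward")    # open an NVTX range
    output = model(batch["x"])
    loss = loss_fn(output, batch["y"])
    torch.cuda.nvtx.range_pop()              # close it

    torch.cuda.nvtx.range_push("backward")
    loss.backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer")
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.nvtx.range_pop()

These ranges appear on the NVTX row of the timeline when profiling with --trace cuda,nvtx.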

Comparing with NCU

Aspect      Nsys                   NCU
Scope       System-wide            Single kernel
Overhead    Low                    High
Detail      Timeline, API calls    Hardware counters
Use case    Find bottlenecks       Optimize kernels
Workflow: Use nsys first to identify slow kernels, then NCU for deep analysis.

Remote Profiling

Profile on remote GPU targets:
# Profile on configured target
wafer nvidia nsys profile --target my-h100 "python train.py"

# Profile in workspace
wafer workspaces exec my-workspace "wafer nvidia nsys profile 'python train.py'"

Troubleshooting

If nsys is not found, install NVIDIA Nsight Systems:
  • Download it from NVIDIA Developer
  • Or install it with the CUDA Toolkit
  • Ensure nsys is on your PATH

If the capture fails with permission errors on Linux, you may need elevated privileges:
sudo wafer nvidia nsys profile "python train.py"
Or lower the kernel's perf_event paranoid level:
sudo sh -c 'echo 1 > /proc/sys/kernel/perf_event_paranoid'

If traces are too large or slow to open, limit the capture duration or scope:
wafer nvidia nsys profile --duration 10 --trace cuda "python train.py"

Next Steps