Analyzing NCU Reports

Learn how to open, navigate, and understand NVIDIA Nsight Compute (NCU) performance reports in Wafer.

Opening a Report

1. Select NCU Profiler

Open the Wafer panel and select NCU Profiler from the tool dropdown.

2. Click Select File

Click the Select .ncu-rep file button in the tool panel.

3. Choose Your Report

Navigate to and select an .ncu-rep file. The report will load and display in the panel.

Understanding the Interface

Kernel Summary

When you open a report, you’ll see a table listing all profiled kernels with key metrics:
Column       Description
Kernel Name  The CUDA kernel function name
Duration     Execution time in microseconds (µs)
Memory %     Memory throughput as percentage of peak
Compute %    Compute throughput as percentage of peak
Occupancy    Achieved occupancy percentage
Registers    Registers used per thread
Block Size   Threads per block
Grid Size    Total number of blocks
Click on column headers to sort kernels by that metric. Click a kernel row to select it and view detailed diagnostics.

Sorting and Selection

  • Click a column header to sort by that metric
  • Click a kernel row to select it and view its diagnostics
  • Right-click a kernel for additional options (copy to clipboard, save)

Performance Diagnostics

The diagnostics panel shows optimization recommendations for the selected kernel:
  • Bottleneck identification — What’s limiting performance (compute, memory, latency)
  • Actionable recommendations — Specific suggestions for improvement
  • Metric context — Explanation of why certain metrics matter
Diagnostics are generated by analyzing the NCU report data. Expand the panel to see full recommendations.

Key Metrics Explained

Duration

The total execution time of the kernel. Lower is better, but raw duration alone doesn’t tell you if the kernel is efficient.

Memory Throughput %

How much of the GPU’s peak memory throughput the kernel achieves. A high percentage suggests the kernel is memory-bound; in that case, focus on memory access patterns, for example coalescing accesses and reusing data.

Compute Throughput %

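How much of the GPU’s peak compute throughput the kernel achieves. A high percentage suggests the kernel is compute-bound; in that case, reduce or simplify the arithmetic work. Raising occupancy mainly helps when throughput is low because of latency, not when the compute units are already saturated.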
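The two throughput percentages are easiest to interpret together. The sketch below is one way to express that rule of thumb in Python. The function name `classify_bottleneck` and the 60% cutoff are assumptions chosen for illustration, not values defined by Wafer, so treat the labels as a first guess rather than a verdict.

```python
def classify_bottleneck(memory_pct: float, compute_pct: float,
                        high: float = 60.0) -> str:
    """Label a kernel from its Memory % and Compute % columns.

    `high` is a hypothetical cutoff for "high" utilization; tune it
    for your own workloads.
    """
    memory_high = memory_pct >= high
    compute_high = compute_pct >= high
    if memory_high and compute_high:
        return "balanced"        # both memory and compute are busy
    if memory_high:
        return "memory-bound"
    if compute_high:
        return "compute-bound"
    return "latency-bound"       # neither is close to its peak


print(classify_bottleneck(85.0, 20.0))
print(classify_bottleneck(15.0, 25.0))
```

A kernel that lands in the last category is usually worth checking for low occupancy before anything else.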

Achieved Occupancy

The average number of active warps per SM, as a percentage of the maximum number of warps the SM supports. Low occupancy may indicate:
  • Too many registers per thread
  • Too much shared memory per block
  • Small grid size

Registers per Thread

The number of registers used by each thread. High register usage can limit occupancy. Consider:
  • Using __launch_bounds__ to hint register limits
  • Moving data to shared memory
  • Simplifying computations
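To see why register usage can cap occupancy, it can help to estimate how many blocks fit on one SM. The following is a minimal sketch, not how Wafer or Nsight Compute computes the metric. The per-SM limits (65,536 registers, 64 warps, 32 blocks) are assumed defaults that vary by GPU architecture, and shared memory and register allocation granularity are ignored. The result is an upper bound; achieved occupancy can be lower.

```python
def estimate_occupancy(registers_per_thread: int, block_size: int,
                       regs_per_sm: int = 65536,
                       max_warps_per_sm: int = 64,
                       max_blocks_per_sm: int = 32,
                       warp_size: int = 32) -> float:
    """Estimate register-limited occupancy as a fraction in [0, 1].

    The per-SM limits are assumed defaults; check the values for
    your GPU before relying on the result.
    """
    warps_per_block = -(-block_size // warp_size)  # ceiling division
    regs_per_block = registers_per_thread * warps_per_block * warp_size
    blocks_by_regs = regs_per_sm // regs_per_block
    blocks_by_warps = max_warps_per_sm // warps_per_block
    resident_blocks = min(blocks_by_regs, blocks_by_warps, max_blocks_per_sm)
    return resident_blocks * warps_per_block / max_warps_per_sm


# Compare a few register counts at a fixed block size of 256 threads.
for regs in (32, 64, 128):
    print(regs, estimate_occupancy(regs, block_size=256))
```

Under these assumptions, doubling the registers per thread from 64 to 128 halves the number of resident blocks, which is the effect the Registers and Occupancy columns let you spot in the kernel table.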

Tips for Analysis

Sort by duration to identify the kernels that take the most time. These are your optimization targets.
If memory throughput is high but compute is low, you’re memory-bound. If compute is high but memory is low, you’re compute-bound. If both are low, the kernel is likely latency-bound, which often points to low occupancy or stalls.
Low occupancy often indicates register pressure or shared memory limits. Check the registers-per-thread metric.
The diagnostics panel provides specific recommendations based on the metrics. Use these as starting points for optimization.

Exporting Results

You can export kernel data for use in other tools:
  • Right-click a kernel and select Copy as CSV to copy metrics to clipboard
  • Use the export button to save all kernel data
This is useful for:
  • Tracking performance over time
  • Sharing results in issues or PRs
  • Importing into spreadsheets for analysis
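Once kernel data is exported, a short script can rank kernels by their share of total time. The sketch below assumes a CSV with `Kernel Name` and `Duration` columns; the exact layout of the exported file is not documented here, so the column names and the sample rows are hypothetical and may need adjusting.

```python
import csv
import io

# Made-up sample standing in for an exported file; the real column
# names and layout may differ.
SAMPLE_CSV = """Kernel Name,Duration
gemm_kernel,120.0
softmax_kernel,30.0
copy_kernel,50.0
"""


def rank_by_duration(csv_text: str) -> list[tuple[str, float, float]]:
    """Return (kernel name, duration in µs, share of total), slowest first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    total = sum(float(row["Duration"]) for row in rows)
    ranked = sorted(rows, key=lambda row: float(row["Duration"]), reverse=True)
    return [(row["Kernel Name"], float(row["Duration"]),
             float(row["Duration"]) / total) for row in ranked]


for name, duration_us, share in rank_by_duration(SAMPLE_CSV):
    print(f"{name}: {duration_us:.1f} µs ({share:.0%} of total)")
```

To use it with a real export, read the file contents into a string and pass that to `rank_by_duration` in place of `SAMPLE_CSV`.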

Next Steps