Analyzing NCU Reports
Learn how to open, navigate, and understand Nsight Compute performance reports in Wafer.
Opening a Report
1. Select NCU Profiler
   Open the Wafer panel and select NCU Profiler from the tool dropdown.
2. Click Select File
   Click the Select .ncu-rep file button in the tool panel.
3. Choose Your Report
   Navigate to and select an `.ncu-rep` file. The report loads and displays in the panel.
Understanding the Interface
Kernel Summary
When you open a report, you’ll see a table listing all profiled kernels with key metrics:

| Column | Description |
|---|---|
| Kernel Name | The CUDA kernel function name |
| Duration | Execution time in microseconds (µs) |
| Memory % | Memory throughput as percentage of peak |
| Compute % | Compute throughput as percentage of peak |
| Occupancy | Achieved occupancy percentage |
| Registers | Registers used per thread |
| Block Size | Threads per block |
| Grid Size | Total number of blocks |
Sorting and Selection
- Click a column header to sort by that metric
- Click a kernel row to select it and view its diagnostics
- Right-click a kernel for additional options (copy to clipboard, save)
Performance Diagnostics
The diagnostics panel shows optimization recommendations for the selected kernel:
- Bottleneck identification: what’s limiting performance (compute, memory, or latency)
- Actionable recommendations: specific suggestions for improvement
- Metric context: an explanation of why certain metrics matter
Diagnostics are generated by analyzing the NCU report data. Expand the panel to see full recommendations.
Key Metrics Explained
Duration
The total execution time of the kernel. Lower is better, but raw duration alone doesn’t tell you whether the kernel is efficient.
Memory Throughput %
How much of the GPU’s theoretical peak memory bandwidth the kernel uses. A high percentage suggests the kernel is memory-bound; optimize memory access patterns.
Compute Throughput %
How much of the GPU’s peak compute throughput the kernel uses. A high percentage suggests the kernel is compute-bound; optimize arithmetic operations or increase occupancy.
Achieved Occupancy
The average number of active warps, as a percentage of the maximum the hardware supports. Low occupancy may indicate:
- Too many registers per thread
- Too much shared memory per block
- Small grid size
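To see how registers per thread can cap occupancy, the sketch below estimates the occupancy ceiling imposed by register usage alone. The per-SM limits (`REGISTERS_PER_SM`, `MAX_WARPS_PER_SM`, `REG_ALLOC_UNIT`) are assumed values for illustration and vary by GPU architecture. The function `register_limited_occupancy` is a hypothetical helper, not part of Wafer or Nsight Compute.

```python
# Assumed per-SM hardware limits; check the specifications for your GPU.
REGISTERS_PER_SM = 65536   # assumed number of 32-bit registers per SM
MAX_WARPS_PER_SM = 64      # assumed maximum resident warps per SM
WARP_SIZE = 32
REG_ALLOC_UNIT = 256       # assumed per-warp register allocation granularity


def register_limited_occupancy(registers_per_thread: int, block_size: int) -> float:
    """Estimate the occupancy ceiling (0.0 to 1.0) imposed by registers alone."""
    if registers_per_thread <= 0 or block_size <= 0:
        raise ValueError("registers_per_thread and block_size must be positive")
    warps_per_block = -(-block_size // WARP_SIZE)  # ceiling division
    # Registers are allocated per warp, rounded up to the allocation unit.
    regs_per_warp = registers_per_thread * WARP_SIZE
    regs_per_warp = -(-regs_per_warp // REG_ALLOC_UNIT) * REG_ALLOC_UNIT
    warps_by_registers = REGISTERS_PER_SM // regs_per_warp
    # Only whole blocks can be resident, so round down to full blocks.
    blocks = min(warps_by_registers, MAX_WARPS_PER_SM) // warps_per_block
    return blocks * warps_per_block / MAX_WARPS_PER_SM


for regs in (32, 64, 128):
    pct = 100 * register_limited_occupancy(regs, block_size=256)
    print(f"{regs} registers/thread -> at most {pct:.0f}% occupancy")
```

This ignores shared memory and block-count limits, so the real ceiling may be lower. Compare the result with the Occupancy column to judge whether registers are the likely limiter.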
Registers per Thread
The number of registers used by each thread. High register usage can limit occupancy. Consider:
- Using `__launch_bounds__` to hint register limits
- Moving data to shared memory
- Simplifying computations
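As a rough guide when choosing `__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)` values, the sketch below computes the per-thread register budget that would let the requested number of blocks stay resident on one SM. The defaults of 65536 registers per SM and 255 registers per thread are assumptions that depend on the architecture. `register_budget` is a hypothetical helper, and the result is an upper bound that ignores allocation granularity.

```python
def register_budget(max_threads_per_block: int,
                    min_blocks_per_sm: int,
                    registers_per_sm: int = 65536,      # assumed SM register file size
                    max_regs_per_thread: int = 255) -> int:  # assumed per-thread cap
    """Upper bound on registers per thread for the requested residency."""
    if max_threads_per_block <= 0 or min_blocks_per_sm <= 0:
        raise ValueError("launch bound values must be positive")
    resident_threads = max_threads_per_block * min_blocks_per_sm
    return min(registers_per_sm // resident_threads, max_regs_per_thread)


# Example: 256 threads per block with 2 resident blocks per SM.
print(register_budget(256, 2))
```

If the Registers column shows a value above this budget, the kernel cannot reach the intended residency without reducing register usage.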
Tips for Analysis
Start with the slowest kernels
Sort by duration to identify the kernels that take the most time. These are your optimization targets.
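The same ranking can be reproduced outside the panel. This minimal sketch sorts kernels by duration and adds each kernel's share of total time, which helps show how much an optimization could gain overall. The kernel names and durations are made up for illustration, and `rank_by_duration` is a hypothetical helper.

```python
def rank_by_duration(kernels):
    """Return (name, duration_us, share_pct) tuples, slowest kernel first."""
    total = sum(k["duration_us"] for k in kernels)
    ranked = sorted(kernels, key=lambda k: k["duration_us"], reverse=True)
    return [
        (k["name"], k["duration_us"],
         100.0 * k["duration_us"] / total if total else 0.0)
        for k in ranked
    ]


# Hypothetical kernels, not taken from a real report.
kernels = [
    {"name": "vector_add", "duration_us": 50.0},
    {"name": "gemm_tiled", "duration_us": 800.0},
    {"name": "reduce_sum", "duration_us": 150.0},
]

for name, duration, share in rank_by_duration(kernels):
    print(f"{name:12s} {duration:8.1f} us  {share:5.1f}% of total")
```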
Look at throughput balance
If memory throughput is high but compute is low, you’re memory-bound. If compute is high but memory is low, you’re compute-bound. Balance both for optimal performance.
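The rule of thumb above can be written as a small classifier. The 60% threshold is an arbitrary assumption chosen for illustration, not a value used by Wafer or Nsight Compute, and `classify_bottleneck` is a hypothetical helper. Treating two low percentages as latency-bound is also an assumption, based on the bottleneck categories the diagnostics panel lists.

```python
def classify_bottleneck(memory_pct: float, compute_pct: float,
                        high: float = 60.0) -> str:
    """Label a kernel from its Memory % and Compute % values.

    `high` is an assumed cutoff for "high" throughput; tune it for your GPU.
    """
    memory_high = memory_pct >= high
    compute_high = compute_pct >= high
    if memory_high and compute_high:
        return "balanced"
    if memory_high:
        return "memory-bound"
    if compute_high:
        return "compute-bound"
    # Neither unit is busy: stalls or low occupancy are the likely cause.
    return "latency-bound"


print(classify_bottleneck(memory_pct=85.0, compute_pct=20.0))
```

Use the label as a starting point only; the diagnostics panel has access to far more metrics than these two percentages.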
Check occupancy limiters
Low occupancy often indicates register pressure or shared memory limits. Check the registers-per-thread metric.
Use diagnostics for guidance
The diagnostics panel provides specific recommendations based on the metrics. Use these as starting points for optimization.
Exporting Results
You can export kernel data for use in other tools:
- Right-click a kernel and select Copy as CSV to copy metrics to clipboard
- Use the export button to save all kernel data
Exported data is useful for:
- Tracking performance over time
- Sharing results in issues or PRs
- Importing into spreadsheets for analysis
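As one way to track performance over time, the sketch below compares kernel durations between two exports. It assumes the CSV has a header row with `Kernel Name` and `Duration` columns matching the Kernel Summary table; the actual export format may differ, so adjust the column names to what Wafer produces. The sample data and the helpers `load_durations` and `duration_change` are hypothetical.

```python
import csv
import io

# Hypothetical exports; column names are assumed to match the summary table.
BASELINE_CSV = """Kernel Name,Duration
vector_add,12.0
gemm_tiled,800.0
"""
CURRENT_CSV = """Kernel Name,Duration
vector_add,12.6
gemm_tiled,600.0
"""


def load_durations(text: str) -> dict:
    """Map kernel name to duration in microseconds."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["Kernel Name"]: float(row["Duration"]) for row in reader}


def duration_change(baseline: dict, current: dict) -> dict:
    """Percent change in duration for kernels present in both exports."""
    return {
        name: 100.0 * (current[name] - baseline[name]) / baseline[name]
        for name in baseline
        if name in current and baseline[name] > 0
    }


changes = duration_change(load_durations(BASELINE_CSV), load_durations(CURRENT_CSV))
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

Negative values mean the kernel got faster. To read a saved export instead of an inline string, pass the file's contents to `load_durations`.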