Understanding PTX and SASS
The Compiler Explorer generates two types of assembly output. Understanding these helps you optimize your CUDA kernels at the lowest level.
PTX (Parallel Thread Execution)
PTX is NVIDIA’s virtual instruction set, an intermediate representation between your CUDA code and the actual GPU machine code.
What PTX Shows You
- How your code maps to GPU operations
- Virtual register usage (the final allocation happens when PTX is compiled to SASS)
- Memory access patterns (global, shared, local)
- Control flow structure
PTX Characteristics
| Aspect | Description |
|---|---|
| Architecture | Virtual—same PTX runs on different GPUs |
| Registers | Virtual registers (unlimited) |
| Readability | More readable than SASS |
| Optimization | Some optimizations applied, but not final |
Example PTX
Key PTX Instructions
| Instruction | Meaning |
|---|---|
| `ld.global` | Load from global memory |
| `st.global` | Store to global memory |
| `ld.shared` | Load from shared memory |
| `add.f32` | Single-precision floating-point add |
| `mul.f32` | Single-precision floating-point multiply |
| `fma.rn.f32` | Fused multiply-add |
| `mov` | Move/copy data |
| `mad` | Multiply-add (integer) |
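As a rough illustration of where the instructions in the table above come from, the sketch below shows a small SAXPY-style kernel with comments naming the PTX each line typically lowers to. The names `saxpy` and `run_saxpy` are made up for this example, and the exact PTX depends on the compiler version and flags, so treat the comments as expectations to check against the PTX panel, not as guaranteed output.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel for illustration: y[i] = a * x[i] + y[i]
__global__ void saxpy(int n, float a, const float* x, float* y) {
    // Thread and block indices come from special registers (%tid.x, %ctaid.x,
    // %ntid.x), usually read with mov; the index math often becomes an integer mad.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {              // comparison (setp) feeding a predicated branch
        float xi = x[i];      // typically ld.global.f32
        float yi = y[i];      // typically ld.global.f32
        // With default nvcc settings the multiply and add are usually
        // contracted into one fma.rn.f32, followed by st.global.f32.
        y[i] = a * xi + yi;
    }
}

// Hypothetical host helper: copies data to the GPU, runs saxpy, returns the result.
// Error checking is omitted to keep the sketch short.
std::vector<float> run_saxpy(float a, const std::vector<float>& x, std::vector<float> y) {
    int n = static_cast<int>(x.size());
    if (n == 0) return y;
    float* dx = nullptr;
    float* dy = nullptr;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, a, dx, dy);
    cudaMemcpy(y.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    return y;
}
```

Pasting only the kernel into the editor and searching the PTX panel for `ld.global`, `fma.rn.f32`, and `st.global` is a quick way to confirm which of these expectations hold for your compiler.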
SASS (Shader Assembly)
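SASS is the actual machine code that runs on your specific GPU. It’s generated from PTX by the ptxas assembler (invoked by nvcc) or by the driver’s JIT compiler; nvdisasm then disassembles the compiled binary into the readable listing shown here.
What SASS Shows You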
- The actual instructions your GPU executes
- Real register allocation (limited physical registers)
- Architecture-specific optimizations
- The final instruction order chosen by the compiler (latencies themselves are not listed in the disassembly)
SASS Characteristics
| Aspect | Description |
|---|---|
| Architecture | Specific to GPU generation (sm_90, sm_80, etc.) |
| Registers | Physical registers (limited per thread) |
| Readability | Less readable, more cryptic |
| Optimization | Final optimized code |
Example SASS
Key SASS Instructions
| Instruction | Meaning |
|---|---|
| `LDG` | Load from global memory |
| `STG` | Store to global memory |
| `LDS` | Load from shared memory |
| `STS` | Store to shared memory |
| `FADD` | Floating-point add |
| `FMUL` | Floating-point multiply |
| `FFMA` | Fused floating-point multiply-add |
| `IMAD` | Integer multiply-add |
| `S2R` | Special register read (thread ID, etc.) |
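To see several of the instructions in the table above in one place, you could compile a kernel that stages data through shared memory. The sketch below is a hypothetical example (`reverse_and_scale`, `run_reverse_and_scale`, and the block size `kBlock` are names chosen here, not part of the tool). The SASS mnemonics in the comments are what such code typically produces; the actual listing varies by GPU architecture and compiler version.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlock = 128;  // threads per block, an assumption of this sketch

// Reverses each block-sized chunk of the input and multiplies it by scale.
__global__ void reverse_and_scale(const float* in, float* out, float scale) {
    __shared__ float tile[kBlock];
    int t = threadIdx.x;             // typically an S2R read of the thread ID
    int base = blockIdx.x * kBlock;  // block ID read plus integer math (often IMAD)
    tile[t] = in[base + t];          // LDG from global memory, then STS to shared
    __syncthreads();                 // block-wide barrier
    // LDS from shared memory, FMUL by scale, STG back to global memory.
    out[base + t] = tile[kBlock - 1 - t] * scale;
}

// Hypothetical host helper. It only handles inputs whose size is a non-zero
// multiple of kBlock and returns zeros otherwise; error checking is omitted.
std::vector<float> run_reverse_and_scale(const std::vector<float>& in, float scale) {
    std::vector<float> out(in.size(), 0.0f);
    if (in.empty() || in.size() % kBlock != 0) return out;
    size_t bytes = in.size() * sizeof(float);
    float* din = nullptr;
    float* dout = nullptr;
    cudaMalloc(&din, bytes);
    cudaMalloc(&dout, bytes);
    cudaMemcpy(din, in.data(), bytes, cudaMemcpyHostToDevice);
    int blocks = static_cast<int>(in.size() / kBlock);
    reverse_and_scale<<<blocks, kBlock>>>(din, dout, scale);
    cudaMemcpy(out.data(), dout, bytes, cudaMemcpyDeviceToHost);
    cudaFree(din);
    cudaFree(dout);
    return out;
}
```

The multiplication uses a runtime `scale` argument on purpose: multiplying by a constant such as 2.0f is often rewritten by the compiler as an add, which would show up as `FADD` instead of `FMUL`.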
Comparing PTX and SASS
| Aspect | PTX | SASS |
|---|---|---|
| Level | Intermediate | Final |
| Portability | Cross-GPU | GPU-specific |
| Optimization | Partial | Complete |
| Use case | Understanding logic | Performance analysis |
For performance optimization, SASS is more relevant because it shows what actually executes. PTX is useful for understanding the logical structure of your kernel.
Using the Side-by-Side View
The side-by-side view lets you compare different outputs:
- Click the side-by-side icon in the results header
- Select what to show in each panel (Source, PTX, or SASS)
- Scroll through to see how source code maps to assembly
What to Look For
Memory Access Patterns
Look for coalesced vs. scattered memory accesses:
- `LDG.E` where neighboring threads read consecutive addresses = coalesced (good)
- Many `LDG.E` with varying offsets = scattered (potential issue)
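The difference between the two cases is easier to see at the source level. The sketch below uses a hypothetical `gather` kernel (and a made-up host helper `run_gather`) in which a `stride` parameter controls how far apart the addresses read by neighboring threads are. Note that both cases use the same load instruction in SASS; what you would compare in the listing is the address arithmetic that feeds it.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread i reads in[(i * stride) % n] and writes out[i].
__global__ void gather(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // stride == 1: neighboring threads read neighboring elements (coalesced).
        // stride > 1: neighboring threads read elements stride apart (scattered).
        long long src = (static_cast<long long>(i) * stride) % n;
        out[i] = in[src];  // the write to out[i] is coalesced in both cases
    }
}

// Hypothetical host helper; error checking is omitted to keep the sketch short.
std::vector<float> run_gather(const std::vector<float>& in, int stride) {
    int n = static_cast<int>(in.size());
    std::vector<float> out(in.size(), 0.0f);
    if (n == 0) return out;
    size_t bytes = in.size() * sizeof(float);
    float* din = nullptr;
    float* dout = nullptr;
    cudaMalloc(&din, bytes);
    cudaMalloc(&dout, bytes);
    cudaMemcpy(din, in.data(), bytes, cudaMemcpyHostToDevice);
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    gather<<<blocks, threads>>>(din, dout, n, stride);
    cudaMemcpy(out.data(), dout, bytes, cudaMemcpyDeviceToHost);
    cudaFree(din);
    cudaFree(dout);
    return out;
}
```

With `stride` set to 1 the kernel is a plain copy; larger strides spread each warp's reads across more memory segments, which is the pattern to watch for when a kernel turns out slower than expected.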
Register Pressure
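High register usage limits occupancy. Look at the register declarations in PTX (`.reg`) for a rough indication, keeping in mind that PTX registers are virtual; the per-thread count that actually limits occupancy is the one in the final SASS. Compare that count to your GPU’s register file size.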
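If you can run code on the target GPU, the CUDA runtime can report the physical register count directly. The sketch below is one way to do that, using `cudaFuncGetAttributes` and `cudaGetDeviceProperties`; the kernel `scale_kernel`, the struct `RegisterReport`, and the helper `query_register_usage` are hypothetical names for this example. The resulting thread bound is deliberately simplistic: it ignores register allocation granularity and the other limits on occupancy (shared memory, block size, maximum threads per SM).

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel whose register usage we want to inspect.
__global__ void scale_kernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

struct RegisterReport {
    int regs_per_thread;      // registers per thread used by the compiled kernel
    int regs_per_sm;          // 32-bit registers available per multiprocessor
    int max_threads_by_regs;  // rough upper bound on resident threads per SM
};

// Returns -1 in every field if any CUDA call fails.
RegisterReport query_register_usage() {
    RegisterReport report{-1, -1, -1};

    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, scale_kernel) != cudaSuccess) return report;

    int device = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return report;

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return report;

    report.regs_per_thread = attr.numRegs;
    report.regs_per_sm = prop.regsPerMultiprocessor;
    if (attr.numRegs > 0) {
        // Simplified estimate: register file size divided by per-thread usage.
        report.max_threads_by_regs = prop.regsPerMultiprocessor / attr.numRegs;
    }
    return report;
}
```

The same per-thread number is printed at compile time when you pass `-Xptxas -v` to nvcc, which is convenient when no GPU is available.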
Instruction Mix
A healthy kernel has a balance of compute and memory instructions. Too many memory ops relative to compute may indicate memory-bound code.
Control Flow
Look for `@P0` predicated instructions and branch instructions. Divergent control flow can hurt performance.
SASS Unavailable
If SASS output is not available, check that nvdisasm is installed and in your PATH. It’s included with the CUDA Toolkit.
PTX is always available when nvcc compiles successfully. SASS requires the additional nvdisasm tool.