
Understanding PTX and SASS

The Compiler Explorer generates two types of assembly output. Understanding these helps you optimize your CUDA kernels at the lowest level.

PTX (Parallel Thread Execution)

PTX is NVIDIA’s virtual instruction set—an intermediate representation between your CUDA code and the actual GPU machine code.

What PTX Shows You

  • How your code maps to GPU operations
  • Virtual register usage (physical allocation happens later, when SASS is generated)
  • Memory access patterns (global, shared, local)
  • Control flow structure

PTX Characteristics

| Aspect | Description |
| --- | --- |
| Architecture | Virtual: the same PTX can be compiled for different GPU generations |
| Registers | Virtual registers (effectively unlimited) |
| Readability | More readable than SASS |
| Optimization | Some optimizations applied, but not final |

Example PTX

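// Simple vector add kernel
.visible .entry vectorAdd(
    .param .u64 vectorAdd_param_0,  // float* a
    .param .u64 vectorAdd_param_1,  // float* b
    .param .u64 vectorAdd_param_2,  // float* c
    .param .u32 vectorAdd_param_3   // int n
)
{
    .reg .pred %p<2>;
    .reg .f32 %f<4>;
    .reg .b32 %r<6>;
    .reg .b64 %rd<11>;

    // Load kernel parameters
    ld.param.u64 %rd1, [vectorAdd_param_0];
    ld.param.u64 %rd2, [vectorAdd_param_1];
    ld.param.u64 %rd3, [vectorAdd_param_2];
    ld.param.u32 %r5, [vectorAdd_param_3];

    // Calculate thread index: i = blockIdx.x * blockDim.x + threadIdx.x
    mov.u32 %r1, %tid.x;
    mov.u32 %r2, %ctaid.x;
    mov.u32 %r3, %ntid.x;
    mad.lo.s32 %r4, %r2, %r3, %r1;

    // Bounds check: skip the body when i >= n
    setp.ge.s32 %p1, %r4, %r5;
    @%p1 bra DONE;

    // Compute the addresses of a[i], b[i], and c[i]
    mul.wide.s32 %rd10, %r4, 4;
    cvta.to.global.u64 %rd5, %rd1;
    add.s64 %rd4, %rd5, %rd10;
    cvta.to.global.u64 %rd7, %rd2;
    add.s64 %rd6, %rd7, %rd10;
    cvta.to.global.u64 %rd9, %rd3;
    add.s64 %rd8, %rd9, %rd10;

    // Load and add
    ld.global.f32 %f1, [%rd4];
    ld.global.f32 %f2, [%rd6];
    add.f32 %f3, %f1, %f2;
    st.global.f32 [%rd8], %f3;

DONE:
    ret;
}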
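
The `.reg` lines at the top of a PTX kernel declare virtual registers per type, which is why they are only a rough indicator of register pressure. As a minimal sketch, the Python snippet below totals those declarations. It assumes the parameterized form `.reg .type %name<N>;` seen in the listing; the helper name `count_virtual_registers` and the sample string are illustrative, not part of any tool.

```python
import re

# Matches parameterized declarations such as ".reg .f32 %f<4>;"
REG_DECL = re.compile(r"\.reg\s+\.(\w+)\s+%(\w+)<(\d+)>\s*;")

def count_virtual_registers(ptx: str) -> dict:
    """Return {type: count} for parameterized .reg declarations in a PTX listing."""
    counts = {}
    for reg_type, _name, n in REG_DECL.findall(ptx):
        counts[reg_type] = counts.get(reg_type, 0) + int(n)
    return counts

# Sample declarations in the same form as the PTX example
sample_ptx = """
    .reg .pred %p<2>;
    .reg .f32 %f<4>;
    .reg .b32 %r<5>;
    .reg .b64 %rd<10>;
"""
print(count_virtual_registers(sample_ptx))
# {'pred': 2, 'f32': 4, 'b32': 5, 'b64': 10}
```

These totals describe virtual registers only; the physical register count is decided when SASS is generated.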

Key PTX Instructions

| Instruction | Meaning |
| --- | --- |
| ld.global | Load from global memory |
| st.global | Store to global memory |
| ld.shared | Load from shared memory |
| add.f32 | Single-precision floating-point add |
| mul.f32 | Single-precision floating-point multiply |
| fma.rn.f32 | Fused multiply-add (round to nearest even) |
| mov | Move/copy data |
| mad | Multiply-add (for example, mad.lo.s32 for integers) |

SASS (Shader Assembly)

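SASS is the actual machine code that runs on your specific GPU. It’s generated from PTX by ptxas (which nvcc runs at compile time) or by the driver’s JIT compiler at load time. nvdisasm then disassembles that binary into a readable listing.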
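
Outside the Compiler Explorer, the same two outputs can be produced with the CUDA Toolkit command-line tools. The Python sketch below only builds and prints the command lines; the file name `kernel.cu` and the target `sm_80` are assumptions for illustration, and `build_commands` is a hypothetical helper.

```python
def build_commands(source: str = "kernel.cu", arch: str = "sm_80") -> list:
    """Return command lines that produce PTX, a cubin, and a SASS listing."""
    stem = source.rsplit(".", 1)[0]
    return [
        # CUDA source -> PTX (virtual instruction set)
        ["nvcc", "-ptx", f"-arch={arch}", source, "-o", f"{stem}.ptx"],
        # CUDA source -> cubin (machine code for one architecture)
        ["nvcc", "-cubin", f"-arch={arch}", source, "-o", f"{stem}.cubin"],
        # cubin -> human-readable SASS listing
        ["nvdisasm", f"{stem}.cubin"],
    ]

for cmd in build_commands():
    print(" ".join(cmd))
```

Changing `arch` changes the SASS that comes out, because the cubin targets one specific GPU generation.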

What SASS Shows You

  • The actual instructions your GPU executes
  • Real register allocation (limited physical registers)
  • Architecture-specific optimizations
  • The real instruction sequence that determines latency and throughput (the listing itself does not print cycle counts)

SASS Characteristics

| Aspect | Description |
| --- | --- |
| Architecture | Specific to GPU generation (sm_90, sm_80, etc.) |
| Registers | Physical registers (limited per thread) |
| Readability | Less readable, more cryptic |
| Optimization | Final optimized code |

Example SASS

// Simplified listing; opcode modifiers and constant-bank offsets vary by architecture
/*0000*/  IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28]
/*0010*/  S2R R0, SR_CTAID.X
/*0020*/  S2R R3, SR_TID.X
/*0030*/  IMAD R0, R0, c[0x0][0x0], R3
/*0040*/  ISETP.GE.AND P0, PT, R0, c[0x0][0xc], PT
/*0050*/  @P0 EXIT
/*0060*/  IMAD.WIDE.U32 R2, R0, 0x4, c[0x0][0x10]
/*0070*/  IMAD.WIDE.U32 R4, R0, 0x4, c[0x0][0x18]
/*0080*/  LDG.E R2, [R2]
/*0090*/  LDG.E R3, [R4]
/*00a0*/  IMAD.WIDE.U32 R4, R0, 0x4, c[0x0][0x20]
/*00b0*/  FADD R2, R2, R3
/*00c0*/  STG.E [R4], R2
/*00d0*/  EXIT

Key SASS Instructions

| Instruction | Meaning |
| --- | --- |
| LDG | Load from global memory |
| STG | Store to global memory |
| LDS | Load from shared memory |
| STS | Store to shared memory |
| FADD | Floating-point add |
| FMUL | Floating-point multiply |
| FFMA | Fused floating-point multiply-add |
| IMAD | Integer multiply-add |
| S2R | Special register read (thread ID, etc.) |

Comparing PTX and SASS

| Aspect | PTX | SASS |
| --- | --- | --- |
| Level | Intermediate | Final |
| Portability | Cross-GPU | GPU-specific |
| Optimization | Partial | Complete |
| Use case | Understanding logic | Performance analysis |

For performance optimization, SASS is more relevant because it shows what actually executes. PTX is useful for understanding the logical structure of your kernel.

Using the Side-by-Side View

The side-by-side view lets you compare different outputs:
  1. Click the side-by-side icon in the results header
  2. Select what to show in each panel (Source, PTX, or SASS)
  3. Scroll through to see how source code maps to assembly

What to Look For

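Look for coalesced vs. scattered memory accesses. Coalescing depends on the addresses that the threads of a warp access together, so trace how each load address is computed:
  • LDG.E whose address is the thread index times the element size = consecutive addresses across the warp, coalesced (good)
  • LDG.E whose address uses a large stride or a data-dependent index = scattered (potential issue)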
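
To make the difference concrete, the sketch below counts how many memory sectors one warp touches for a given access stride. It is a simplified Python model, assuming 32 threads per warp, a 32-byte sector granularity, 4-byte elements, and an aligned base address; `sectors_touched` is a hypothetical helper, not a profiler API.

```python
WARP_SIZE = 32      # threads per warp
SECTOR_BYTES = 32   # assumed memory transaction granularity

def sectors_touched(stride_elems: int, elem_bytes: int = 4) -> int:
    """Count distinct sectors touched when thread t reads element t * stride_elems."""
    addresses = [t * stride_elems * elem_bytes for t in range(WARP_SIZE)]
    return len({addr // SECTOR_BYTES for addr in addresses})

print(sectors_touched(1))   # 4  (unit stride: 128 contiguous bytes)
print(sectors_touched(2))   # 8
print(sectors_touched(32))  # 32 (every thread lands in its own sector)
```

Under these assumptions, a unit-stride load moves only the bytes the warp needs, while a stride of 32 elements fetches eight times as many sectors for the same amount of useful data.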
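
High register usage limits occupancy. The .reg declarations in PTX count virtual registers, so they do not show the final allocation. Check the registers per thread reported for the compiled kernel (for example, with nvcc -Xptxas -v) and compare that number to your GPU’s register file size.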
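
The relationship between registers per thread and occupancy is simple arithmetic. The Python sketch below estimates the limit imposed by registers alone, assuming 65,536 registers and at most 2,048 resident threads per SM; both figures vary by GPU, and the model ignores allocation granularity, shared memory, and block-size limits. `max_resident_threads` is a hypothetical helper.

```python
def max_resident_threads(regs_per_thread: int,
                         regs_per_sm: int = 65536,
                         max_threads_per_sm: int = 2048,
                         warp_size: int = 32) -> int:
    """Upper bound on resident threads per SM imposed by register usage alone."""
    # Registers are consumed a whole warp at a time
    warps_by_regs = regs_per_sm // (regs_per_thread * warp_size)
    return min(warps_by_regs * warp_size, max_threads_per_sm)

for regs in (32, 64, 128):
    threads = max_resident_threads(regs)
    print(f"{regs} regs/thread -> {threads} threads ({threads / 2048:.0%})")
# 32 regs/thread -> 2048 threads (100%)
# 64 regs/thread -> 1024 threads (50%)
# 128 regs/thread -> 512 threads (25%)
```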
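
A healthy kernel has a balance of compute and memory instructions. Compare the number of memory instructions (LDG, STG, LDS, STS) with the number of arithmetic instructions (FADD, FFMA, IMAD). Too many memory ops relative to compute may indicate memory-bound code.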
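
Look for predicated instructions (such as @P0) and branch instructions (such as BRA). Predication is not a problem by itself, but when threads in the same warp take different paths, the warp executes each path in turn, and this divergent control flow can hurt performance.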
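
Counting instructions by hand gets tedious on longer listings. The Python sketch below tallies a SASS listing into memory, compute, control, and other instructions, and counts predicated ones. It assumes lines of the form `/*addr*/ [@P0] OPCODE ...` as in the example; the opcode sets are short illustrative lists rather than a complete classification, and `instruction_mix` is a hypothetical helper.

```python
import re
from collections import Counter

MEMORY_OPS = {"LDG", "STG", "LDS", "STS", "LDL", "STL"}
COMPUTE_OPS = {"FADD", "FMUL", "FFMA", "IMAD"}
CONTROL_OPS = {"BRA", "EXIT", "RET", "CALL"}

# /*addr*/  optional predicate (@P0 or @!P0)  base opcode before any ".modifier"
LINE = re.compile(r"/\*[0-9a-fA-F]+\*/\s+(@!?\w+\s+)?([A-Z0-9_]+)")

def instruction_mix(sass: str) -> Counter:
    """Tally instruction categories and predicated instructions in a SASS listing."""
    mix = Counter()
    for line in sass.splitlines():
        match = LINE.search(line)
        if not match:
            continue  # skip blank lines and comments
        predicate, opcode = match.groups()
        if opcode in MEMORY_OPS:
            mix["memory"] += 1
        elif opcode in COMPUTE_OPS:
            mix["compute"] += 1
        elif opcode in CONTROL_OPS:
            mix["control"] += 1
        else:
            mix["other"] += 1
        if predicate:
            mix["predicated"] += 1
    return mix

sample_sass = """
/*0000*/  S2R R0, SR_TID.X
/*0010*/  ISETP.GE.AND P0, PT, R0, c[0x0][0xc], PT
/*0020*/  @P0 EXIT
/*0030*/  LDG.E R2, [R2]
/*0040*/  LDG.E R3, [R4]
/*0050*/  FADD R2, R2, R3
/*0060*/  STG.E [R4], R2
/*0070*/  EXIT
"""
print(dict(instruction_mix(sample_sass)))
# {'other': 2, 'control': 2, 'predicated': 1, 'memory': 3, 'compute': 1}
```

Here three memory instructions accompany a single floating-point add, which is the kind of ratio you would expect from a memory-bound kernel such as vector add.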

SASS Unavailable

If SASS output is not available, you may see this message:

nvdisasm required for SASS output

Solution: Ensure nvdisasm is installed and in your PATH. It’s included with the CUDA Toolkit.

# Verify nvdisasm
nvdisasm --version

PTX is always available when nvcc compiles successfully. SASS requires the additional nvdisasm tool.

Next Steps