Understanding PTX and SASS

The Compiler Explorer generates two types of assembly output. Understanding these helps you optimize your CUDA kernels at the lowest level.

PTX (Parallel Thread Execution)

PTX is NVIDIA’s virtual instruction set: an intermediate representation that sits between your CUDA source code and the GPU machine code.

What PTX Shows You

How your code maps to GPU operations
Virtual register declarations (physical allocation happens later, when PTX is compiled to SASS)
Memory access patterns (global, shared, local)
Control flow structure

PTX Characteristics

Aspect	Description
Architecture	Virtual; the same PTX can be compiled for different GPU generations
Registers	Virtual registers (effectively unlimited)
Readability	More readable than SASS
Optimization	Some optimizations applied, but not final

Example PTX

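// Simple vector add kernel (abridged: parameter loads, the bounds check,
// and address arithmetic are omitted, so %rd4, %rd6, and %rd8 are assumed
// to already hold the addresses of a[i], b[i], and c[i])
.visible .entry vectorAdd(
    .param .u64 vectorAdd_param_0,  // float* a
    .param .u64 vectorAdd_param_1,  // float* b
    .param .u64 vectorAdd_param_2,  // float* c
    .param .u32 vectorAdd_param_3   // int n
)
{
    .reg .pred %p<2>;
    .reg .f32 %f<4>;
    .reg .b32 %r<5>;
    .reg .b64 %rd<10>;

    // Calculate the global thread index: %r4 = %ctaid.x * %ntid.x + %tid.x
    mov.u32 %r1, %tid.x;
    mov.u32 %r2, %ctaid.x;
    mov.u32 %r3, %ntid.x;
    mad.lo.s32 %r4, %r2, %r3, %r1;

    // Load a[i] and b[i], add them, and store the result to c[i]
    ld.global.f32 %f1, [%rd4];
    ld.global.f32 %f2, [%rd6];
    add.f32 %f3, %f1, %f2;
    st.global.f32 [%rd8], %f3;
    ret;
}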
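The index calculation in the listing above can be mimicked outside the GPU. The following is a minimal Python sketch, not part of any NVIDIA tool; the helper names mad_lo_s32 and global_index are made up for illustration. It reproduces what mad.lo.s32 computes: the low 32 bits of the product plus the addend, interpreted as a signed 32-bit value.

```python
def mad_lo_s32(a: int, b: int, c: int) -> int:
    """Emulate PTX mad.lo.s32: low 32 bits of a * b, plus c, as signed 32-bit."""
    result = (a * b + c) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit pattern as a signed value.
    return result - (1 << 32) if result & 0x80000000 else result


def global_index(ctaid_x: int, ntid_x: int, tid_x: int) -> int:
    """Same arithmetic as blockIdx.x * blockDim.x + threadIdx.x in CUDA C++."""
    return mad_lo_s32(ctaid_x, ntid_x, tid_x)


if __name__ == "__main__":
    # Thread 5 of block 2, with 256 threads per block.
    print(global_index(2, 256, 5))  # 517
```

Thread 5 in block 2 therefore works on element 517 when each block has 256 threads.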

Key PTX Instructions

Instruction	Meaning
`ld.global`	Load from global memory
`st.global`	Store to global memory
`ld.shared`	Load from shared memory
`add.f32`	Single-precision floating-point add
`mul.f32`	Single-precision floating-point multiply
`fma.rn.f32`	Fused multiply-add
`mov`	Move/copy data
`mad`	Multiply-add (integer)

SASS (Shader Assembly)

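SASS is the actual machine code that runs on your specific GPU. It is generated from PTX by ptxas (which nvcc invokes at build time) or by the driver’s JIT compiler at load time; nvdisasm then disassembles the compiled binary into the readable listing shown here.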

What SASS Shows You

The actual instructions your GPU executes
Real register allocation (limited physical registers)
Architecture-specific optimizations
Final instruction selection and ordering (latencies are not listed; measure them with a profiler)

SASS Characteristics

Aspect	Description
Architecture	Specific to GPU generation (sm_90, sm_80, etc.)
Registers	Physical registers (limited per thread)
Readability	Less readable, more cryptic
Optimization	Final optimized code

Example SASS

/*0000*/  IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28]
/*0010*/  S2R R0, SR_TID.X
/*0020*/  S2R R3, SR_CTAID.X
/*0030*/  IMAD R0, R3, c[0x0][0x0], R0
/*0040*/  ISETP.GE.AND P0, PT, R0, c[0x0][0xc], PT
/*0050*/  @P0 EXIT
/*0060*/  IMAD.WIDE.U32 R2, R0, 0x4, c[0x0][0x10]
/*0070*/  IMAD.WIDE.U32 R4, R0, 0x4, c[0x0][0x18]
/*0080*/  LDG.E R2, [R2]
/*0090*/  LDG.E R3, [R4]
/*00a0*/  IMAD.WIDE.U32 R4, R0, 0x4, c[0x0][0x20]
/*00b0*/  FADD R2, R2, R3
/*00c0*/  STG.E [R4], R2
/*00d0*/  EXIT

Key SASS Instructions

Instruction	Meaning
`LDG`	Load from global memory
`STG`	Store to global memory
`LDS`	Load from shared memory
`STS`	Store to shared memory
`FADD`	Floating-point add
`FMUL`	Floating-point multiply
`FFMA`	Fused floating-point multiply-add
`IMAD`	Integer multiply-add
`S2R`	Special register read (thread ID, etc.)

Comparing PTX and SASS

Aspect	PTX	SASS
Level	Intermediate	Final
Portability	Cross-GPU	GPU-specific
Optimization	Partial	Complete
Use case	Understanding logic	Performance analysis

For performance optimization, SASS is more relevant because it shows what actually executes. PTX is useful for understanding the logical structure of your kernel.

Using the Side-by-Side View

The side-by-side view lets you compare different outputs:

Click the side-by-side icon in the results header
Select what to show in each panel (Source, PTX, or SASS)
Scroll through to see how source code maps to assembly

What to Look For

Memory Access Patterns

Look for coalesced vs. scattered memory accesses:

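LDG.E whose address is the thread index times the element size = coalesced (good): neighboring threads in a warp read neighboring addresses
LDG.E whose address comes from a strided or data-dependent index = scattered (potential issue)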
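To see why the stride between neighboring threads matters, this minimal Python sketch counts how many memory sectors a single warp touches for a given stride. The function name sectors_touched is hypothetical, and the 32-byte sector size, 4-byte element size, and aligned base address are assumptions chosen for illustration; it is a simplified model and not a substitute for profiling.

```python
def sectors_touched(stride_elems: int, elem_bytes: int = 4,
                    warp_size: int = 32, sector_bytes: int = 32) -> int:
    """Count distinct sectors touched when lane i reads element i * stride_elems."""
    addresses = [lane * stride_elems * elem_bytes for lane in range(warp_size)]
    return len({addr // sector_bytes for addr in addresses})


if __name__ == "__main__":
    print(sectors_touched(1))   # 4  (unit stride: 128 contiguous bytes)
    print(sectors_touched(2))   # 8  (every other element)
    print(sectors_touched(32))  # 32 (each lane lands in its own sector)
```

Under these assumptions, a unit-stride float load needs 4 sectors per warp, while a stride of 32 elements needs 32 sectors to deliver the same amount of useful data.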

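Register Usage

High register usage limits occupancy. The .reg declarations in PTX count virtual registers, so they do not give the final figure. Check the physical registers per thread in the SASS, or compile with nvcc -Xptxas -v to print the count, and compare it to your GPU’s register file size.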
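As a rough illustration of how register usage caps occupancy, the Python sketch below estimates how many warps can be resident on one SM when registers are the only limit. The helper name warps_limited_by_registers is hypothetical. The defaults (65,536 32-bit registers per SM and at most 64 resident warps) are assumptions that vary by architecture, and the model ignores register allocation granularity and other limits such as shared memory.

```python
def warps_limited_by_registers(regs_per_thread: int,
                               regs_per_sm: int = 65536,
                               max_warps_per_sm: int = 64,
                               warp_size: int = 32) -> int:
    """Estimate resident warps per SM when only the register file is limiting."""
    regs_per_warp = regs_per_thread * warp_size
    return min(max_warps_per_sm, regs_per_sm // regs_per_warp)


if __name__ == "__main__":
    for regs in (32, 64, 128):
        print(regs, warps_limited_by_registers(regs))
    # 32 64
    # 64 32
    # 128 16
```

With these assumed limits, doubling the registers per thread from 32 to 64 halves the number of warps that fit on the SM.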

Instruction Mix

Compare the number of memory instructions (LDG, STG, LDS, STS) with the number of arithmetic instructions (FADD, FMUL, FFMA, IMAD). A kernel dominated by memory instructions may be memory-bound, although only profiling can confirm it.
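One way to get a quick feel for the mix is to tally opcodes in a SASS listing. The sketch below is a hypothetical helper, not a feature of the Compiler Explorer; it assumes the listing format shown on this page (an address comment followed by an optional predicate and an opcode), and it classifies only the opcodes from the tables above, counting everything else as "other".

```python
import re
from collections import Counter

MEMORY_OPS = {"LDG", "STG", "LDS", "STS"}
COMPUTE_OPS = {"FADD", "FMUL", "FFMA", "IMAD"}


def instruction_mix(sass: str) -> Counter:
    """Tally memory, compute, and other instructions in a SASS listing."""
    mix = Counter()
    for line in sass.splitlines():
        # Drop the /*0000*/ address comment, then an optional @P0 / @!P0 predicate.
        text = re.sub(r"/\*.*?\*/", "", line).strip()
        text = re.sub(r"^@!?U?P\w+\s+", "", text)
        if not text:
            continue
        opcode = text.split()[0].split(".")[0]  # "LDG.E" -> "LDG"
        if opcode in MEMORY_OPS:
            mix["memory"] += 1
        elif opcode in COMPUTE_OPS:
            mix["compute"] += 1
        else:
            mix["other"] += 1
    return mix


SAMPLE = """
/*0040*/  @P0 EXIT
/*0070*/  LDG.E R2, [R2]
/*0080*/  LDG.E R3, [R4]
/*00a0*/  FADD R2, R2, R3
/*00b0*/  STG.E [R4], R2
/*00c0*/  EXIT
"""

if __name__ == "__main__":
    mix = instruction_mix(SAMPLE)
    print(mix["memory"], mix["compute"], mix["other"])  # 3 1 2
```

A raw count like this ignores how often each instruction executes inside loops, so treat it as a first hint rather than a measurement.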

Control Flow

Look for @P0 predicated instructions and branch instructions (BRA). When threads in the same warp take different paths, the warp executes each path in turn, so divergent control flow can hurt performance.
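A listing can also be scanned for these control-flow markers automatically. The Python sketch below is a hypothetical helper that assumes the listing format used on this page; the sample lines are invented for illustration, and the branch opcode set contains only BRA, which you may need to extend for your architecture.

```python
import re

BRANCH_OPS = {"BRA"}  # assumption: extend with other branch opcodes as needed


def control_flow_summary(sass: str) -> dict:
    """Count predicated instructions and branches in a SASS listing."""
    predicated = 0
    branches = 0
    for line in sass.splitlines():
        text = re.sub(r"/\*.*?\*/", "", line).strip()
        # Group 1: optional predicate such as "@P0 " or "@!P1 "; group 2: opcode.
        match = re.match(r"^(@!?U?P\w+\s+)?(\w+)", text)
        if not match:
            continue
        if match.group(1):
            predicated += 1
        if match.group(2) in BRANCH_OPS:
            branches += 1
    return {"predicated": predicated, "branches": branches}


SAMPLE = """
/*0040*/  @P0 EXIT
/*0050*/  @!P1 BRA 0x120
/*0060*/  FADD R2, R2, R3
/*0070*/  BRA 0x40
"""

if __name__ == "__main__":
    print(control_flow_summary(SAMPLE))  # {'predicated': 2, 'branches': 2}
```

A predicated or branching instruction only causes divergence when threads within one warp disagree on the condition, so these counts show where to look, not how costly the control flow is.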

SASS Unavailable

If SASS output is not available:

nvdisasm required for SASS output

Solution: Ensure nvdisasm is installed and in your PATH. It’s included with the CUDA Toolkit.

# Verify nvdisasm
nvdisasm --version

PTX is always available when nvcc compiles successfully. SASS requires the additional nvdisasm tool.

Next Steps

NCU Profiler

GPU Docs
